Chapter 6: Health Checks

Health Check Types

HTTP Check

# HTTP health check
check {
  id = "http-check"
  name = "HTTP Health Check"
  http = "http://localhost:8080/health"
  interval = "10s"
  timeout = "1s"
  # Consul enforces a minimum of 1 minute for this setting
  deregister_critical_service_after = "1m"
}

TCP Check

# TCP health check
check {
  id = "tcp-check"
  name = "TCP Health Check"
  tcp = "localhost:8080"
  interval = "10s"
  timeout = "1s"
}

TTL Check

# TTL health check
check {
  id = "ttl-check"
  name = "TTL Health Check"
  ttl = "30s"
  # Consul enforces a minimum of 1 minute for this setting
  deregister_critical_service_after = "1m"
}

gRPC Check

# gRPC health check
check {
  id = "grpc-check"
  name = "gRPC Health Check"
  grpc = "localhost:50051"
  grpc_use_tls = false
  interval = "10s"
}

Docker Check

# Docker health check
check {
  id = "docker-check"
  name = "Docker Health Check"
  docker_container_id = "container-id"
  shell = "/bin/bash"
  # The command to run inside the container is passed via args
  args = ["/health-check.sh"]
  interval = "10s"
}

Service Health Checks

Configuration File

# service.hcl
service {
  name = "user-service"
  id = "user-service-1"
  address = "192.168.1.10"
  port = 8080

  check {
    id = "user-service-http"
    name = "HTTP Health Check"
    http = "http://192.168.1.10:8080/health"
    interval = "10s"
    timeout = "1s"
  }

  check {
    id = "user-service-ttl"
    name = "TTL Health Check"
    ttl = "30s"
  }
}

Registering via the HTTP API

# Register a service with a health check
curl -X PUT http://localhost:8500/v1/agent/service/register -d '{
  "Name": "user-service",
  "ID": "user-service-1",
  "Address": "192.168.1.10",
  "Port": 8080,
  "Check": {
    "HTTP": "http://192.168.1.10:8080/health",
    "Interval": "10s",
    "Timeout": "1s"
  }
}'
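
The same registration body can also be built from application code. The following is a minimal sketch using only the Python standard library; `build_registration` and `register` are hypothetical helper names, and the agent address is assumed to be the local default used in the curl example. Only the payload is built when the snippet runs; `register` needs a reachable agent.

```python
import json
import urllib.request

CONSUL_ADDR = "http://localhost:8500"  # assumption: local agent, as in the curl example


def build_registration(name, service_id, address, port,
                       health_path="/health", interval="10s", timeout="1s"):
    """Build the JSON body for PUT /v1/agent/service/register."""
    return {
        "Name": name,
        "ID": service_id,
        "Address": address,
        "Port": port,
        "Check": {
            "HTTP": f"http://{address}:{port}{health_path}",
            "Interval": interval,
            "Timeout": timeout,
        },
    }


def register(payload):
    """Send the registration to the agent (hypothetical helper; needs a running agent)."""
    req = urllib.request.Request(
        f"{CONSUL_ADDR}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.status


payload = build_registration("user-service", "user-service-1", "192.168.1.10", 8080)
print(json.dumps(payload, indent=2))
```

Deriving the check URL from the service address and port keeps the two from drifting apart when an instance is moved.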

Using TTL Checks

# Register a TTL check
curl -X PUT http://localhost:8500/v1/agent/service/register -d '{
  "Name": "user-service",
  "ID": "user-service-1",
  "Check": {
    "TTL": "30s"
  }
}'

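# Send heartbeats periodically (a single check without an explicit ID gets the ID service:<service-id>)
curl -X PUT http://localhost:8500/v1/agent/check/pass/service:user-service-1

# Mark the check as failed
curl -X PUT http://localhost:8500/v1/agent/check/fail/service:user-service-1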
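
In practice the heartbeat is sent by the service itself rather than by hand. The sketch below shows one way this could look in Python with the standard library. It assumes the check ID is `service:user-service-1` (the ID Consul assigns when a service registers a single check without an explicit ID) and that refreshing at half the TTL is an acceptable policy. `ttl_update_url`, `send_heartbeat`, and `heartbeat_period` are hypothetical helper names.

```python
import urllib.request
from urllib.parse import quote

CONSUL_ADDR = "http://localhost:8500"  # assumption: local agent


def ttl_update_url(check_id, status="pass"):
    """Build the agent URL that sets a TTL check to pass, warn, or fail."""
    if status not in ("pass", "warn", "fail"):
        raise ValueError(f"unsupported status: {status}")
    # Keep ':' readable; check IDs such as service:<id> contain it
    return f"{CONSUL_ADDR}/v1/agent/check/{status}/{quote(check_id, safe=':')}"


def send_heartbeat(check_id, status="pass", timeout=2.0):
    """PUT the update to the agent (hypothetical helper; needs a running agent)."""
    req = urllib.request.Request(ttl_update_url(check_id, status), method="PUT")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def heartbeat_period(ttl_seconds):
    """Assumed policy: refresh at half the TTL so one late beat is tolerated."""
    return ttl_seconds / 2


print(ttl_update_url("service:user-service-1"))
print(heartbeat_period(30))
```

A background thread or scheduler would call `send_heartbeat` every `heartbeat_period(30)` seconds. If the process hangs, the heartbeats stop and the check turns critical once the TTL expires.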

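Querying Health Status

HTTP API

# List all checks registered with the local agent
curl http://localhost:8500/v1/agent/checks

# Query the health of a specific service
curl http://localhost:8500/v1/health/service/user-service

# Return only instances whose checks are all passing
curl 'http://localhost:8500/v1/health/service/user-service?passing'

# Query the checks of a node
curl http://localhost:8500/v1/health/node/node1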
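
A client that calls `/v1/health/service/<name>` without `?passing` has to filter the result itself. The following sketch assumes the response is a JSON list whose entries carry `Node`, `Service`, and `Checks` objects, each check having a `Status` field. The sample data is hand-written to mirror that shape and is not real agent output. `passing_instances` is a hypothetical helper name.

```python
def passing_instances(entries):
    """Return (address, port) for every instance whose checks all pass."""
    healthy = []
    for entry in entries:
        checks = entry.get("Checks", [])
        if all(c.get("Status") == "passing" for c in checks):
            svc = entry["Service"]
            # Fall back to the node address when the service has none of its own
            address = svc.get("Address") or entry["Node"]["Address"]
            healthy.append((address, svc["Port"]))
    return healthy


# Hand-written sample mirroring the response shape (not real output)
sample = [
    {
        "Node": {"Node": "node1", "Address": "192.168.1.10"},
        "Service": {"ID": "user-service-1", "Address": "192.168.1.10", "Port": 8080},
        "Checks": [{"CheckID": "service:user-service-1", "Status": "passing"}],
    },
    {
        "Node": {"Node": "node2", "Address": "192.168.1.11"},
        "Service": {"ID": "user-service-2", "Address": "", "Port": 8080},
        "Checks": [{"CheckID": "service:user-service-2", "Status": "critical"}],
    },
]

print(passing_instances(sample))
```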

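CLI Queries

# List all registered services (the catalog commands do not show check status)
consul catalog services

# List services together with their tags
consul catalog services -tags

# Check service health (via the HTTP API, since the catalog commands above do not report it)
curl -s 'http://localhost:8500/v1/health/service/user-service?passing' | jq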

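Health Check Configuration

Check Parameters

check {
  # Check interval
  interval = "10s"

  # Timeout
  timeout = "1s"

  # Initial status of the check (this sets a status, not a delay)
  # status = "critical"

  # How long a service may stay critical before it is deregistered (minimum 1m)
  deregister_critical_service_after = "1m"

  # Success threshold
  success_before_passing = 1

  # Failure threshold
  failures_before_critical = 3
}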

Service Definition

service {
  name = "user-service"

  # Multiple health checks
  check {
    id = "http-check"
    http = "http://localhost:8080/health"
    interval = "10s"
  }

  check {
    id = "tcp-check"
    tcp = "localhost:8080"
    interval = "5s"
  }
}
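
When a service carries several checks, it is only as healthy as its worst check. The short sketch below illustrates that rule under the assumption that the statuses are ordered passing < warning < critical and that a service without checks counts as passing. `aggregate_status` is a hypothetical helper, not a Consul API.

```python
# Assumed severity order: higher number means worse
SEVERITY = {"passing": 0, "warning": 1, "critical": 2}


def aggregate_status(statuses):
    """Return the worst status in the list; an empty list counts as passing."""
    if not statuses:
        return "passing"
    return max(statuses, key=lambda s: SEVERITY[s])


# http-check passes but tcp-check fails: the service is critical overall
print(aggregate_status(["passing", "critical"]))
```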

Health Check Best Practices

1. Designing the Check Endpoint

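# Assumes Flask; the three helpers below are placeholders to replace
from flask import Flask

app = Flask(__name__)

def check_database():
    # Placeholder: replace with a real connectivity check
    return True

def check_redis():
    # Placeholder: replace with a real cache connectivity check
    return True

def is_ready():
    # Placeholder: return True once startup work has finished
    return True

# /health endpoint
@app.route('/health')
def health():
    # Check the database connection
    if not check_database():
        return 'Database unavailable', 503

    # Check the cache connection
    if not check_redis():
        return 'Redis unavailable', 503

    return 'OK', 200

# /ready endpoint (Kubernetes)
@app.route('/ready')
def ready():
    # Check whether the service is ready to receive traffic
    if not is_ready():
        return 'Not ready', 503
    return 'OK', 200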

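2. Check Interval Settings

Recommended configuration:

- HTTP checks: 10-30 second interval
- TCP checks: 5-10 second interval
- TTL checks: 15-30 second TTL

Notes:
- Too short an interval adds load to both the agent and the service being checked
- Too long an interval delays failure detection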
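
The trade-off can be made concrete with a rough estimate of how long a failure may go unnoticed. The sketch below is a back-of-the-envelope calculation, not Consul's exact behavior. It assumes the check turns critical on the N-th consecutive failure, that a hanging check counts as failed once the timeout elapses, and it ignores scheduling jitter and propagation delay. `worst_case_detection_seconds` is a hypothetical helper name.

```python
def worst_case_detection_seconds(interval, timeout, failures_before_critical=1):
    """Rough upper bound on the time between a failure and the critical status.

    Worst case: the service fails right after a successful check, so up to one
    full interval passes before the first failing check, then N-1 further
    intervals, plus the timeout of the last check.
    """
    n = max(failures_before_critical, 1)
    return n * interval + timeout


# 10s interval, 1s timeout, 3 consecutive failures required
print(worst_case_detection_seconds(10, 1, 3))
```

Under these assumptions, halving the interval roughly halves the detection time, at the cost of twice as many requests to the health endpoint.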

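3. Failure Handling

# Configure failure thresholds
check {
  http = "http://localhost:8080/health"
  interval = "10s"

  # Mark the check critical only after 3 consecutive failures
  failures_before_critical = 3

  # Mark the check passing after a single success
  success_before_passing = 1

  # Deregister the service after it has been critical for 1 minute
  # (Consul raises shorter values to its 1-minute minimum)
  deregister_critical_service_after = "1m"
}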
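
The effect of the two thresholds can be illustrated with a small simulation. This is a simplified model of the behavior described in the comments above (consecutive counters that reset on the opposite result), not Consul's actual implementation. `ThresholdTracker` is a hypothetical class.

```python
class ThresholdTracker:
    """Simplified model of failures_before_critical / success_before_passing."""

    def __init__(self, failures_before_critical=3, success_before_passing=1,
                 status="passing"):
        self.failures_before_critical = failures_before_critical
        self.success_before_passing = success_before_passing
        self.status = status
        self.failures = 0
        self.successes = 0

    def observe(self, ok):
        """Record one check result and return the resulting status."""
        if ok:
            self.successes += 1
            self.failures = 0  # a success breaks the failure streak
            if self.successes >= self.success_before_passing:
                self.status = "passing"
        else:
            self.failures += 1
            self.successes = 0  # a failure breaks the success streak
            if self.failures >= self.failures_before_critical:
                self.status = "critical"
        return self.status


tracker = ThresholdTracker()
history = [tracker.observe(ok) for ok in [True, False, False, False, True]]
print(history)
```

In this model two isolated failures leave the status unchanged. Only the third consecutive failure flips it to critical, and a single success restores it.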

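Summary

Key points about health checks:

  • Check types: HTTP, TCP, TTL, gRPC, Docker
  • Service checks: configuration file, HTTP API
  • Querying status: HTTP API, CLI
  • Configuration parameters: interval, timeout, thresholds
  • Best practices: endpoint design, interval settings, failure handling

The next chapter covers cluster deployment.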