Chapter 8: Production Practices
Performance Optimization
Configuration Tuning
# consul.hcl
# Performance settings
performance {
  raft_multiplier = 1 # Raft timeout multiplier (1 = tightest timing)
}
# Log level
log_level = "WARN"
# Check settings
check_update_interval = "5m"
# Per-client connection limits
limits {
  http_max_conns_per_client = 200
  rpc_max_conns_per_client = 100
}
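To make the effect of `raft_multiplier` concrete, the sketch below scales Raft's base timings by the multiplier. The base values (1000 ms heartbeat and election timeouts, 500 ms leader lease) and the 1 to 10 range are assumptions based on Consul's server performance documentation, and `scaled_raft_timeouts` is a hypothetical helper, not part of any Consul API.

```python
# Base Raft timings in milliseconds at raft_multiplier = 1.
# Assumption: values as described in Consul's server performance docs.
BASE_TIMEOUTS_MS = {
    "heartbeat_timeout": 1000,
    "election_timeout": 1000,
    "leader_lease_timeout": 500,
}


def scaled_raft_timeouts(raft_multiplier: int) -> dict:
    """Return the Raft timings scaled by raft_multiplier (hypothetical helper)."""
    if not 1 <= raft_multiplier <= 10:
        raise ValueError("raft_multiplier must be between 1 and 10")
    return {name: base * raft_multiplier for name, base in BASE_TIMEOUTS_MS.items()}


# Compare the tuned setting (1) with a looser setting (5).
for multiplier in (1, 5):
    print(multiplier, scaled_raft_timeouts(multiplier))
```

A lower multiplier lets the cluster detect a failed leader sooner, but it also makes elections more sensitive to CPU stalls and network jitter, so it is best suited to well-provisioned servers.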
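Resource Configuration
# Kernel parameters
# /etc/sysctl.conf
net.core.somaxconn = 32768
net.ipv4.tcp_max_syn_backlog = 32768
net.ipv4.ip_local_port_range = 1024 65535
# Per-process limits
# /etc/security/limits.conf
# Note: services started by systemd do not read limits.conf; set LimitNOFILE= in the unit file instead
* soft nofile 65536
* hard nofile 65536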
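After changing the limits, it helps to confirm that they actually took effect. The following is a minimal sketch that checks the open-file limit of the current Python process against the 65536 value used above; `nofile_sufficient` is a hypothetical helper. It does not inspect the Consul process itself, whose limits can be read from `/proc/<pid>/limits` on Linux.

```python
import resource

REQUIRED_NOFILE = 65536  # matches the nofile value configured above


def nofile_sufficient(soft: int, hard: int, required: int = REQUIRED_NOFILE) -> bool:
    """True when both limits meet the requirement (RLIM_INFINITY counts as enough)."""

    def enough(limit: int) -> bool:
        return limit == resource.RLIM_INFINITY or limit >= required

    return enough(soft) and enough(hard)


# Limits of the current process, as inherited from the shell or service manager.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"nofile soft={soft} hard={hard} ok={nofile_sufficient(soft, hard)}")
```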
Monitoring and Alerting
Prometheus Monitoring
# prometheus.yml
scrape_configs:
  - job_name: 'consul'
    metrics_path: /v1/agent/metrics
    params:
      format: ['prometheus']
    static_configs:
      - targets:
          - 'node1:8500'
          - 'node2:8500'
          - 'node3:8500'
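With this scrape configuration, Prometheus requests `/v1/agent/metrics?format=prometheus` from each node and receives plain-text samples. The sketch below parses that text format so a scrape can be checked by hand. It is simplified (no timestamps, no whitespace inside label values), `parse_prometheus_text` is a hypothetical helper, and the sample text is illustrative rather than captured from a real agent. As far as the Consul documentation describes, the agent also needs `prometheus_retention_time` set in its `telemetry` block before it serves this format.

```python
def parse_prometheus_text(text: str) -> dict:
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified: assumes no timestamps and no whitespace inside label values.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        parts = line.rsplit(None, 1)  # split off the trailing value
        if len(parts) != 2:
            continue
        series, value = parts
        samples[series] = float(value)
    return samples


# Illustrative sample, not captured from a real agent.
SAMPLE = """\
# HELP consul_raft_peers Number of Raft peers
# TYPE consul_raft_peers gauge
consul_raft_peers 3
consul_client_rpc_failed 0
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics)
```

Against a live agent, the text could be fetched with `urllib.request.urlopen("http://node1:8500/v1/agent/metrics?format=prometheus")` and passed to the same function.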
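Key Metrics
| Metric | Description | Alert threshold |
|---|---|---|
| consul_raft_leader | Whether the cluster has a leader (1 = yes, 0 = no) | == 0 |
| consul_raft_peers | Number of Raft peers in the cluster | < 3 |
| consul_raft_commit_index | Raft commit index | Stalled (not advancing) |
| consul_raft_apply | Number of Raft apply operations | Abnormal drop |
| consul_client_rpc | Number of RPC calls | Abnormal change |
| consul_client_rpc_failed | Number of failed RPC calls | > 0 |
| consul_health_service_status | Service health status | critical |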
Alert Rules
groups:
  - name: consul-alerts
    rules:
      - alert: ConsulNoLeader
        expr: consul_raft_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Consul has no leader"
      - alert: ConsulRaftPeerLoss
        expr: consul_raft_peers < 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Consul cluster has too few peers"
      - alert: ConsulServiceCritical
        expr: consul_health_service_status{status="critical"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Service health check is failing"
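The `for:` clause in these rules means an alert fires only after its expression has stayed true for the whole duration; a single healthy sample resets the timer. The sketch below is a simplified model of that behavior, not Prometheus's actual evaluator, and `alert_fires` is a hypothetical helper.

```python
def alert_fires(samples, condition, for_seconds):
    """Return True if condition has held without interruption for for_seconds.

    samples: list of (unix_time, value) pairs sorted by time.
    Simplified model of a Prometheus rule's `for:` clause.
    """
    if not samples:
        return False
    pending_since = None  # time at which the condition last became true
    for timestamp, value in samples:
        if condition(value):
            if pending_since is None:
                pending_since = timestamp
        else:
            pending_since = None  # any false sample resets the pending state
    if pending_since is None:
        return False
    return samples[-1][0] - pending_since >= for_seconds


def no_leader(value):
    """Condition used by the ConsulNoLeader rule: consul_raft_leader == 0."""
    return value == 0


# Leader missing for a full minute: the alert fires.
print(alert_fires([(0, 0), (30, 0), (60, 0)], no_leader, 60))
# Leader briefly returned at t=30: the timer restarts and nothing fires.
print(alert_fires([(0, 0), (30, 1), (60, 0)], no_leader, 60))
```

This is why `ConsulNoLeader` with `for: 1m` tolerates a short election, while `ConsulServiceCritical` with `for: 5m` ignores brief health check flaps.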
Security Hardening
TLS Configuration
# consul.hcl
# Enable TLS
ca_file = "/etc/consul/tls/ca.crt"
cert_file = "/etc/consul/tls/server.crt"
key_file = "/etc/consul/tls/server.key"
verify_incoming = true
verify_outgoing = true
verify_server_hostname = true
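Once TLS is enforced, API clients need the CA certificate and, because of `verify_incoming = true`, a client certificate of their own. The following is a minimal client-side sketch using Python's standard `ssl` module; `consul_tls_context` is a hypothetical helper, and it assumes the agent's HTTPS listener has been enabled separately, which the configuration above does not show.

```python
import ssl


def consul_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side TLS context for a TLS-enabled Consul API.

    ca_file plays the role of `ca_file` in consul.hcl; cert_file/key_file are a
    client certificate pair, needed when the server sets verify_incoming = true.
    """
    context = ssl.create_default_context(cafile=ca_file)
    if cert_file and key_file:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


# With no files given, the context falls back to the system CA store.
context = consul_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

The resulting context can be passed to `urllib.request.urlopen(url, context=context)`. Certificate and hostname verification stay on by default, which is the client-side counterpart of `verify_outgoing` and `verify_server_hostname`.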
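ACL Configuration
# consul.hcl
# Enable ACLs
acl {
  enabled = true
  default_policy = "deny"
  enable_token_persistence = true
  tokens {
    # Placeholder values; use generated secrets in production
    master = "master-token" # named initial_management in Consul 1.11+
    agent = "agent-token"
  }
}
# Create a policy
# consul acl policy create -name "service-read" -rules @policy.hcl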
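With `default_policy = "deny"`, every API call must carry a token or it is rejected. The sketch below shows one way to attach a token using the `X-Consul-Token` HTTP header. `consul_request` is a hypothetical helper, the token value is the placeholder from the configuration above, and the request is only built, not sent.

```python
import urllib.request

CONSUL_ADDR = "http://localhost:8500"  # address used elsewhere in this chapter


def consul_request(path: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a Consul API request carrying an ACL token."""
    return urllib.request.Request(
        CONSUL_ADDR + path,
        headers={"X-Consul-Token": token},
    )


request = consul_request("/v1/agent/checks", "agent-token")  # placeholder token
print(request.full_url)
```

Passing the request to `urllib.request.urlopen(request)` would send it. Keeping the token in a header rather than a query parameter keeps it out of access logs and shell history.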
Policy Definition
# policy.hcl
# Read access to all services
service_prefix "" {
  policy = "read"
}
# Read access to all nodes
node_prefix "" {
  policy = "read"
}
# Write access to KV keys under config/
key_prefix "config/" {
  policy = "write"
}
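To see how a `key_prefix` rule combines with `default_policy = "deny"`, the sketch below resolves the policy for a KV key by longest-prefix match. It is a simplified model of Consul's rule resolution: exact-match `key` rules and other resource types are ignored, and `key_policy` is a hypothetical helper.

```python
def key_policy(key: str, prefix_rules: dict, default_policy: str = "deny") -> str:
    """Return the policy for a KV key using longest-prefix matching.

    prefix_rules maps a key_prefix to its policy, e.g. {"config/": "write"}.
    Simplified: exact-match `key` rules and other resource types are ignored.
    """
    matches = [prefix for prefix in prefix_rules if key.startswith(prefix)]
    if not matches:
        return default_policy
    return prefix_rules[max(matches, key=len)]


RULES = {"config/": "write"}  # the key_prefix rule from policy.hcl

print(key_policy("config/app/db_url", RULES))  # covered by "config/"
print(key_policy("secrets/api_key", RULES))    # no rule, falls back to the default
```

An empty prefix such as `service_prefix ""` matches every name, which is why the service and node rules above grant read access across the whole catalog.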
Troubleshooting
Common Issues
1. Leader Election Failure
# Check Raft state
consul operator raft list-peers
# Check network connectivity (8300 is the server RPC port)
ping <server-ip>
telnet <server-ip> 8300
# Follow the logs
journalctl -u consul -f
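A leader can only be elected while a majority (quorum) of voting servers is reachable, so the first thing to read from `consul operator raft list-peers` is how many voters remain. The sketch below summarizes that output. It assumes the first five whitespace-separated columns are Node, ID, Address, State, and Voter; `raft_summary` is a hypothetical helper, and the sample output uses made-up IDs and addresses.

```python
def raft_summary(list_peers_output: str) -> dict:
    """Summarize `consul operator raft list-peers` output.

    Assumes the first five whitespace-separated columns are
    Node, ID, Address, State, Voter; later columns are ignored.
    """
    lines = list_peers_output.strip().splitlines()[1:]  # drop the header row
    rows = [line.split() for line in lines]
    rows = [row for row in rows if len(row) >= 5]
    voters = [row for row in rows if row[4] == "true"]
    leaders = [row[0] for row in voters if row[3] == "leader"]
    return {
        "voters": len(voters),
        "quorum": len(voters) // 2 + 1,  # majority needed to elect a leader
        "leader": leaders[0] if leaders else None,
    }


# Illustrative output with made-up IDs and addresses.
SAMPLE = """\
Node   ID    Address        State     Voter  RaftProtocol
node1  id-1  10.0.0.1:8300  leader    true   3
node2  id-2  10.0.0.2:8300  follower  true   3
node3  id-3  10.0.0.3:8300  follower  true   3
"""

summary = raft_summary(SAMPLE)
print(summary)
```

With three voters the quorum is two, so the cluster survives the loss of one server but not two. When no leader exists, the command itself may fail; its `-stale` flag can let a follower answer instead.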
2. Service Registration Failure
# Check agent membership
consul members
# Check registered services
consul catalog services
# Check health check status
curl http://localhost:8500/v1/agent/checks
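When a registration fails, comparing the payload against a known-good one often finds the problem quickly. The following sketch builds a minimal service definition with an HTTP health check for the agent's `/v1/agent/service/register` endpoint. `service_definition` is a hypothetical helper, and the service name, port, health path, and check timings are placeholder values.

```python
import json


def service_definition(name: str, port: int, health_path: str = "/health") -> dict:
    """Build a payload for PUT /v1/agent/service/register (hypothetical helper)."""
    return {
        "Name": name,
        "ID": f"{name}-{port}",  # must be unique per agent
        "Port": port,
        "Check": {
            "HTTP": f"http://localhost:{port}{health_path}",
            "Interval": "10s",
            "Timeout": "2s",
        },
    }


payload = service_definition("web", 8080)  # "web" and 8080 are placeholders
print(json.dumps(payload, indent=2))
```

The payload could be sent with `urllib.request.Request(url, data=json.dumps(payload).encode(), method="PUT")`. If ACLs are enabled, the request also needs a token whose policy grants `write` on that service name.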
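3. Performance Issues
# Check resource usage
top
iostat -x 1
# Check Raft peer state (for network latency, `consul rtt <node>` gives a round-trip estimate)
consul operator raft list-peers
# Tuning options
# Reduce health check frequency
# Add resources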
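Failure Recovery
# 1. Back up data (requires a running cluster with a leader)
consul snapshot save backup.snap
# 2. Stop the service
systemctl stop consul
# 3. Restart the service (snapshot restore goes through the API, so the agent must be running)
systemctl start consul
# 4. Restore data
consul snapshot restore backup.snap
# 5. Verify status
consul members
consul operator raft list-peers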
Best Practices
1. Deployment Architecture
2. Service Registration
3. Configuration Management
4. Monitoring and Operations
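Summary
Key points for production practice:
- Performance optimization: configuration tuning, resource configuration
- Monitoring and alerting: Prometheus, key metrics, alert rules
- Security hardening: TLS, ACLs, policy definitions
- Troubleshooting: common issues, failure recovery
- Best practices: deployment architecture, service registration, configuration management, monitoring and operations
After completing this tutorial, you should be able to deploy and manage a Consul cluster in a production environment.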