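Chapter 6: Fault Injection and Recovery
Testing how a system behaves under failure is an important part of keeping a service reliable. Istio provides both fault injection and failure recovery features.
Fault Injection
Fault injection deliberately introduces failures to test how well the system tolerates them.
HTTP Delay Injection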
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - fault:
      delay:
        percentage:
          value: 100      # delay every request
        fixedDelay: 7s
    route:
    - destination:
        host: reviews
        subset: v1
HTTP Abort Injection
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - fault:
      abort:
        percentage:
          value: 50       # abort half of the requests
        httpStatus: 500
    route:
    - destination:
        host: reviews
        subset: v1
Combined Faults
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - fault:
      delay:
        percentage:
          value: 100      # delay every request by 5s
        fixedDelay: 5s
      abort:
        percentage:
          value: 10       # additionally abort 10% of requests
        httpStatus: 500
    route:
    - destination:
        host: reviews
        subset: v1
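To get a feel for what the combined rule does to traffic, the following is a rough simulation rather than Envoy's actual implementation. It assumes the delay and abort percentages are drawn independently for each request and that an aborted request still experiences the delay first. The function name `simulate_faults` and the use of status 200 for a normal reply are illustrative assumptions.

```python
import random


def simulate_faults(n_requests, delay_pct, delay_s, abort_pct, abort_status, seed=0):
    """Toy model: each request draws independently for delay and for abort."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    outcomes = []
    for _ in range(n_requests):
        delayed = rng.random() * 100 < delay_pct
        aborted = rng.random() * 100 < abort_pct
        latency = delay_s if delayed else 0.0
        # 200 stands in for whatever the upstream would normally return (assumption).
        status = abort_status if aborted else 200
        outcomes.append((latency, status))
    return outcomes


# Same numbers as the combined-fault rule: 100% delayed by 5s, 10% aborted with 500.
outcomes = simulate_faults(10_000, delay_pct=100, delay_s=5.0, abort_pct=10, abort_status=500)
delayed_share = sum(1 for latency, _ in outcomes if latency > 0) / len(outcomes)
aborted_share = sum(1 for _, status in outcomes if status == 500) / len(outcomes)
print(f"delayed: {delayed_share:.1%}, aborted: {aborted_share:.1%}")
```

Under these assumptions every caller waits 5s, and roughly one in ten also receives a 500, so client-side timeouts shorter than 5s will trigger before the abort is ever observed.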
Injecting Faults for a Specific User
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  # Requests carrying the test header get the delay
  - match:
    - headers:
        x-test-user:
          exact: "chaos-test"
    fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 7s
    route:
    - destination:
        host: reviews
        subset: v1
  # All other requests are routed normally
  - route:
    - destination:
        host: reviews
        subset: v1
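The rule above relies on HTTP routes being evaluated in order, with the first matching entry winning. The sketch below models only that ordering and the exact header match; `pick_rule`, the rule names, and the `exact_headers` key are hypothetical helpers for illustration, not part of any Istio API.

```python
def pick_rule(request_headers, rules):
    """Return the name of the first rule whose exact header matches are all satisfied."""
    # HTTP header names are case-insensitive, so normalize them to lowercase.
    headers = {name.lower(): value for name, value in request_headers.items()}
    for rule in rules:
        wanted = rule.get("exact_headers", {})
        if all(headers.get(name) == value for name, value in wanted.items()):
            return rule["name"]
    return None


RULES = [
    {"name": "delay-7s", "exact_headers": {"x-test-user": "chaos-test"}},
    {"name": "default"},  # no match block: catches everything else
]

print(pick_rule({"X-Test-User": "chaos-test"}, RULES))  # delay-7s
print(pick_rule({"x-test-user": "alice"}, RULES))       # default
print(pick_rule({}, RULES))                             # default
```

Because the catch-all route comes last, only requests with the test header see the fault; swapping the two entries would make the fault rule unreachable.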
Failure Recovery
Timeouts
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2
    timeout: 10s          # overall timeout for the request, including retries
Retry Policy
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: gateway-error,connect-failure,refused-stream,503
Common values for retryOn:
- 5xx: any 5xx response
- gateway-error: 502, 503, or 504
- reset: the connection was reset
- connect-failure: the connection attempt failed
- envoy-ratelimited: the request was rate limited
- retriable-status-codes: custom status codes
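As a rough illustration of how the status-based conditions combine, the sketch below checks a response status against a comma-separated retryOn string. It is a simplified model, not Envoy's implementation: it covers only `5xx`, `gateway-error`, and explicit numeric codes, and ignores connection-level conditions such as `connect-failure` or `reset`. The function name `status_is_retriable` is an assumption made for this example.

```python
GATEWAY_ERRORS = {502, 503, 504}


def status_is_retriable(status, retry_on):
    """Check a response status against the status-based parts of a retryOn string."""
    for policy in (p.strip() for p in retry_on.split(",")):
        if policy == "5xx" and 500 <= status <= 599:
            return True
        if policy == "gateway-error" and status in GATEWAY_ERRORS:
            return True
        # A bare number is treated as an explicit status code to retry on.
        if policy.isdigit() and int(policy) == status:
            return True
    return False


RETRY_ON = "gateway-error,connect-failure,refused-stream,503"
for code in (200, 500, 503, 504):
    print(code, status_is_retriable(code, RETRY_ON))
```

With the policy string used in this chapter, a plain 500 is not retried while 502, 503, and 504 are, which is one reason to prefer `gateway-error` over `5xx` when application errors should surface immediately.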
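Circuit Breaker
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 5s
        tcpKeepalive:
          time: 7200s
          interval: 75s
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 25
Circuit breaker parameters:
| Parameter | Description |
|---|---|
| consecutive5xxErrors | Number of consecutive 5xx errors after which a host is ejected |
| interval | Time between ejection analysis sweeps |
| baseEjectionTime | Base ejection duration; the actual duration grows with the number of times the host has been ejected |
| maxEjectionPercent | Maximum percentage of hosts in the pool that may be ejected |
| minHealthPercent | Minimum percentage of healthy hosts; below this threshold outlier detection is disabled and traffic is balanced across all hosts |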
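Two of these parameters are easier to reason about with numbers. The sketch below is a simplified model and not taken from the proxy's source: it assumes an ejected host stays out for the base ejection time multiplied by the number of times it has been ejected, up to a cap, and that outlier detection only stays active while the healthy share of hosts is at least the minimum health percentage. The 300-second cap and both function names are assumptions for illustration.

```python
def ejection_seconds(base_ejection_s, times_ejected, max_ejection_s=300.0):
    """Ejection duration grows linearly with repeat ejections, up to an assumed cap."""
    return min(base_ejection_s * times_ejected, max_ejection_s)


def outlier_detection_active(healthy_hosts, total_hosts, min_health_percent):
    """True while the healthy share of the pool is at or above the threshold."""
    return healthy_hosts * 100 >= min_health_percent * total_hosts


# baseEjectionTime: 30s, as in the DestinationRule above.
for n in (1, 3, 20):
    print(f"ejection #{n}: {ejection_seconds(30, n):.0f}s")

# minHealthPercent: 25 on a pool of 10 hosts.
print(outlier_detection_active(3, 10, 25))  # True: 30% healthy
print(outlier_detection_active(2, 10, 25))  # False: 20% healthy
```

In this model a host that keeps failing is kept out for progressively longer, while the health threshold acts as a safety valve that stops ejections from starving the pool.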
Connection Pool Configuration
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 5s
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
Failure Recovery Best Practices
1. Timeout Guidelines
# timeout = service processing time + network latency + buffer
# Rule of thumb: 2-3x the P99 latency
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
    timeout: 30s          # assuming a P99 latency of 10s
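The 2-3x rule of thumb can be turned into a small calculation. The sketch below derives P99 from latency samples with the nearest-rank method and returns the suggested timeout range. The sample data and the function names are made up for illustration; a real measurement would come from your metrics system.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def suggested_timeout_range(samples):
    """Return (low, high) as 2x and 3x the P99 latency, in the same unit as the samples."""
    p99 = percentile_nearest_rank(samples, 99)
    return 2 * p99, 3 * p99


# Illustrative latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
low, high = suggested_timeout_range(latencies_ms)
print(f"P99 = {percentile_nearest_rank(latencies_ms, 99)} ms, timeout between {low} and {high} ms")
```

For the chapter's example of a 10s P99, the same calculation gives 20s to 30s, which is where the 30s value comes from.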
2. Retry Guidelines
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
    retries:
      attempts: 3
      perTryTimeout: 5s
      retryOn: gateway-error,connect-failure,refused-stream
Notes:
- Keep the number of retries low (3 or fewer is recommended).
- Set a reasonable perTryTimeout.
- Avoid retrying non-idempotent operations.
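Retries multiply the time a caller may wait, so it helps to compute the worst case before choosing values. The sketch below assumes `attempts` counts retries on top of the original request, ignores the short backoff between tries, and treats an optional route-level timeout as a hard cap. The function name `worst_case_seconds` is an assumption for this example.

```python
def worst_case_seconds(attempts, per_try_timeout_s, route_timeout_s=None):
    """Upper bound on time spent when every try runs into its per-try timeout."""
    tries = 1 + attempts  # the original request plus up to `attempts` retries
    total = tries * per_try_timeout_s
    if route_timeout_s is not None:
        # The overall route timeout cuts the sequence short.
        total = min(total, route_timeout_s)
    return total


# attempts: 3, perTryTimeout: 5s, as in the example above.
print(worst_case_seconds(3, 5))                      # 20
print(worst_case_seconds(3, 5, route_timeout_s=30))  # 20
print(worst_case_seconds(3, 5, route_timeout_s=10))  # 10
```

If the route timeout is smaller than the worst case, the later retries never get a chance to run, so the two settings should be chosen together.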
3. Circuit Breaker Configuration
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 30   # do not eject too many instances
Chaos Testing
Using Chaos Mesh
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one               # pick one matching pod
  selector:
    namespaces:
    - default
    labelSelectors:
      app: reviews
  delay:
    latency: "100ms"
    correlation: "50"
    jitter: "10ms"
  duration: "5m"
Using Litmus
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
spec:
  appinfo:
    appns: default
    applabel: "app=nginx"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: "60"     # seconds
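Troubleshooting
Checking the Configuration
# Validate the Istio configuration, including VirtualServices
istioctl analyze
# Inspect the proxy's cluster configuration
istioctl proxy-config cluster <pod-name>
# Inspect routes
istioctl proxy-config route <pod-name>
# Inspect listeners
istioctl proxy-config listener <pod-name>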
Debugging Commands
# View Envoy statistics
kubectl exec <pod-name> -c istio-proxy -- curl localhost:15000/stats
# View upstream clusters and the status of their endpoints
kubectl exec <pod-name> -c istio-proxy -- curl localhost:15000/clusters
# Dump the current Envoy configuration
kubectl exec <pod-name> -c istio-proxy -- curl localhost:15000/config_dump
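The stats endpoint returns plain text with one `name: value` pair per line, which makes it easy to pull out the counters related to retries, circuit breaking, and outlier ejection. The sketch below parses such text and keeps a few of them. The sample lines and their values are invented for illustration, which counters actually appear depends on how the proxy's stats are configured, and `resilience_counters` is a hypothetical helper.

```python
SAMPLE_STATS = """\
cluster.outbound|9080|v1|reviews.default.svc.cluster.local.outlier_detection.ejections_active: 1
cluster.outbound|9080|v1|reviews.default.svc.cluster.local.upstream_rq_pending_overflow: 4
cluster.outbound|9080|v1|reviews.default.svc.cluster.local.upstream_rq_retry: 12
cluster.outbound|9080|v1|reviews.default.svc.cluster.local.upstream_rq_time: P0(nan,0) P50(nan,2)
server.uptime: 3600
"""

# Envoy cluster stat suffixes worth watching when testing resilience settings.
INTERESTING = (
    "outlier_detection.ejections_active",
    "upstream_rq_pending_overflow",
    "upstream_rq_retry",
    "upstream_rq_timeout",
)


def resilience_counters(stats_text, cluster_fragment):
    """Collect selected integer counters for clusters whose name contains cluster_fragment."""
    counters = {}
    for line in stats_text.splitlines():
        name, sep, value = line.rpartition(": ")
        if not sep or cluster_fragment not in name:
            continue
        for suffix in INTERESTING:
            # Skip histogram lines and anything else that is not a plain integer.
            if name.endswith("." + suffix) and value.isdigit():
                counters[suffix] = int(value)
    return counters


print(resilience_counters(SAMPLE_STATS, "reviews.default"))
```

In practice the text would come from the `/stats` command above, for example by redirecting its output to a file and reading that file into `stats_text`.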
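Summary
This chapter covered:
- Fault injection: delays, aborts, and combined faults
- Failure recovery: timeouts, retries, and circuit breaking
- Chaos testing: Chaos Mesh and Litmus
- Troubleshooting: istioctl commands
The next chapter covers multi-cluster deployment.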