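Chapter 6: Fault Injection and Recovery

Testing a system's resilience is an important part of keeping a service reliable. Istio provides both fault injection and failure recovery features.

Fault Injection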

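Fault injection deliberately introduces failures so you can test how well the system tolerates them. Istio applies the fault in the client-side Envoy proxy, so no application code has to change. Note that timeouts and retries are not applied on a route while a fault is configured on it.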

HTTP Delay Injection

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 7s
    route:
    - destination:
        host: reviews
        subset: v1
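
The routes in this chapter send traffic to named subsets (v1, v2), and a subset only resolves if a DestinationRule defines it. The following is a minimal sketch of such a rule. It assumes the reviews pods carry a `version` label, as in the Bookinfo sample; that label is an assumption, not something this chapter states.

```yaml
# Sketch only: defines the subsets that the VirtualService routes refer to.
# Assumption: pods are labeled version=v1 / version=v2.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
```

If a DestinationRule for the same host already exists, for example one carrying a trafficPolicy, it is usually simpler to add the subsets to that rule than to create a second one.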

HTTP Abort Injection

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - fault:
      abort:
        percentage:
          value: 50
        httpStatus: 500
    route:
    - destination:
        host: reviews
        subset: v1

Combined Faults

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 5s
      abort:
        percentage:
          value: 10
        httpStatus: 500
    route:
    - destination:
        host: reviews
        subset: v1

Injecting Faults for Specific Users

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        x-test-user:
          exact: "chaos-test"
    fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 7s
    route:
    - destination:
        host: reviews
        subset: v1
  - route:
    - destination:
        host: reviews
        subset: v1

Failure Recovery

Timeout Configuration

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2
    timeout: 10s
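
One way to check that a timeout actually fires is to combine it with a delay fault on a downstream dependency. The sketch below assumes that reviews calls a service named `ratings`, as in the Bookinfo sample; that service name and the 2s / 0.5s values are illustrative assumptions. The fault and the timeout sit on different services because a timeout is not applied on a route that also has a fault configured.

```yaml
# Sketch only. Assumption: reviews depends on a service called "ratings".
# Step 1: delay every call to ratings by 2s.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 2s
    route:
    - destination:
        host: ratings
---
# Step 2: give callers of reviews a timeout shorter than the injected delay.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2
    timeout: 0.5s
```

With both rules applied, a caller of reviews should see the request fail after roughly the timeout (typically as an HTTP 504 from its sidecar) instead of waiting for the full 2s delay.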

Retry Policy

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: gateway-error,connect-failure,refused-stream,503

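Common retryOn values:

  • 5xx: any 5xx response
  • gateway-error: 502, 503, and 504 responses
  • reset: the upstream connection is reset
  • connect-failure: the connection to the upstream fails
  • envoy-ratelimited: the request was rate limited
  • retriable-status-codes: custom status codes (in Istio, numeric codes such as 503 can be listed directly in retryOn)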

Circuit Breaker

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 5s
        tcpKeepalive:
          time: 7200s
          interval: 75s
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 25

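Circuit breaker parameters:

Parameter               Description
consecutive5xxErrors    Number of consecutive 5xx errors from a host before it is ejected
interval                Time between outlier-detection sweeps
baseEjectionTime        Base ejection duration; the actual duration grows each time the same host is ejected again
maxEjectionPercent      Maximum percentage of hosts in the pool that can be ejected
minHealthPercent        Minimum percentage of healthy hosts; below it, outlier detection is disabled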
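
To watch the circuit breaker trip in a test environment, it can help to tighten the limits far below production values. The sketch below reuses the fields described above with deliberately small, illustrative numbers; they are assumptions chosen to make the behavior easy to trigger, not recommended settings.

```yaml
# Sketch only: intentionally strict limits for a test environment.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1            # a single upstream connection
      http:
        http1MaxPendingRequests: 1   # at most one queued request
        maxRequestsPerConnection: 1  # no connection reuse
    outlierDetection:
      consecutive5xxErrors: 1        # eject after the first 5xx
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
```

With limits like these, a few concurrent requests should be enough to overflow the pool. Rejected requests normally come back as HTTP 503 and show up in the sidecar's `upstream_rq_pending_overflow` statistic.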

Connection Pool Configuration

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 5s
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10

Failure Recovery Best Practices

1. Timeout Guidelines

# Timeout = service processing time + network latency + buffer
# Suggested: 2-3x the P99 latency
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
    timeout: 30s  # assuming a P99 latency of 10s

2. Retry Guidelines

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
    retries:
      attempts: 3
      perTryTimeout: 5s
      retryOn: gateway-error,connect-failure,refused-stream

Notes:

  • Keep the retry count low (3 or fewer is recommended).
  • Set a reasonable perTryTimeout.
  • Avoid retrying non-idempotent operations.
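
The notes above can be sketched as one VirtualService that retries only requests assumed to be idempotent. Treating GET as idempotent is an assumption about the application, and the 20s overall timeout is an illustrative value derived from the retry settings. Retries are switched off explicitly on the second route because Istio otherwise applies a default retry policy.

```yaml
# Sketch only. Assumption: GET requests to my-service are safe to repeat.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  # Idempotent reads: retries enabled.
  - match:
    - method:
        exact: GET
    route:
    - destination:
        host: my-service
    retries:
      attempts: 3
      perTryTimeout: 5s
      retryOn: gateway-error,connect-failure,refused-stream
    timeout: 20s  # roughly 1 initial try + 3 retries, each up to 5s
  # Everything else (POST, PUT, ...): retries disabled.
  - route:
    - destination:
        host: my-service
    retries:
      attempts: 0
```

The overall timeout bounds the whole request including retries, so setting it below attempts × perTryTimeout would cut later retries short.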

3. Circuit Breaker Configuration

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 30  # avoid ejecting too many instances at once

Chaos Testing

Using Chaos Mesh

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
    - default
    labelSelectors:
      app: reviews
  delay:
    latency: "100ms"
    correlation: "50"
    jitter: "10ms"
  duration: "5m"

Using Litmus

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
spec:
  appinfo:
    appns: default
    applabel: "app=nginx"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: "60"

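Troubleshooting

Checking Configuration

# Validate VirtualService and other Istio configuration
istioctl analyze

# Inspect the proxy's cluster configuration
istioctl proxy-config cluster <pod-name>

# Inspect routes
istioctl proxy-config route <pod-name>

# Inspect listeners
istioctl proxy-config listener <pod-name>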

Debugging Commands

# View Envoy statistics
kubectl exec <pod-name> -c istio-proxy -- curl localhost:15000/stats

# View upstream clusters and their endpoint status
kubectl exec <pod-name> -c istio-proxy -- curl localhost:15000/clusters

# Dump the current configuration
kubectl exec <pod-name> -c istio-proxy -- curl localhost:15000/config_dump

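Summary

This chapter covered:

  • Fault injection: delays, aborts, and combined faults
  • Failure recovery: timeouts, retries, and circuit breaking
  • Chaos testing: Chaos Mesh and Litmus
  • Troubleshooting: istioctl commands

The next chapter covers multi-cluster deployment.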