
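Chapter 5: Observability

Istio provides a complete set of observability capabilities, covering metrics, logs, and distributed tracing.

The Three Pillars of Observability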

┌─────────────────────────────────────────────────┐
│                  Observability                  │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────┐  │
│  │   Metrics   │  │    Logs     │  │ Traces  │  │
│  │             │  │             │  │         │  │
│  │ Prometheus  │  │  ELK/Loki   │  │ Jaeger  │  │
│  └─────────────┘  └─────────────┘  └─────────┘  │
│                                                 │
└─────────────────────────────────────────────────┘

Metrics Collection

Built-in Metrics

Istio automatically generates the following metrics:

Request metrics:

  • istio_requests_total: total number of requests
  • istio_request_duration_milliseconds: request latency
  • istio_request_bytes: request size
  • istio_response_bytes: response size

TCP metrics:

  • istio_tcp_connections_opened_total
  • istio_tcp_connections_closed_total
  • istio_tcp_received_bytes_total
  • istio_tcp_sent_bytes_total
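To make the shape of these metrics concrete, here is a minimal sketch that parses a few lines in the Prometheus text exposition format and sums istio_requests_total per response code. The sample lines and their label values are invented for illustration; real sidecar output carries many more labels.

```python
import re
from collections import defaultdict

# Hypothetical sample in the Prometheus text exposition format; values are invented.
SAMPLE = """\
istio_requests_total{response_code="200",destination_workload="reviews"} 120
istio_requests_total{response_code="503",destination_workload="reviews"} 3
istio_requests_total{response_code="200",destination_workload="ratings"} 45
istio_tcp_connections_opened_total{destination_workload="mongodb"} 7
"""

# metric_name{label="value",...} sample_value
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def requests_by_code(text):
    """Sum istio_requests_total samples per response_code label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group("name") != "istio_requests_total":
            continue  # skip other metrics, comments, and blank lines
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get("response_code", "unknown")] += float(match.group("value"))
    return dict(totals)


print(requests_by_code(SAMPLE))
```

Each distinct combination of label values is a separate time series, which is why queries usually aggregate with sum(...) by (...).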

Prometheus Integration

# Prometheus configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    # Scrape mesh metrics directly from each Envoy sidecar
    - job_name: 'envoy-stats'
      metrics_path: /stats/prometheus
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      # Keep only container ports whose name ends in -envoy-prom
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: '.*-envoy-prom'
        action: keep
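The relabel rule with action: keep decides which discovered targets Prometheus actually scrapes. The sketch below models that decision in Python, assuming the documented behavior that source label values are joined with ";" and that the regex must match the whole joined string. The target dictionaries and the rule values are illustrative, and Python's re module stands in for the RE2 engine Prometheus uses.

```python
import re


def keep(target_labels, source_labels, regex, separator=";"):
    """Return True if a target survives a relabel rule with action: keep."""
    joined = separator.join(target_labels.get(name, "") for name in source_labels)
    # Prometheus anchors relabel regexes at both ends, hence fullmatch
    return re.fullmatch(regex, joined) is not None


# Hypothetical discovered targets
targets = [
    {"__meta_kubernetes_service_name": "istiod",
     "__meta_kubernetes_endpoint_port_name": "http-monitoring"},
    {"__meta_kubernetes_service_name": "istiod",
     "__meta_kubernetes_endpoint_port_name": "grpc-xds"},
    {"__meta_kubernetes_service_name": "kube-dns",
     "__meta_kubernetes_endpoint_port_name": "metrics"},
]

# Illustrative rule: keep only the monitoring port of the istiod service
rule = (
    ["__meta_kubernetes_service_name", "__meta_kubernetes_endpoint_port_name"],
    "istiod;http-monitoring",
)

kept = [t for t in targets if keep(t, *rule)]
print(len(kept), "of", len(targets), "targets kept")
```

Because the match is anchored, a regex such as istiod does not keep a service named istiod-canary unless the pattern allows for the suffix.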

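Custom Metrics

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-metrics
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    # Add request method and path as dimensions of the request count metric
    - match:
        metric: REQUEST_COUNT
      tagOverrides:
        request_method:
          value: request.method
        request_path:
          value: request.path
    # Override the destination_service dimension of the request duration metric
    - match:
        metric: REQUEST_DURATION
      tagOverrides:
        destination_service:
          value: destination.service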

Grafana Dashboards

# Install Grafana
kubectl apply -f samples/addons/grafana.yaml

# Forward the port
kubectl port-forward -n istio-system svc/grafana 3000:3000

# Open the UI
open http://localhost:3000

Commonly used dashboards:

  • Istio Mesh Dashboard
  • Istio Service Dashboard
  • Istio Workload Dashboard
  • Istio Performance Dashboard

Distributed Tracing

Configuring Tracing

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100  # sampling rate: 100%
        zipkin:
          address: zipkin.istio-system:9411
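The sampling value is a percentage: 100 traces every request, which suits demos, while Istio's default of 1 keeps tracing overhead low. As a rough model only (not Envoy's actual implementation), the decision can be pictured as a random draw per request; the function names here are hypothetical.

```python
import random


def should_sample(sampling_percent, rng=random):
    """Decide whether to trace one request, given a sampling percentage (0 to 100)."""
    return rng.random() * 100.0 < sampling_percent


def sampled_fraction(sampling_percent, requests=100_000, seed=42):
    """Simulate many requests and return the fraction that would be traced."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    hits = sum(should_sample(sampling_percent, rng) for _ in range(requests))
    return hits / requests


for percent in (100, 10, 1):
    print(f"sampling={percent}: traced fraction {sampled_fraction(percent):.4f}")
```

At low sampling rates a rare failing request is unlikely to be traced, so the rate is a trade-off between trace coverage and the storage and CPU spent on spans.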

Deploying Jaeger

# Install Jaeger
kubectl apply -f samples/addons/jaeger.yaml

# Forward the port
kubectl port-forward -n istio-system svc/tracing 16686:80

# Open the UI
open http://localhost:16686

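Custom Trace Spans

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Register an SDK tracer provider; without one, the API hands out non-recording spans
trace.set_tracer_provider(TracerProvider())

# Get a tracer
tracer = trace.get_tracer(__name__)

# Create a span
with tracer.start_as_current_span("custom-operation") as span:
    span.set_attribute("user.id", "12345")
    span.set_attribute("operation.type", "database")
    # Business logic goes here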

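Trace Context Propagation

from opentelemetry.propagate import inject, extract

# Client side: inject the current trace context into the outgoing request headers
headers = {}
inject(headers)

# Server side: extract the trace context from the incoming request headers
# (`request` is the web framework's request object)
context = extract(request.headers)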
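The sidecar cannot tell which outbound call belongs to which inbound request, so the application has to copy the trace headers from the request it received onto the requests it makes. For services that do not use a tracing SDK, a minimal sketch of that forwarding step follows. The header list covers the B3 and W3C Trace Context headers plus x-request-id; the function name and the sample header values are hypothetical.

```python
# Headers that carry trace context through the mesh (B3 and W3C Trace Context)
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
)


def forward_trace_headers(incoming):
    """Return the subset of incoming headers to copy onto outbound requests."""
    # HTTP header names are case-insensitive, so normalize before the lookup
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


# Hypothetical inbound request headers
incoming = {
    "X-Request-Id": "abc",
    "X-B3-TraceId": "463ac35c9f6413ad",
    "Content-Type": "application/json",
}
outbound_headers = forward_trace_headers(incoming)
print(outbound_headers)
```

The returned dictionary can be merged into the headers of any HTTP client call the service makes while handling the request.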

Access Logs

Enabling Access Logs

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default-telemetry
spec:
  accessLogging:
  - providers:
    - name: otel

Custom Log Format

# The log format is set on the provider in meshConfig; a Telemetry resource
# then selects the provider by name
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    extensionProviders:
    - name: otel
      envoyOtelAls:
        service: opentelemetry-collector.monitoring.svc.cluster.local
        port: 4317
        logFormat:
          text: |
            [%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%"
            %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT%
            %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% "%REQ(X-FORWARDED-FOR)%"
            "%REQ(USER-AGENT)%" "%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%"

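ELK Integration

# Emit each access log entry as a single line of JSON so that the ELK pipeline can parse it
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    extensionProviders:
    - name: otel
      envoyOtelAls:
        service: opentelemetry-collector.monitoring.svc.cluster.local
        port: 4317
        logFormat:
          text: '{"timestamp":"%START_TIME%","method":"%REQ(:METHOD)%","path":"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%","status":%RESPONSE_CODE%,"duration":%DURATION%}'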

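Structured logs pay off when something downstream consumes them. As a sketch, assuming each entry arrives as one line of JSON with the timestamp, method, path, status, and duration fields used above, a consumer could flag failed or slow requests as follows. The sample lines and the 500 ms threshold are invented for illustration.

```python
import json

# Hypothetical access log lines in the JSON shape described above
SAMPLE_LINES = [
    '{"timestamp":"2024-01-01T00:00:00.000Z","method":"GET","path":"/api/products","status":200,"duration":12}',
    '{"timestamp":"2024-01-01T00:00:01.000Z","method":"GET","path":"/api/ratings","status":503,"duration":3}',
    '{"timestamp":"2024-01-01T00:00:02.000Z","method":"POST","path":"/api/reviews","status":200,"duration":812}',
    'not a json line',
]


def slow_or_failed(lines, max_ms=500):
    """Return the paths of requests that failed with 5xx or took longer than max_ms."""
    flagged = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if entry.get("status", 0) >= 500 or entry.get("duration", 0) > max_ms:
            flagged.append(entry.get("path"))
    return flagged


print(slow_or_failed(SAMPLE_LINES))
```

In an ELK deployment the same filtering would normally be expressed as a query over the indexed fields rather than in application code.
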
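Visualization with Kiali

Installing Kiali

# Install Kiali
kubectl apply -f samples/addons/kiali.yaml

# Open the dashboard (istioctl sets up the port forwarding)
istioctl dashboard kiali

# Or forward the port manually
kubectl port-forward -n istio-system svc/kiali 20001:20001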

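Kiali Features

  • Service topology graph: visualizes the dependencies between services
  • Health status: monitors service health in real time
  • Configuration validation: checks Istio configuration for errors
  • Traffic animation: shows where live traffic is flowing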

Viewing the Service Topology

# Open from the CLI
istioctl dashboard kiali

# Or, with the port forwarded, browse to
open http://localhost:20001/kiali

Hands-On: A Complete Observability Stack

# Deploy the complete monitoring stack
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: default
  meshConfig:
    enableTracing: true
    accessLogFile: /dev/stdout
    defaultConfig:
      tracing:
        sampling: 100
        zipkin:
          address: zipkin.monitoring:9411
    extensionProviders:
    - name: otel
      envoyOtelAls:
        service: opentelemetry-collector.monitoring.svc.cluster.local
        port: 4317

Metric Query Examples

Prometheus Queries

# Request rate
rate(istio_requests_total[5m])

# P99 latency (aggregate the buckets by le before taking the quantile)
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le)
)

# Error rate
sum(rate(istio_requests_total{response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total[5m]))

# Traffic between services
sum(rate(istio_requests_total[5m])) by (source_workload, destination_workload)
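The P99 query estimates a quantile from cumulative histogram buckets rather than from individual request timings. The sketch below reproduces the core of that estimate, linear interpolation inside the bucket that contains the target rank, under simplifying assumptions: classic cumulative buckets with a +Inf bucket, and a quantile strictly between 0 and 1 inclusive of 1. The bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must contain a bucket with upper bound math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls beyond the last finite bucket
                return prev_bound
            # Assume observations are spread evenly within the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in milliseconds: (le, cumulative count)
latency_buckets = [(50, 900), (100, 980), (250, 995), (500, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, latency_buckets)
print(f"estimated P99: {p99:.1f} ms")
```

With these counts the 99th percentile falls in the 100 to 250 ms bucket, so the estimate is interpolated to about 200 ms. The result can never be more precise than the bucket boundaries allow.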

Alerting Rules

groups:
- name: istio-alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(istio_requests_total{response_code=~"5.."}[5m])) 
      / sum(rate(istio_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value | humanize }}"

  - alert: HighLatency
    expr: |
      histogram_quantile(0.99,
        sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le)
      ) > 1000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency detected"
      description: "P99 latency is {{ $value }}ms"
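The for: 5m field means an alert does not fire the moment its expression becomes true: it stays pending until the condition has held for the whole duration, and it resets as soon as the condition clears. The sketch below models that state machine, assuming one evaluation per step; the function name and the sample values are hypothetical.

```python
def alert_states(values, threshold, for_steps):
    """Return the alert state after each evaluation of a threshold rule.

    for_steps is the `for` duration expressed in evaluation intervals.
    """
    states = []
    breached = 0  # consecutive evaluations above the threshold
    for value in values:
        if value > threshold:
            breached += 1
            # The first breach starts the clock; the alert fires once
            # for_steps further intervals have passed without recovery
            states.append("firing" if breached > for_steps else "pending")
        else:
            breached = 0
            states.append("inactive")
    return states


# Hypothetical error-rate samples, evaluated against the 5% threshold
error_rates = [0.01, 0.06, 0.07, 0.08, 0.02, 0.09]
print(alert_states(error_rates, threshold=0.05, for_steps=2))
```

A longer for duration filters out short spikes at the cost of slower notification, which is why the critical error-rate rule and the warning latency rule can reasonably use different values.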

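Summary

Istio's observability features provide:

  • Metrics: automatic collection of request and TCP metrics
  • Tracing: distributed tracing of call chains
  • Logs: collection of access logs
  • Visualization: the Kiali service topology

The next chapter covers fault injection and recovery.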