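Chapter 5: Observability
Istio provides observability in three forms: metrics, logs, and distributed tracing. The Envoy sidecar proxies generate most of this telemetry, so services get it without changes to application code.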
The three pillars of observability
```text
┌──────────────────────────────────────────────────┐
│                  Observability                   │
├──────────────────────────────────────────────────┤
│                                                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────┐  │
│  │   Metrics    │ │     Logs     │ │  Traces  │  │
│  │              │ │              │ │          │  │
│  │  Prometheus  │ │   ELK/Loki   │ │  Jaeger  │  │
│  └──────────────┘ └──────────────┘ └──────────┘  │
│                                                  │
└──────────────────────────────────────────────────┘
```
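Metrics collection
Built-in metrics
The Istio sidecar proxies generate the following standard metrics automatically:
Request metrics (HTTP, HTTP/2, and gRPC):
- istio_requests_total: number of requests (counter)
- istio_request_duration_milliseconds: request latency (histogram)
- istio_request_bytes: request body size (histogram)
- istio_response_bytes: response body size (histogram)
TCP metrics:
- istio_tcp_connections_opened_total: connections opened (counter)
- istio_tcp_connections_closed_total: connections closed (counter)
- istio_tcp_received_bytes_total: bytes received (counter)
- istio_tcp_sent_bytes_total: bytes sent (counter)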
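To see what these metrics look like on the wire, the sketch below parses the Prometheus text exposition format, which is what a sidecar serves at `:15020/stats/prometheus`, and sums `istio_requests_total` by one label. The scrape output and its values are made up, and the parser is a simplified assumption (it ignores timestamps, does not unescape label values, and does not handle `}` inside them), not a complete implementation of the format.

```python
import re
from collections import defaultdict

# name{labels} value  (optional trailing timestamp is ignored in this sketch)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        m = SAMPLE_RE.match(line)
        if m:
            labels = dict(LABEL_RE.findall(m.group("labels") or ""))
            yield m.group("name"), labels, float(m.group("value"))


def sum_by(text, metric, label):
    """Sum one metric's raw sample values grouped by a label."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical scrape output (values made up for illustration)
SAMPLE = """
# TYPE istio_requests_total counter
istio_requests_total{reporter="destination",response_code="200",destination_workload="reviews-v1"} 120
istio_requests_total{reporter="destination",response_code="503",destination_workload="reviews-v1"} 3
istio_requests_total{reporter="destination",response_code="200",destination_workload="ratings-v1"} 80
istio_tcp_sent_bytes_total{reporter="destination",destination_workload="mysql-v1"} 4096
"""

by_code = sum_by(SAMPLE, "istio_requests_total", "response_code")
print(by_code)  # {'200': 200.0, '503': 3.0}
```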
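Prometheus integration
The `samples/addons/prometheus.yaml` add-on in the Istio release installs a Prometheus that is already configured for Istio. When writing the scrape configuration yourself, scrape the Envoy sidecars directly: the sidecar injector exposes Envoy's metrics on a container port named `http-envoy-prom`, which the job below selects.
```yaml
# Prometheus configuration: scrape Envoy metrics from every sidecar-injected pod
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'envoy-stats'
        metrics_path: /stats/prometheus
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only the Envoy metrics port of each pod
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            regex: '.*-envoy-prom'
            action: keep
```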
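The `keep` relabel action in the scrape job above decides which discovered targets are scraped at all. A minimal sketch of its semantics, assuming the standard Prometheus behaviour: the values of `source_labels` are joined with `;`, a missing label counts as an empty string, and the regex must match the whole joined value. The function name and the pod and port names in `targets` are hypothetical.

```python
import re


def relabel_keep(targets, source_labels, regex, separator=";"):
    """Keep the targets whose joined source-label values fully match `regex`,
    following the semantics of Prometheus `action: keep`."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # Missing labels count as empty strings
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):  # Prometheus anchors the regex at both ends
            kept.append(labels)
    return kept


# Hypothetical pod targets discovered by kubernetes_sd_configs (one per container port)
targets = [
    {"__meta_kubernetes_pod_name": "reviews-v1-abc",
     "__meta_kubernetes_pod_container_port_name": "http-envoy-prom"},
    {"__meta_kubernetes_pod_name": "reviews-v1-abc",
     "__meta_kubernetes_pod_container_port_name": "http"},
    {"__meta_kubernetes_pod_name": "ratings-v1-xyz",
     "__meta_kubernetes_pod_container_port_name": "http-envoy-prom"},
]

kept = relabel_keep(targets, ["__meta_kubernetes_pod_container_port_name"], ".*-envoy-prom")
print([t["__meta_kubernetes_pod_name"] for t in kept])  # ['reviews-v1-abc', 'ratings-v1-xyz']
```

Because the match is anchored, a regex of plain `http` would keep only the port named exactly `http`, not `http-envoy-prom`.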
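Custom metrics
The Telemetry API adds dimensions (labels) to the standard metrics with `tagOverrides`; each `value` is an expression over the request's attributes.
```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-metrics
  namespace: istio-system   # in the root namespace, the resource applies mesh-wide
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        # Add method and path labels to istio_requests_total
        - match:
            metric: REQUEST_COUNT
          tagOverrides:
            request_method:
              value: request.method
            request_path:
              value: request.path
        # Add the request host as a label on istio_request_duration_milliseconds
        - match:
            metric: REQUEST_DURATION
          tagOverrides:
            request_host:
              value: request.host
```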
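Every distinct combination of label values becomes its own time series, so a tag such as `request_path`, built from `request.path`, can multiply the series count when paths embed IDs. The sketch below counts series for hypothetical traffic of 1,000 `GET /api/users/<id>` requests plus one `POST`; the label names and request data are illustrative assumptions, and a real metric multiplies this further by its other labels (source, destination, reporter, and so on).

```python
def series_count(requests, labels):
    """Number of distinct label-value combinations, i.e. time series, that
    these requests would create on a metric carrying `labels`."""
    return len({tuple(req.get(label, "unknown") for label in labels) for req in requests})


# Hypothetical traffic: 1,000 user lookups with the user ID in the path, plus one create
requests = [
    {"request_method": "GET", "request_path": f"/api/users/{uid}", "response_code": "200"}
    for uid in range(1000)
] + [{"request_method": "POST", "request_path": "/api/users", "response_code": "201"}]

without_path = series_count(requests, ["request_method", "response_code"])
with_path = series_count(requests, ["request_method", "response_code", "request_path"])
print(without_path, with_path)  # 2 1001
```

A bounded attribute such as `request.method` adds only a handful of series; a path that contains a user ID adds one series per user.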
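Grafana dashboards
```bash
# Install Grafana (sample add-on shipped with the Istio release)
kubectl apply -f samples/addons/grafana.yaml
# Port-forward to the Grafana service
kubectl port-forward -n istio-system svc/grafana 3000:3000
# Open it in a browser (`open` is macOS; use xdg-open on Linux)
open http://localhost:3000
```
Alternatively, `istioctl dashboard grafana` sets up the port-forward and opens the browser in one step.
Commonly used dashboards:
- Istio Mesh Dashboard
- Istio Service Dashboard
- Istio Workload Dashboard
- Istio Performance Dashboard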
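Distributed tracing
Configuring tracing
Enable tracing in the mesh configuration and install it with `istioctl install -f <file>`:
```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100   # percentage of requests to trace; 100 suits testing, the default is 1
        zipkin:
          address: zipkin.istio-system:9411
```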
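The `sampling` value is a percentage. The sketch below is a simplified model of a percentage sampler that also honours a decision already made upstream (carried in the B3 `x-b3-sampled` header), so that a trace is either kept on every hop or dropped on every hop. It illustrates the idea only and is not Envoy's code; the function name and the injectable `rng` parameter are hypothetical.

```python
import random


def should_sample(headers, sampling_percent, rng=random.random):
    """Simplified parent-based sampling decision (not Envoy's implementation).

    headers: incoming request headers (any case).
    sampling_percent: like the mesh `sampling` setting, 0 to 100.
    rng: zero-argument function returning a float in [0, 1).
    """
    upstream = {k.lower(): v for k, v in headers.items()}.get("x-b3-sampled")
    if upstream in ("1", "true"):
        return True   # an earlier hop already chose to sample this trace
    if upstream in ("0", "false"):
        return False  # an earlier hop already chose to drop it
    # No upstream decision: this hop starts the trace and makes the choice
    return rng() * 100 < sampling_percent


# Estimate how many of 100,000 new requests are sampled at 1%
rng = random.Random(7)
sampled = sum(should_sample({}, 1, rng.random) for _ in range(100_000))
print(sampled)  # roughly 1% of the requests
```

At a 1% rate, a service handling 2,000 requests per second starts about 20 sampled traces per second, which is why a 100% rate is usually reserved for testing.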
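Deploying Jaeger
```bash
# Install Jaeger
kubectl apply -f samples/addons/jaeger.yaml
# Port-forward the UI (the "tracing" service maps port 80 to the UI's port 16686)
kubectl port-forward -n istio-system svc/tracing 16686:80
# Open the UI (or run `istioctl dashboard jaeger` instead of the two steps above)
open http://localhost:16686
```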
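Custom trace spans
The sidecars record spans only at the proxies. To trace work inside a service, create spans in the application, for example with the OpenTelemetry Python SDK:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Register an SDK tracer provider. Without a span processor and exporter,
# spans are recorded but not sent anywhere.
trace.set_tracer_provider(TracerProvider())

# Get a tracer
tracer = trace.get_tracer(__name__)

# Create a span around the custom operation
with tracer.start_as_current_span("custom-operation") as span:
    span.set_attribute("user.id", "12345")
    span.set_attribute("operation.type", "database")
    # business logic goes here
```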
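Trace context propagation
A sidecar cannot tell which outbound calls belong to which inbound request, so the application must copy the trace headers from each inbound request onto the outbound requests it makes; otherwise each hop appears as a separate trace. With OpenTelemetry:
```python
from opentelemetry.propagate import inject, extract

# Client side: inject the current trace context into outbound request headers
headers = {}
inject(headers)
# then send them, e.g. requests.get(url, headers=headers)

# Server side: extract the context from the inbound request's headers
# (`request` is the web framework's request object; `tracer` as created above)
context = extract(request.headers)
with tracer.start_as_current_span("handle-request", context=context):
    ...  # handle the request
```
OpenTelemetry's default propagator writes W3C `traceparent`/`tracestate` headers; make sure this matches the header format the mesh's tracer expects (for example, the Zipkin tracer uses B3 headers).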
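Whatever the SDK, the essential step is copying trace headers from the inbound request onto every outbound request. A framework-agnostic sketch; the header list is an assumed subset of the headers the Istio documentation asks applications to forward (B3, W3C Trace Context, and `x-request-id`), and the function name is hypothetical.

```python
# Trace headers to copy from the inbound request to outbound requests.
# Assumed subset of the list in the Istio distributed-tracing docs.
TRACE_HEADERS = (
    "x-request-id",
    "traceparent",
    "tracestate",
    "b3",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def forward_trace_headers(incoming):
    """Return the trace headers found in `incoming` (matched case-insensitively),
    ready to be merged into the headers of an outbound request."""
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


# Hypothetical inbound headers as seen by the `reviews` service
incoming = {
    "Host": "reviews:9080",
    "Content-Type": "application/json",
    "X-Request-Id": "a1b2c3",
    "X-B3-TraceId": "80f198ee56343ba864fe8b2a57d3eff7",
    "X-B3-Sampled": "1",
}
outgoing = forward_trace_headers(incoming)
print(outgoing)
```

Only the trace headers are copied; `Host` and `Content-Type` belong to the inbound request and are left out.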
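Access logs
Enabling access logs
The built-in `envoy` provider writes access logs to each sidecar's stdout, where `kubectl logs <pod> -c istio-proxy` shows them. A provider such as `otel` must first be defined under `meshConfig.extensionProviders`, as in the complete-stack example later in this chapter.
```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default-telemetry
  namespace: istio-system   # in the root namespace, the resource applies mesh-wide
spec:
  accessLogging:
    - providers:
        - name: envoy
```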
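Customizing the log format
A Telemetry resource only selects a provider; the log format belongs to an `envoyFileAccessLog` extension provider in the mesh configuration. Define the provider, then enable it:
```yaml
# 1. Mesh config (install with istioctl install -f): a text access-log provider
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    extensionProviders:
      - name: custom-text-log
        envoyFileAccessLog:
          path: /dev/stdout
          logFormat:
            # Folded scalar (>): the lines below are joined into one log line
            text: >
              [%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%"
              %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT%
              %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% "%REQ(X-FORWARDED-FOR)%"
              "%REQ(USER-AGENT)%" "%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%"
---
# 2. Telemetry resource (kubectl apply -f): enable that provider
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-logging
spec:
  accessLogging:
    - providers:
        - name: custom-text-log
```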
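Assuming the format above is written as one line per request, the sketch below parses such a line back into named fields with a regular expression. The sample line is made up; Envoy prints `-` for values that are not available, which the parser maps to `None` for the upstream service time.

```python
import re

# One group per format operator, in the order of the text format above
LINE_RE = re.compile(
    r'\[(?P<start_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d+) (?P<flags>\S+) (?P<bytes_received>\d+) (?P<bytes_sent>\d+) '
    r'(?P<duration>\d+) (?P<upstream_time>\S+) '
    r'"(?P<xff>[^"]*)" "(?P<user_agent>[^"]*)" "(?P<request_id>[^"]*)" '
    r'"(?P<authority>[^"]*)" "(?P<upstream_host>[^"]*)"'
)


def parse_access_log(line):
    """Parse one access-log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    for key in ("code", "bytes_received", "bytes_sent", "duration"):
        entry[key] = int(entry[key])
    # Envoy prints "-" when a value is unavailable (e.g. no upstream was reached)
    upstream = entry["upstream_time"]
    entry["upstream_time"] = None if upstream == "-" else int(upstream)
    return entry


# Hypothetical log line in the format above
line = ('[2024-05-01T08:00:00.000Z] "GET /reviews/0 HTTP/1.1" 200 - 0 295 12 11 '
        '"-" "curl/8.5.0" "a1b2c3" "reviews:9080" "10.244.1.7:9080"')
entry = parse_access_log(line)
print(entry["method"], entry["path"], entry["code"], entry["duration"], entry["upstream_time"])
# GET /reviews/0 200 12 11
```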
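ELK integration
For Elasticsearch, emit structured JSON instead of text by using `labels`: each key becomes a field of a JSON object, one object per line, which a log collector can ship to Elasticsearch without further parsing.
```yaml
# 1. Mesh config (install with istioctl install -f): a JSON access-log provider
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    extensionProviders:
      - name: json-log
        envoyFileAccessLog:
          path: /dev/stdout
          logFormat:
            labels:
              timestamp: "%START_TIME%"
              method: "%REQ(:METHOD)%"
              path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
              status: "%RESPONSE_CODE%"
              duration: "%DURATION%"
---
# 2. Telemetry resource (kubectl apply -f): enable that provider
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: elk-logging
spec:
  accessLogging:
    - providers:
        - name: json-log
```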
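JSON access logs can also be analysed directly, for example from `kubectl logs <pod> -c istio-proxy`, before or instead of shipping them to Elasticsearch. A sketch that skips non-JSON lines (the proxy container also prints Envoy's own messages), then computes the 5xx error rate and finds the slowest request. The field names `status`, `duration`, and `path` follow the format above; the log lines are hypothetical.

```python
import json


def load_entries(lines):
    """Parse JSON access-log lines, skipping anything that is not a JSON object."""
    entries = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a plain-text Envoy log message
        if isinstance(entry, dict):
            entries.append(entry)
    return entries


def error_rate(entries):
    """Fraction of entries with a 5xx status; status may be a number or a string."""
    if not entries:
        return 0.0
    errors = sum(1 for e in entries if 500 <= int(e["status"]) <= 599)
    return errors / len(entries)


def slowest(entries):
    """The entry with the largest duration (milliseconds)."""
    return max(entries, key=lambda e: int(e["duration"]))


# Hypothetical log lines, mixing JSON access logs with a plain-text proxy message
log_lines = [
    '{"timestamp":"2024-05-01T08:00:00.000Z","method":"GET","path":"/reviews/0","status":200,"duration":12}',
    '{"timestamp":"2024-05-01T08:00:01.000Z","method":"GET","path":"/ratings/0","status":503,"duration":3}',
    "2024-05-01T08:00:01.500Z info Envoy proxy is ready",
    '{"timestamp":"2024-05-01T08:00:02.000Z","method":"GET","path":"/details/0","status":"200","duration":"48"}',
    '{"timestamp":"2024-05-01T08:00:03.000Z","method":"POST","path":"/reviews","status":201,"duration":20}',
]

entries = load_entries(log_lines)
print(len(entries), error_rate(entries), slowest(entries)["path"])  # 4 0.25 /details/0
```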
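Kiali visualization
Installing Kiali
```bash
# Install Kiali
kubectl apply -f samples/addons/kiali.yaml
# Open the dashboard (istioctl sets up the port-forward itself)
istioctl dashboard kiali
# or port-forward manually
kubectl port-forward -n istio-system svc/kiali 20001:20001
```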
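Kiali features
- Service topology graph: visualizes the dependencies between services
- Health status: monitors service health in real time
- Configuration validation: detects errors in Istio configuration
- Traffic animation: shows live traffic flowing through the mesh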
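Kiali builds its topology from the Istio telemetry stored in Prometheus, such as the `source_workload` and `destination_workload` labels of `istio_requests_total`. A toy version of that idea: given hypothetical per-edge request rates (the kind of result a query like `sum(rate(istio_requests_total[5m])) by (source_workload, destination_workload)` returns), build the dependency graph and find everything a workload depends on. The workload names and rates are illustrative Bookinfo-style values, not real data, and the function names are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """edges: iterable of (source_workload, destination_workload, requests_per_second)."""
    graph = defaultdict(dict)
    for src, dst, rps in edges:
        graph[src][dst] = graph[src].get(dst, 0.0) + rps
    return graph


def dependencies(graph, start):
    """Every workload reachable from `start` (its direct and transitive dependencies)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, {}):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def inbound_rps(graph, node):
    """Total request rate arriving at `node` from all callers."""
    return sum(dsts.get(node, 0.0) for dsts in graph.values())


# Hypothetical per-edge request rates (req/s)
edges = [
    ("istio-ingressgateway", "productpage-v1", 10.0),
    ("productpage-v1", "details-v1", 10.0),
    ("productpage-v1", "reviews-v2", 5.0),
    ("productpage-v1", "reviews-v3", 5.0),
    ("reviews-v2", "ratings-v1", 5.0),
    ("reviews-v3", "ratings-v1", 5.0),
]
graph = build_graph(edges)
print(sorted(dependencies(graph, "productpage-v1")))
# ['details-v1', 'ratings-v1', 'reviews-v2', 'reviews-v3']
print(inbound_rps(graph, "ratings-v1"))  # 10.0
```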
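Viewing the service topology
In the Kiali UI, open the Graph page and select one or more namespaces to see the services and the traffic flowing between them.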
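Hands-on: a complete observability stack
This configuration assumes that a Zipkin-compatible tracing backend and an OpenTelemetry Collector are already running in the `monitoring` namespace.
```yaml
# Deploy the complete monitoring stack
# Namespace: apply with kubectl; IstioOperator: install with `istioctl install -f`
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: default
  meshConfig:
    enableTracing: true
    accessLogFile: /dev/stdout        # default text access log on every sidecar
    defaultConfig:
      tracing:
        sampling: 100                 # 100% for testing; lower it in production
        zipkin:
          address: zipkin.monitoring:9411
    extensionProviders:
      - name: otel                    # referenced as "otel" by Telemetry resources
        envoyOtelAls:
          service: opentelemetry-collector.monitoring.svc.cluster.local
          port: 4317
```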
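Metric query examples
Prometheus queries
Both the client and the server sidecar report each request (`reporter="source"` and `reporter="destination"`), so the queries filter on one reporter to avoid counting requests twice.
```promql
# Request rate per service (requests per second, averaged over 5 minutes)
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service)

# P99 latency in milliseconds (aggregate the buckets by `le` before taking the quantile)
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le)
)

# Error rate (fraction of 5xx responses)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total{reporter="destination"}[5m]))

# Service-to-service traffic
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (source_workload, destination_workload)
```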
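`histogram_quantile` does not see individual latencies; it estimates the quantile from cumulative bucket counts by linear interpolation inside the bucket where the target rank falls. The sketch below reimplements that estimate, assuming non-negative bucket bounds and a final `+Inf` bucket; the bucket counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: [(upper_bound, cumulative_count), ...] sorted by upper_bound and
    ending with (math.inf, total_count), i.e. the `le` buckets of a histogram.
    Mirrors the interpolation of Prometheus's histogram_quantile(), simplified:
    bucket bounds are assumed non-negative.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    # First finite bucket whose cumulative count reaches the rank
    idx = next(
        (i for i, (_, count) in enumerate(buckets[:-1]) if count >= rank),
        len(buckets) - 1,
    )
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket: return the highest finite bound
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly inside the bucket
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical latency histogram (ms): 100 requests in total
buckets = [(100, 50), (200, 90), (500, 99), (math.inf, 100)]
for q in (0.5, 0.7, 0.99):
    print(q, histogram_quantile(q, buckets))
```

The estimate is only as precise as the bucket boundaries: a P99 that lands in the 200 to 500 ms bucket is an interpolated value, not a measured one.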
Prometheus alerting rules
These rules use the Prometheus rule-file format; load them through `rule_files` in the Prometheus configuration.
```yaml
groups:
  - name: istio-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
            / sum(rate(istio_requests_total{reporter="destination"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le)
          ) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is {{ $value }}ms"
```
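The `for: 5m` clause means the condition must hold on every evaluation for five minutes before the alert fires; until then the alert is `pending`, and a single evaluation below the threshold resets it. A sketch of that state machine, assuming a one-minute evaluation interval and a made-up series of error-rate values; the function name is hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Simulate a Prometheus alerting rule `expr > threshold` with a `for` duration.

    samples: [(eval_timestamp_seconds, expr_value), ...] in time order.
    Returns the alert state after each evaluation.
    """
    states = []
    active_since = None  # when the condition became continuously true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any evaluation below the threshold resets the clock
            states.append("inactive")
    return states


# Hypothetical error-rate values, one evaluation per minute
error_rates = [0.01, 0.08, 0.07, 0.09, 0.06, 0.07, 0.08, 0.02]
samples = [(60 * i, v) for i, v in enumerate(error_rates)]
for (ts, v), state in zip(samples, alert_states(samples, 0.05, 300)):
    print(f"t={ts:>3}s error_rate={v:.2f} {state}")
```

The condition first holds at t=60s, so the alert stays pending until t=360s, when it has held for the full 300 seconds.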
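Summary
Istio observability provides:
- Metrics: request and TCP metrics collected automatically
- Tracing: distributed tracing of call chains
- Logs: access log collection
- Visualization: service topology in Kiali
In the next chapter we will look at fault injection and recovery.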