Chapter 7: Alerting and Monitoring
This chapter covers alerting configuration and monitoring options for a distributed tracing system.
Alerting Scenarios
Service Performance Alerts
# SkyWalking alarm rule configuration (alarm-settings.yml)
rules:
  # Response time alert
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    message: "Service {name} response time exceeded 1s"
  # Error rate alert
  service_error_rate_rule:
    metrics-name: service_error_rate
    op: ">"
    threshold: 1
    period: 10
    count: 3
    message: "Service {name} error rate exceeded 1%"
  # Throughput alert
  service_throughput_rule:
    metrics-name: service_throughput
    op: "<"
    threshold: 10
    period: 10
    count: 3
    message: "Service {name} throughput dropped below 10 calls/min"
Endpoint Alerts
# Additional rules under the same rules: block
# Endpoint response time
endpoint_resp_time_rule:
  metrics-name: endpoint_resp_time
  op: ">"
  threshold: 500
  period: 10
  count: 3
  message: "Endpoint {name} response time exceeded 500ms"
# Endpoint error rate
endpoint_error_rate_rule:
  metrics-name: endpoint_error_rate
  op: ">"
  threshold: 5
  period: 10
  count: 3
  message: "Endpoint {name} error rate exceeded 5%"
Database Alerts
# Database response time (also under the same rules: block)
database_resp_time_rule:
  metrics-name: database_access_resp_time
  op: ">"
  threshold: 100
  period: 10
  count: 3
  message: "Database {name} response time exceeded 100ms"
Alert Channels
Webhook Notifications
webhooks:
  - url: http://alert-service/webhook/skywalking
    method: POST
    headers:
      Content-Type: application/json
      Authorization: Bearer ${ALERT_TOKEN}
Slack Notifications
import requests

def send_slack_alert(alert):
    # Incoming-webhook URL for the alerts channel (placeholder)
    webhook_url = "https://hooks.slack.com/services/xxx"
    payload = {
        "channel": "#alerts",
        "username": "SkyWalking Alert",
        "text": f":warning: {alert['message']}",
        "attachments": [{
            "color": "danger",
            "fields": [
                {"title": "Service", "value": alert['service'], "short": True},
                {"title": "Metric", "value": alert['metric'], "short": True},
                {"title": "Value", "value": alert['value'], "short": True},
                {"title": "Time", "value": alert['time'], "short": True}
            ]
        }]
    }
    requests.post(webhook_url, json=payload, timeout=5)
PagerDuty Notifications
def send_pagerduty_alert(alert):
    # PagerDuty Events API v2 endpoint
    url = "https://events.pagerduty.com/v2/enqueue"
    payload = {
        "routing_key": "xxx",  # integration routing key (placeholder)
        "event_action": "trigger",
        "dedup_key": alert['id'],
        "payload": {
            "summary": alert['message'],
            "severity": "critical",
            "source": alert['service'],
            "custom_details": alert
        }
    }
    requests.post(url, json=payload, timeout=5)
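To connect the webhook channel above to these notifiers, a small receiver service can accept the backend's POSTs and fan them out. The sketch below is a minimal example assuming Flask and that the backend delivers a JSON object or list of alarm objects whose fields match the helpers above (message, service, metric, value, time, id); adapt the parsing to the exact payload your backend actually sends.
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook/skywalking", methods=["POST"])
def handle_alert():
    # The tracing backend POSTs alarm data as JSON; normalize to a list
    # and forward each alarm to the channels defined above
    # (send_slack_alert / send_pagerduty_alert are assumed to be in this module).
    alerts = request.get_json(force=True)
    if isinstance(alerts, dict):
        alerts = [alerts]
    for alert in alerts:
        send_slack_alert(alert)
        send_pagerduty_alert(alert)
    return "", 204

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)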
Prometheus Monitoring
Jaeger Metrics
# Prometheus scrape configuration (Jaeger admin/metrics ports)
scrape_configs:
  - job_name: 'jaeger-collector'
    static_configs:
      - targets: ['jaeger-collector:14269']
  - job_name: 'jaeger-query'
    static_configs:
      - targets: ['jaeger-query:16687']
Key Metrics
# Span ingestion rate
rate(jaeger_collector_spans_received_total[5m])
# Span processing latency (P99)
histogram_quantile(0.99, rate(jaeger_collector_spans_received_latency_bucket[5m]))
# Query latency (P99)
histogram_quantile(0.99, rate(jaeger_query_latency_bucket[5m]))
# Storage error rate
rate(jaeger_collector_spans_saved_by_svc_total{result="err"}[5m])
Alerting Rules
groups:
  - name: jaeger-alerts
    rules:
      - alert: JaegerCollectorDown
        expr: up{job="jaeger-collector"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Jaeger Collector is down"
      - alert: HighSpanLatency
        expr: |
          histogram_quantile(0.99,
            rate(jaeger_collector_spans_received_latency_bucket[5m])
          ) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High span processing latency"
      - alert: HighStorageErrorRate
        expr: |
          rate(jaeger_collector_spans_saved_by_svc_total{result="err"}[5m])
            / rate(jaeger_collector_spans_saved_by_svc_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High storage error rate"
Grafana Dashboards
Importing Dashboards
Ready-made Jaeger and SkyWalking dashboards from the Grafana dashboard library can be imported through the UI, or a dashboard JSON (such as the custom one in the next subsection) can be pushed through Grafana's HTTP API.
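A minimal sketch of the API route, assuming a Grafana instance at http://grafana:3000 and a service-account token in the GRAFANA_TOKEN environment variable (both hypothetical):
import json
import os

import requests

def import_dashboard(path):
    # Load a dashboard JSON file and push it via Grafana's dashboard API.
    with open(path) as f:
        dashboard = json.load(f)
    payload = {
        # Accept either a bare dashboard object or one wrapped in {"dashboard": ...}
        "dashboard": dashboard.get("dashboard", dashboard),
        "overwrite": True,
    }
    resp = requests.post(
        "http://grafana:3000/api/dashboards/db",
        json=payload,
        headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(import_dashboard("tracing-overview.json"))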
Custom Dashboard
{
  "dashboard": {
    "title": "Tracing Overview",
    "panels": [
      {
        "title": "Span Ingestion Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(jaeger_collector_spans_received_total[5m])",
            "legendFormat": "{{service}}"
          }
        ]
      },
      {
        "title": "Query Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(jaeger_query_latency_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      }
    ]
  }
}
Operational Monitoring
Health Checks
# Jaeger Collector health check
curl http://jaeger-collector:14269/health
# Jaeger Query health check
curl http://jaeger-query:16687/health
# SkyWalking OAP health check
curl http://skywalking-oap:12800/healthcheck
Automation Script
#!/usr/bin/env python3
import requests

def check_jaeger_health():
    # Probe the Jaeger admin endpoints and alert on anything unhealthy.
    services = [
        ('collector', 'http://jaeger-collector:14269/health'),
        ('query', 'http://jaeger-query:16687/health')
    ]
    for name, url in services:
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code != 200:
                send_alert(f"Jaeger {name} unhealthy: {resp.status_code}")
        except Exception as e:
            send_alert(f"Jaeger {name} unreachable: {e}")

def check_storage_health():
    # Check the Elasticsearch cluster backing the trace storage.
    resp = requests.get('http://elasticsearch:9200/_cluster/health', timeout=5)
    health = resp.json()
    if health['status'] == 'red':
        send_alert("Elasticsearch cluster is RED")
    elif health['status'] == 'yellow':
        send_alert("Elasticsearch cluster is YELLOW")

def send_alert(message):
    print(f"ALERT: {message}")
    # Forward to Slack/PagerDuty here

if __name__ == "__main__":
    check_jaeger_health()
    check_storage_health()
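The script can then be run on a schedule, for example with cron (the install path below is only an assumption):
# Run the health check every 5 minutes and append output to a log
*/5 * * * * /usr/local/bin/check_tracing_health.py >> /var/log/tracing-health.log 2>&1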
SLO Monitoring
Defining SLOs
# SLO definitions
slos:
  - name: trace_collection
    target: 99.9
    description: "Trace collection availability"
  - name: trace_query
    target: 99.5
    description: "Trace query availability"
  - name: trace_latency
    target: 95
    description: "Trace ingestion latency < 100ms"
Measuring SLOs
# Trace collection availability
sum(rate(jaeger_collector_spans_saved_by_svc_total{result="ok"}[30d]))
/
sum(rate(jaeger_collector_spans_received_total[30d]))
# Trace query availability
sum(rate(jaeger_query_requests_total{status="ok"}[30d]))
/
sum(rate(jaeger_query_requests_total[30d]))
# Trace latency SLO (share of spans ingested in < 100ms)
sum(rate(jaeger_collector_spans_received_latency_bucket{le="0.1"}[30d]))
/
sum(rate(jaeger_collector_spans_received_latency_count[30d]))
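From these SLIs, the remaining error budget follows directly. A small sketch of the arithmetic, using example numbers rather than measurements from a real system:
def error_budget_remaining(sli, slo_target):
    # Fraction of the error budget left, given a measured SLI and an SLO
    # target, both expressed in percent.
    budget = 100.0 - slo_target   # allowed failure, e.g. 0.1% for a 99.9% target
    consumed = 100.0 - sli        # observed failure
    return max(0.0, 1.0 - consumed / budget)

# Example: 99.95% measured collection availability against the 99.9% target
print(error_budget_remaining(99.95, 99.9))   # 0.5 -> half the budget left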
Summary
Key points for alerting and monitoring:
- Alerting scenarios: services, endpoints, databases
- Alert channels: Webhook, Slack, PagerDuty
- Prometheus: metrics collection and alerting rules
- Grafana: visualization dashboards
- Operational monitoring: health checks and automation
- SLO monitoring: availability and latency
In the next chapter we will look at production practices.