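Chapter 6: Alertmanager Alerting
What is Alertmanager?
Alertmanager is the alert-handling component of the Prometheus ecosystem. It receives the alerts that Prometheus sends, then deduplicates, groups, and routes them before dispatching notifications.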
Architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Prometheus  │────▶│Alertmanager │────▶│  Receivers  │
│(alert rules)│     │ (handling)  │     │ (channels)  │
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                    ┌──────┴──────┐
                    │             │
                    ▼             ▼
               ┌─────────┐   ┌─────────┐
               │ Silence │   │ Inhibit │
               └─────────┘   └─────────┘
Installing Alertmanager
Docker
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v "$(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
  prom/alertmanager
Docker Compose
version: '3.8'
services:
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped
volumes:
  alertmanager-data:
Configuration File in Detail
Basic structure of alertmanager.yml
# Global settings
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'
# Notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'
# Routing tree
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
# Receivers
receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
# Inhibition rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
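The group_by setting in the structure above decides which alerts end up in the same notification. As a rough illustration (a simplified model, not Alertmanager's actual implementation), the sketch below buckets alerts by the values of their group_by labels. The function name `group_alerts` and the sample alerts are made up for this example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (plain label dicts) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Assumption of this sketch: a missing label counts as an empty value.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical sample alerts, described only by their labels.
alerts = [
    {"alertname": "HighCPUUsage", "severity": "warning", "instance": "node-1"},
    {"alertname": "HighCPUUsage", "severity": "warning", "instance": "node-2"},
    {"alertname": "NodeDown", "severity": "critical", "instance": "node-3"},
]

for key, members in group_alerts(alerts, ["alertname", "severity"]).items():
    print(key, [a["instance"] for a in members])
```

With group_by set to alertname and severity, the two HighCPUUsage warnings share one group (and so one notification), while NodeDown gets its own.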
Global configuration
global:
  # If an alert carries no end time and stops being updated,
  # it is declared resolved after this long
  resolve_timeout: 5m
  # SMTP settings
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'
  smtp_require_tls: true
  # Slack settings
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  # PagerDuty settings
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  # OpsGenie settings
  opsgenie_api_key: 'xxx'
  opsgenie_api_url: 'https://api.opsgenie.com/'
  # WeChat Work settings
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'xxx'
  wechat_api_corp_id: 'xxx'
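Route configuration
route:
  # Default receiver for alerts that match no child route
  receiver: 'default'
  # Labels used to group alerts into a single notification
  group_by: ['alertname', 'severity']
  # How long to wait before sending the first notification for a new group,
  # so that alerts arriving together can be batched
  group_wait: 30s
  # How long to wait before notifying about new alerts added to a group
  # that has already been notified
  group_interval: 5m
  # How long to wait before re-sending a notification for alerts that are still firing
  repeat_interval: 4h
  # Child routes, evaluated in order
  # (match and match_re still work but are deprecated since v0.22 in favor of matchers)
  routes:
    # Critical alerts: notify quickly
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h
      # Keep evaluating the sibling routes below after this one matches
      continue: true
    # Warnings: email notification
    - match:
        severity: warning
      receiver: 'warning'
      group_wait: 5m
    # Regular-expression match
    - match_re:
        alertname: ^(NodeDown|HighCPUUsage)$
      receiver: 'infrastructure'
    # Label match
    - match:
        team: frontend
      receiver: 'frontend-team'
    - match:
        team: backend
      receiver: 'backend-team'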
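To see how the routing tree above picks receivers, and what continue changes, here is a small Python model of the matching walk. It is a sketch under simplifying assumptions: routes are plain dicts, only match and match_re are handled, and the helper names `node_matches` and `resolve` are invented for illustration. It mirrors the documented behavior: child routes are tried in order, the first matching child wins unless it sets continue, and a node with no matching child handles the alert itself.

```python
import re


def node_matches(route, labels):
    """True if every matcher on this route node matches the alert's labels."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        # Regular expressions are treated as fully anchored.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def resolve(route, labels, inherited=None):
    """Return the receivers an alert is delivered to (depth-first walk)."""
    if not node_matches(route, labels):
        return []
    receiver = route.get("receiver", inherited)
    found = []
    for child in route.get("routes", []):
        hits = resolve(child, labels, receiver)
        found.extend(hits)
        # Stop at the first matching child unless it asks to continue.
        if hits and not child.get("continue", False):
            break
    # No child matched: this node handles the alert itself.
    return found or [receiver]


# The routing tree from this section, written as nested dicts.
tree = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical", "continue": True},
        {"match": {"severity": "warning"}, "receiver": "warning"},
        {"match_re": {"alertname": "NodeDown|HighCPUUsage"}, "receiver": "infrastructure"},
        {"match": {"team": "frontend"}, "receiver": "frontend-team"},
        {"match": {"team": "backend"}, "receiver": "backend-team"},
    ],
}

print(resolve(tree, {"alertname": "NodeDown", "severity": "critical"}))
print(resolve(tree, {"severity": "warning", "team": "frontend"}))
print(resolve(tree, {"team": "backend"}))
```

In this model a critical NodeDown alert reaches both critical and infrastructure because the first route sets continue, whereas a warning from the frontend team stops at warning and never reaches frontend-team.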
Receiver configuration
# Every receiver referenced by a route must be defined in this list
receivers:
  # Default receiver
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
  # Receiver for critical alerts
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
    pagerduty_configs:
      # service_key applies to the "Prometheus" integration type;
      # Events API v2 integrations use routing_key instead
      - service_key: 'xxx'
        severity: critical
  # Receiver for warnings
  - name: 'warning'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
  # Webhook receiver
  - name: 'webhook'
    webhook_configs:
      - url: 'http://webhook.example.com/alert'
        send_resolved: true
  # WeChat Work receiver
  - name: 'wechat'
    wechat_configs:
      - corp_id: 'xxx'
        to_party: '1'
        agent_id: 'xxx'
        api_secret: 'xxx'
        message: '{{ .Status }}: {{ .CommonAnnotations.summary }}'
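A webhook receiver simply gets an HTTP POST with a JSON body describing the alert group. The sketch below is one minimal way to consume it with the Python standard library; it is an illustration, not a production service. The names `summarize`, `AlertHandler`, `serve`, the port 5001, and the sample payload are assumptions made for this example, and the payload shows only a few of the fields Alertmanager sends.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Turn one webhook payload into one log line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        status = alert.get("status", "?")
        lines.append(f"[{status}] {labels.get('alertname', '?')}: {summary}")
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON body sent by Alertmanager.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        # Any 2xx response tells Alertmanager the notification was accepted.
        self.send_response(200)
        self.end_headers()


def serve(port=5001):
    """Start the receiver; call this explicitly, it blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()


# Hypothetical payload, trimmed to the fields used above.
sample = {
    "version": "4",
    "status": "firing",
    "receiver": "webhook",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "NodeDown", "instance": "node-1"},
            "annotations": {"summary": "Node node-1 is down"},
        }
    ],
}
print(summarize(sample))
```

To try it, call `serve()` and point the url of a webhook_configs entry at the host and port where it listens.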
Notification Templates
Custom templates
{{/* templates/default.tmpl */}}
{{ define "email.default.subject" }}[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}{{ end }}
{{ define "email.default.html" }}
<!DOCTYPE html>
<html>
<head>
  <style>
    .firing { color: red; }
    .critical { color: red; }
    .warning { color: orange; }
    .resolved { color: green; }
  </style>
</head>
<body>
  <h2>Alert Status: <span class="{{ .Status }}">{{ .Status | toUpper }}</span></h2>
  {{ range .Alerts }}
  <h3>{{ .Annotations.summary }}</h3>
  <p>{{ .Annotations.description }}</p>
  <ul>
    <li><strong>Starts At:</strong> {{ .StartsAt }}</li>
    <li><strong>Ends At:</strong> {{ .EndsAt }}</li>
    {{ range .Labels.SortedPairs }}
    <li><strong>{{ .Name }}:</strong> {{ .Value }}</li>
    {{ end }}
  </ul>
  {{ end }}
</body>
</html>
{{ end }}
{{ define "slack.default.title" }}{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}{{ end }}
{{ define "slack.default.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Details:*
{{ range .Labels.SortedPairs }} • *{{ .Name }}:* {{ .Value }}
{{ end }}
{{ end }}
{{ end }}
Using templates
# The template file must be listed under templates: in alertmanager.yml
receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        html: '{{ template "email.default.html" . }}'
        subject: '{{ template "email.default.subject" . }}'
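Inhibition Rules
# source_match, target_match and their _re variants still work but are
# deprecated since v0.22 in favor of source_matchers and target_matchers
inhibit_rules:
  # A critical alert suppresses the matching warning
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
  # A node-down alert suppresses every other alert on the same instance
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: '.*'
    equal: ['instance']
  # Alerts on the primary suppress alerts on its replicas
  - source_match:
      role: 'master'
    target_match:
      role: 'slave'
    equal: ['service']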
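The logic behind an inhibition rule can be sketched in a few lines. The model below is deliberately simplified: it handles exact matchers only, skips the regular-expression variants, and uses a plain identity check so that an alert never inhibits itself (Alertmanager's own self-inhibition rule is more detailed). The helper names `matches` and `is_inhibited` and the sample alerts are hypothetical.

```python
def matches(alert, matchers):
    """True if the alert carries every label/value pair in matchers."""
    return all(alert.get(name, "") == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """True if some firing source alert mutes the target under this rule."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # The labels listed under equal must agree between source and target.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


# First rule from this section: critical suppresses warning.
rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}

critical = {"alertname": "DiskSpaceLow", "instance": "node-1", "severity": "critical"}
warning = {"alertname": "DiskSpaceLow", "instance": "node-1", "severity": "warning"}
other = {"alertname": "DiskSpaceLow", "instance": "node-2", "severity": "warning"}
firing = [critical, warning, other]

print(is_inhibited(warning, firing, rule))  # same alertname and instance as the critical alert
print(is_inhibited(other, firing, rule))    # different instance, so not muted
```

The equal list is what keeps the rule narrow: without it, one critical alert anywhere would mute every warning.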
Silences
Creating a silence in the web UI
- Open http://alertmanager:9093
- Click "Silences"
- Click "New Silence"
- Configure the matchers and the duration
- Click "Create"
Creating a silence through the API
# Create a silence (the v2 API expects a JSON body;
# use a time window that has not already ended)
curl -X POST http://localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [
      {
        "name": "alertname",
        "value": "HighCPUUsage",
        "isRegex": false
      }
    ],
    "startsAt": "2024-01-01T00:00:00Z",
    "endsAt": "2024-01-01T01:00:00Z",
    "createdBy": "admin",
    "comment": "Planned maintenance"
  }'
# List silences
curl http://localhost:9093/api/v2/silences
# Delete a silence
curl -X DELETE http://localhost:9093/api/v2/silence/<silence_id>
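Hard-coded timestamps go stale quickly, so a script that computes the window is often more convenient than a hand-written curl call. The following sketch builds the same JSON body as the request above, relative to the current time. `build_silence` and `post_silence` are illustrative helper names, and the base URL is assumed to be the local instance used throughout this chapter.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib import request


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build a silence body starting at `now` (UTC) and lasting `hours`."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": start.strftime(fmt),
        "endsAt": end.strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    }


def post_silence(payload, base_url="http://localhost:9093"):
    """Send the silence to Alertmanager; needs a reachable instance."""
    req = request.Request(
        base_url + "/api/v2/silences",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


payload = build_silence(
    {"alertname": "HighCPUUsage"}, 1, "admin", "Planned maintenance",
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(json.dumps(payload, indent=2))
```

In real use, drop the fixed `now` argument so the window starts at the current time, then pass the result to `post_silence`.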
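Creating a silence with amtool
# Install amtool
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
cp alertmanager-0.26.0.linux-amd64/amtool /usr/local/bin/
# amtool needs the Alertmanager address: pass --alertmanager.url=http://localhost:9093
# on each command or set it in the amtool config file
# Create a silence
amtool silence add alertname=HighCPUUsage --duration=1h --comment="Maintenance"
# List silences
amtool silence query
# Expire (delete) a silence
amtool silence expire <silence_id>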
Prometheus Integration
prometheus.yml configuration
# Alerting rule files
rule_files:
  - "alerts/*.yml"
# Alertmanager targets
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      # Or use service discovery
      # kubernetes_sd_configs:
      #   - role: endpoints
      #     namespaces:
      #       names:
      #         - monitoring
Example alerting rules
# alerts/node.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes."
      # rate() is smoother than irate() and therefore better suited to alerting
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value }}% free space"
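Each rule above uses for: 5m, which means the expression has to stay true for five minutes before the alert is sent to Alertmanager. Until then the alert is only pending. The sketch below models that state machine for a single series, assuming one evaluation per minute. The function name `alert_states` and the sample data are invented for illustration.

```python
def alert_states(samples, for_seconds):
    """Map (timestamp_seconds, condition_is_true) evaluations to alert states."""
    states = []
    active_since = None
    for ts, is_true in samples:
        if not is_true:
            # The condition cleared: the pending timer starts over.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# One evaluation per minute; the condition briefly clears at t=120.
samples = [
    (0, True), (60, True), (120, False),
    (180, True), (240, True), (300, True),
    (360, True), (420, True), (480, True),
]
for (ts, _), state in zip(samples, alert_states(samples, for_seconds=300)):
    print(ts, state)
```

The dip at t=120 resets the timer, so the alert only fires once the condition has held for a full 300 seconds counted from t=180. This is why a for clause filters out short spikes.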
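High Availability
Cluster mode
Alertmanager instances form a cluster over a gossip protocol. Clustering is configured with command-line flags (--cluster.listen-address and --cluster.peer), not in alertmanager.yml, and every instance loads the same configuration file.
# alertmanager.yml (identical on every instance)
global:
  resolve_timeout: 5m
route:
  receiver: 'default'
receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
# Cluster flags for alertmanager-1, passed on the command line
alertmanager \
  --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-2:9094 \
  --cluster.peer=alertmanager-3:9094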
Docker Compose cluster
version: '3.8'
services:
  alertmanager-1:
    image: prom/alertmanager:latest
    container_name: alertmanager-1
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-2:9094'
      - '--cluster.peer=alertmanager-3:9094'
  alertmanager-2:
    image: prom/alertmanager:latest
    container_name: alertmanager-2
    ports:
      # Host port 9094 maps to the container's web port 9093
      - "9094:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-3:9094'
  alertmanager-3:
    image: prom/alertmanager:latest
    container_name: alertmanager-3
    ports:
      # Host port 9095 maps to the container's web port 9093
      - "9095:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-2:9094'
API Endpoints
# List alerts
curl http://localhost:9093/api/v2/alerts
# Show status
curl http://localhost:9093/api/v2/status
# List receivers
curl http://localhost:9093/api/v2/receivers
# Send a test alert to exercise the notification path
amtool alert add alertname=TestAlert severity=warning --alertmanager.url=http://localhost:9093
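The JSON returned by the alerts endpoint is easy to post-process in a script. As a small sketch, the code below fetches the alert list and counts alerts per value of a chosen label. `fetch_alerts` and `count_by` are hypothetical helper names, the base URL is assumed to be the local instance from the commands above, and the sample data contains only the labels field of each alert.

```python
import json
from collections import Counter
from urllib import request


def fetch_alerts(base_url="http://localhost:9093"):
    """Fetch the current alerts; needs a reachable Alertmanager."""
    with request.urlopen(base_url + "/api/v2/alerts") as resp:
        return json.load(resp)


def count_by(alerts, label):
    """Count alerts per value of the given label."""
    return Counter(a.get("labels", {}).get(label, "<none>") for a in alerts)


# Hypothetical response, trimmed to the labels of each alert.
sample_alerts = [
    {"labels": {"alertname": "HighCPUUsage", "severity": "warning"}},
    {"labels": {"alertname": "HighMemoryUsage", "severity": "warning"}},
    {"labels": {"alertname": "NodeDown", "severity": "critical"}},
]
print(count_by(sample_alerts, "severity"))
```

Against a live instance, `count_by(fetch_alerts(), "severity")` gives a quick overview of what is currently active.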
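Summary
This chapter covered:
- ✅ Alertmanager architecture and installation
- ✅ The configuration file in detail
- ✅ Routing and receiver configuration
- ✅ Notification templates
- ✅ Inhibition and silences
- ✅ High-availability setup
Wrap-up
Over these six chapters you have covered:
- Prometheus basics: concepts, architecture, installation
- Prometheus configuration: service discovery, alerting rules
- PromQL: the query language and its functions
- Exporters: configuring data collection
- Grafana: visualization
- Alertmanager: alert management
Next: the Helm tutorial, on Kubernetes package management.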