
Chapter 6: Alerting with Alertmanager

What is Alertmanager?

Alertmanager is the alert-management component of the Prometheus ecosystem. It receives alerts fired by Prometheus and handles deduplication, grouping, routing, and notification delivery.

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Prometheus  │────▶│Alertmanager │────▶│  Receivers  │
│(alert rules)│     │ (processing)│     │ (channels)  │
└─────────────┘     └─────────────┘     └─────────────┘
                    ┌──────┴──────┐
                    │             │
                    ▼             ▼
              ┌─────────┐   ┌─────────┐
              │ Silence │   │ Inhibit │
              └─────────┘   └─────────┘

Installing Alertmanager

Docker installation

docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager

Docker Compose

version: '3.8'

services:
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped

volumes:
  alertmanager-data:

Configuration File in Detail

Basic structure of alertmanager.yml

# Global configuration
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'

# Notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Routing tree
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

# Receivers
receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

# Inhibition rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

Global Configuration

global:
  # Declare an alert resolved if it is not refreshed within this time
  resolve_timeout: 5m

  # SMTP settings
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'
  smtp_require_tls: true

  # Slack settings
  slack_api_url: 'https://hooks.slack.com/services/xxx'

  # PagerDuty settings
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

  # OpsGenie settings
  opsgenie_api_key: 'xxx'
  opsgenie_api_url: 'https://api.opsgenie.com/'

  # WeChat settings
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'xxx'
  wechat_api_corp_id: 'xxx'

Route Configuration

route:
  # Default receiver
  receiver: 'default'

  # Labels used to group alerts into one notification
  group_by: ['alertname', 'severity']

  # How long to wait for more alerts before sending a new group's first notification
  group_wait: 30s

  # How long to wait before notifying about new alerts added to an existing group
  group_interval: 5m

  # How long to wait before repeating a notification for still-firing alerts
  repeat_interval: 4h

  # Child routes
  routes:
    # Critical alerts -> notify immediately
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h
      continue: true

    # Warnings -> email
    - match:
        severity: warning
      receiver: 'warning'
      group_wait: 5m

    # Regex matching
    - match_re:
        alertname: ^(NodeDown|HighCPUUsage)$
      receiver: 'infrastructure'

    # Match by team label
    - match:
        team: frontend
      receiver: 'frontend-team'
    - match:
        team: backend
      receiver: 'backend-team'
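The matching walk above can be sketched in a few lines of Python. This is a simplified model of the routing tree (one level of child routes, first match wins, `continue: true` lets later siblings also match), not the real implementation:

```python
import re

# Simplified model of Alertmanager's routing tree (one level of child
# routes). The first matching route receives the alert; `continue: true`
# lets the walk also try the remaining siblings. If nothing matches,
# the root receiver is used.
def matched_receivers(routes, labels, default_receiver):
    receivers = []
    for route in routes:
        exact_ok = all(labels.get(k) == v
                       for k, v in route.get("match", {}).items())
        regex_ok = all(re.fullmatch(p, labels.get(k, "")) is not None
                       for k, p in route.get("match_re", {}).items())
        if exact_ok and regex_ok:
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                return receivers
    return receivers or [default_receiver]

# Child routes mirroring the YAML above.
routes = [
    {"match": {"severity": "critical"}, "receiver": "critical", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "warning"},
    {"match_re": {"alertname": "^(NodeDown|HighCPUUsage)$"}, "receiver": "infrastructure"},
    {"match": {"team": "frontend"}, "receiver": "frontend-team"},
    {"match": {"team": "backend"}, "receiver": "backend-team"},
]

# `continue: true` on the critical route means a critical NodeDown alert
# reaches both the critical and the infrastructure receivers.
print(matched_receivers(routes, {"severity": "critical", "alertname": "NodeDown"}, "default"))
print(matched_receivers(routes, {"team": "frontend"}, "default"))
print(matched_receivers(routes, {"severity": "info"}, "default"))
```

In real Alertmanager the tree can nest arbitrarily deep, and newer versions prefer the unified `matchers` syntax over `match`/`match_re`.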

Receiver Configuration

receivers:
  # Default receiver
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

  # Receiver for critical alerts
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical

  # Receiver for warnings
  - name: 'warning'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true

  # Webhook receiver
  - name: 'webhook'
    webhook_configs:
      - url: 'http://webhook.example.com/alert'
        send_resolved: true

  # WeChat receiver
  - name: 'wechat'
    wechat_configs:
      - corp_id: 'xxx'
        to_party: '1'
        agent_id: 'xxx'
        api_secret: 'xxx'
        message: '{{ .Status }}: {{ .CommonAnnotations.summary }}'
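For the webhook receiver, Alertmanager POSTs a JSON document to the configured URL. A minimal sketch of consuming that payload on the receiving side (the field names follow the documented webhook format, version "4"; the sample values are made up):

```python
import json

# Sample of the JSON body Alertmanager POSTs to a webhook receiver
# (webhook payload version "4"; all values here are made up).
payload = json.loads("""
{
  "version": "4",
  "status": "firing",
  "receiver": "webhook",
  "groupLabels": {"alertname": "HighCPUUsage"},
  "commonLabels": {"alertname": "HighCPUUsage", "severity": "warning"},
  "commonAnnotations": {"summary": "High CPU usage on node-1"},
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "HighCPUUsage", "instance": "node-1:9100"},
      "annotations": {"description": "CPU usage is 92%"},
      "startsAt": "2024-01-01T00:00:00Z"
    }
  ]
}
""")

def summarize(payload):
    """Render one human-readable line per alert in a webhook payload."""
    return [
        "[{}] {} on {}".format(
            alert["status"].upper(),
            alert["labels"]["alertname"],
            alert["labels"].get("instance", "unknown"),
        )
        for alert in payload["alerts"]
    ]

print(summarize(payload))  # one line per alert in the group
```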

Notification Templates

Custom templates

# templates/default.tmpl
{{ define "email.default.subject" }}
[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}
{{ end }}

{{ define "email.default.html" }}
<!DOCTYPE html>
<html>
<head>
  <style>
    .critical { color: red; }
    .warning { color: orange; }
    .resolved { color: green; }
  </style>
</head>
<body>
  <h2>Alert Status: <span class="{{ .Status }}">{{ .Status | toUpper }}</span></h2>

  {{ range .Alerts }}
  <h3>{{ .Annotations.summary }}</h3>
  <p>{{ .Annotations.description }}</p>
  <ul>
    <li><strong>Starts At:</strong> {{ .StartsAt }}</li>
    <li><strong>Ends At:</strong> {{ .EndsAt }}</li>
    {{ range .Labels.SortedPairs }}
    <li><strong>{{ .Name }}:</strong> {{ .Value }}</li>
    {{ end }}
  </ul>
  {{ end }}
</body>
</html>
{{ end }}

{{ define "slack.default.title" }}
{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}
{{ end }}

{{ define "slack.default.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Details:*
{{ range .Labels.SortedPairs }}  • *{{ .Name }}:* {{ .Value }}
{{ end }}
{{ end }}
{{ end }}

Using the templates

receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        html: '{{ template "email.default.html" . }}'
        subject: '{{ template "email.default.subject" . }}'

Inhibition Rules

inhibit_rules:
  # Critical alerts suppress matching warnings
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

  # A down node suppresses its other alerts
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: '.*'
    equal: ['instance']

  # The master suppresses its slaves
  - source_match:
      role: 'master'
    target_match:
      role: 'slave'
    equal: ['service']
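The effect of an inhibit rule can be illustrated with a toy model: a firing alert that matches source_match suppresses any alert matching target_match, provided every label listed in equal has the same value on both alerts. A simplified Python sketch (the real matcher also supports regex variants and handles self-inhibition edge cases differently):

```python
# Toy model of an inhibit rule: a firing alert matching source_match
# suppresses alerts matching target_match when all `equal` labels agree.
# (Real Alertmanager also supports source_match_re/target_match_re.)
def is_inhibited(target, firing_alerts, rule):
    for source in firing_alerts:
        if source is target:
            continue  # an alert does not inhibit itself
        if (all(source.get(k) == v for k, v in rule["source_match"].items())
                and all(target.get(k) == v for k, v in rule["target_match"].items())
                and all(source.get(k) == target.get(k) for k in rule["equal"])):
            return True
    return False

rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}

critical = {"alertname": "HighCPUUsage", "instance": "node-1", "severity": "critical"}
warning_same = {"alertname": "HighCPUUsage", "instance": "node-1", "severity": "warning"}
warning_other = {"alertname": "HighCPUUsage", "instance": "node-2", "severity": "warning"}

print(is_inhibited(warning_same, [critical], rule))   # True: same alertname and instance
print(is_inhibited(warning_other, [critical], rule))  # False: different instance
```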

Silences

Creating via the Web UI

  1. Open http://alertmanager:9093
  2. Click "Silences"
  3. Click "New Silence"
  4. Fill in the matchers and duration
  5. Click "Create"

Creating via the API

# Create a silence
curl -X POST http://localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
  "matchers": [
    {
      "name": "alertname",
      "value": "HighCPUUsage",
      "isRegex": false
    }
  ],
  "startsAt": "2024-01-01T00:00:00Z",
  "endsAt": "2024-01-01T01:00:00Z",
  "createdBy": "admin",
  "comment": "Planned maintenance"
}'

# List silences
curl http://localhost:9093/api/v2/silences

# Delete a silence
curl -X DELETE http://localhost:9093/api/v2/silence/<silence_id>
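The silence payload can also be built programmatically. A small sketch using only the Python standard library; the actual HTTP call is left commented out so the script runs without a live Alertmanager (the endpoint URL matches the curl example above):

```python
import json
from datetime import datetime, timedelta, timezone

def build_silence(name, value, hours, created_by, comment):
    """Build a payload for POST /api/v2/silences."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [{"name": name, "value": value, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }

silence = build_silence("alertname", "HighCPUUsage", 1, "admin", "Planned maintenance")
print(json.dumps(silence, indent=2))

# To actually create the silence (requires a running Alertmanager):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:9093/api/v2/silences",
#     data=json.dumps(silence).encode(),
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
```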

Creating with amtool

# Install amtool (shipped in the Alertmanager release tarball)
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
cp alertmanager-0.26.0.linux-amd64/amtool /usr/local/bin/

# Create a silence
amtool silence add alertname=HighCPUUsage --duration=1h --comment="Maintenance" --alertmanager.url=http://localhost:9093

# List silences
amtool silence query

# Expire (delete) a silence
amtool silence expire <silence_id>

Prometheus Integration

prometheus.yml configuration

# Alerting rule files
rule_files:
  - "alerts/*.yml"

# Alertmanager targets
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093
      # Or use service discovery instead:
      # kubernetes_sd_configs:
      #   - role: endpoints
      #     namespaces:
      #       names:
      #         - monitoring

Example alerting rules

# alerts/node.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value }}% free space"
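Note the role of `for: 5m` in these rules: the alert sits in a pending state while its expression is true and only starts firing once the expression has held for the full duration. A toy Python state machine illustrating the transitions (evaluation results are synthetic):

```python
# Toy model of the `for:` clause: an alert is "pending" while its
# expression holds and becomes "firing" only once it has held
# continuously for the configured duration.
def alert_states(samples, for_seconds, step_seconds):
    """samples: truth value of the alert expression at each evaluation."""
    states, active_at = [], None
    for i, expr_true in enumerate(samples):
        t = i * step_seconds
        if not expr_true:
            active_at = None
            states.append("inactive")
        else:
            if active_at is None:
                active_at = t  # remember when the expression first held
            states.append("firing" if t - active_at >= for_seconds else "pending")
    return states

# `up == 0` holds for 7 one-minute evaluations, then the target recovers:
# five "pending" evaluations, firing from the sixth, inactive at the end.
print(alert_states([1, 1, 1, 1, 1, 1, 1, 0], for_seconds=300, step_seconds=60))
```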

High Availability

Cluster mode

All instances share the same alertmanager.yml; there is no cluster section in the configuration file. Clustering is enabled entirely through command-line flags: --cluster.listen-address sets the gossip address, and one --cluster.peer flag is passed per peer.

# alertmanager.yml (identical on every instance)
global:
  resolve_timeout: 5m

route:
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

# Cluster flags (on the command line, not in the config file)
alertmanager \
  --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-2:9094 \
  --cluster.peer=alertmanager-3:9094

Docker Compose cluster

version: '3.8'

services:
  alertmanager-1:
    image: prom/alertmanager:latest
    container_name: alertmanager-1
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-2:9094'
      - '--cluster.peer=alertmanager-3:9094'

  alertmanager-2:
    image: prom/alertmanager:latest
    container_name: alertmanager-2
    ports:
      - "9094:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-3:9094'

  alertmanager-3:
    image: prom/alertmanager:latest
    container_name: alertmanager-3
    ports:
      - "9095:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-2:9094'
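With a cluster like this, Prometheus should be pointed at every Alertmanager instance rather than at a load balancer; the peers gossip silences and notification state among themselves and deduplicate on their own. A sketch, assuming the service names from the compose file above:

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093
            - alertmanager-3:9093
```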

API Endpoints

# List alerts
curl http://localhost:9093/api/v2/alerts

# Show status
curl http://localhost:9093/api/v2/status

# List receivers
curl http://localhost:9093/api/v2/receivers

# Fire a test alert
amtool alert add alertname=TestAlert severity=warning --alertmanager.url=http://localhost:9093

Summary

In this chapter we covered:

  • ✅ Alertmanager architecture and installation
  • ✅ The configuration file in detail
  • ✅ Routes and receivers
  • ✅ Notification templates
  • ✅ Inhibition and silences
  • ✅ High availability

Conclusion

Across these six chapters you have learned:

  1. Prometheus basics - concepts, architecture, installation
  2. Prometheus configuration - service discovery, alerting rules
  3. PromQL - the query language and its functions
  4. Exporters - collecting metrics
  5. Grafana - visualization
  6. Alertmanager - alert management

Next up: the Helm tutorial - package management for Kubernetes.