Chapter 7: Monitoring and Operations

Monitoring Metrics

Core Metrics

Metric                       Description               Alert threshold
broker_runtime               Broker uptime             -
broker_fast_failed_num       Fast-fail count           > 100
broker_put_num_today         Messages written today    -
broker_get_num_today         Messages read today       -
broker_commitlog_max_offset  Maximum CommitLog offset  -
broker_commitlog_min_offset  Minimum CommitLog offset  -

JVM Metrics

Metric            Description       Alert threshold
heap_used         Heap memory used  > 80%
gc_pause_seconds  GC pause time     P99 > 100ms
thread_count      Thread count      > 1000
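
The thresholds in the two tables above can be checked mechanically. The following is a minimal sketch, not part of RocketMQ: `THRESHOLDS` and `breached` are hypothetical names, the sample values are made up, and the units are assumptions (heap usage as a fraction of the maximum heap, GC pause P99 in seconds).

```python
# Alert thresholds taken from the tables above. Metrics whose threshold
# is "-" are not listed. Units are assumptions: heap_used is a fraction
# of max heap, gc_pause_seconds is a P99 expressed in seconds.
THRESHOLDS = {
    "broker_fast_failed_num": 100,
    "heap_used": 0.80,
    "gc_pause_seconds": 0.100,
    "thread_count": 1000,
}


def breached(sample):
    """Return the sorted names of metrics whose value exceeds its threshold."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if name in sample and sample[name] > limit
    )


# Hypothetical sample; real values would come from your metrics pipeline.
sample = {
    "broker_fast_failed_num": 12,
    "heap_used": 0.86,
    "gc_pause_seconds": 0.04,
    "thread_count": 1200,
    "broker_put_num_today": 5_000_000,
}
print(breached(sample))
```

Metrics that have no threshold, such as broker_put_num_today, are ignored by the check and are better suited to dashboards than to alerts.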

Prometheus Monitoring

Prometheus Configuration

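scrape_configs:
  - job_name: 'rocketmq'
    # 10911 is the Broker's default remoting port. Point the targets at the
    # endpoint that actually serves Prometheus metrics in your deployment
    # (for example a rocketmq-exporter instance, which listens on 5557 by default).
    static_configs:
      - targets:
        - 'broker1:10911'
        - 'broker2:10911'
    metrics_path: /metrics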
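
A scrape of `metrics_path` returns plain text in the Prometheus exposition format. The sketch below parses a small hand-written payload to show what Prometheus reads from each target. It is a simplified illustration: `parse_exposition` is a hypothetical helper, the metric names are borrowed from this chapter, and the labels and values are made up.

```python
def parse_exposition(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified: comment lines (# HELP / # TYPE) are skipped, an optional
    trailing timestamp is ignored, and label values are assumed not to
    contain a closing brace.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            end = line.index("}") + 1
            series, rest = line[:end], line[end:]
        else:
            series, _, rest = line.partition(" ")
        samples[series] = float(rest.split()[0])
    return samples


# Hand-written payload for illustration; not captured from a real Broker.
payload = """\
# HELP rocketmq_broker_put_num_today Messages written today
# TYPE rocketmq_broker_put_num_today gauge
rocketmq_broker_put_num_today{broker="broker1"} 120034
rocketmq_consumer_lag{group="ConsumerGroup",topic="TopicTest"} 42 1700000000000
thread_count 312
"""
for series, value in parse_exposition(payload).items():
    print(series, value)
```

If a target answers with anything other than lines of this shape, Prometheus marks the scrape as failed and `up` for that target becomes 0.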

Grafana Dashboard

{
  "dashboard": {
    "title": "RocketMQ Dashboard",
    "panels": [
      {
        "title": "Broker TPS",
        "targets": [
          {
            "expr": "rate(rocketmq_broker_put_num_today[5m])"
          }
        ]
      },
      {
        "title": "Message Accumulation",
        "targets": [
          {
            "expr": "rocketmq_consumer_lag"
          }
        ]
      }
    ]
  }
}

Alerting Rules

groups:
- name: rocketmq-alerts
  rules:
  - alert: RocketMQBrokerDown
    expr: up{job="rocketmq"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "RocketMQ Broker is down"

  - alert: RocketMQHighHeapUsage
    expr: jvm_heap_used_bytes / jvm_heap_max_bytes > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "RocketMQ Broker high heap usage"

  - alert: RocketMQMessageAccumulation
    expr: rocketmq_consumer_lag > 10000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "RocketMQ message accumulation"
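
Each rule above carries a `for:` clause, which means the expression must stay true across consecutive evaluations for that long before the alert fires. Until then the alert is only pending. The following sketch simulates that behaviour for the lag rule. `alert_states` is a hypothetical helper, the readings are made up, and one evaluation per minute is an assumption.

```python
def alert_states(samples, threshold, for_seconds):
    """Yield (timestamp, state) for each (timestamp, value) sample.

    Mirrors the pending -> firing behaviour of a rule with a `for:` clause:
    the condition must hold on every evaluation for at least `for_seconds`.
    """
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            state = "firing" if ts - active_since >= for_seconds else "pending"
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            state = "inactive"
        yield ts, state


# Hypothetical lag readings, one evaluation per minute (timestamps in seconds).
readings = [
    (0, 9000), (60, 12000), (120, 15000), (180, 8000), (240, 11000),
    (300, 11500), (360, 12000), (420, 13000), (480, 14000), (540, 15000),
]
for ts, state in alert_states(readings, threshold=10000, for_seconds=300):
    print(ts, state)
```

The dip at 180 seconds resets the timer, so the alert only fires at 540 seconds, five minutes after the lag crossed the threshold again at 240 seconds. This is why a short `for:` suits hard failures such as a Broker going down, while a longer one filters out brief spikes in accumulation.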

Operations Commands

Cluster Status

# View cluster information
./mqadmin clusterList -n localhost:9876

# List all Topics
./mqadmin topicList -n localhost:9876

# View the routing information for a Topic
./mqadmin topicRoute -n localhost:9876 -t TopicTest

# View per-queue statistics for a Topic
./mqadmin topicStatus -n localhost:9876 -t TopicTest

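Consumer Management

# List all consumer groups with their overall progress (omit -g)
./mqadmin consumerProgress -n localhost:9876

# View consume progress for a single group
./mqadmin consumerProgress -n localhost:9876 -g ConsumerGroup

# Reset consume offsets to a point in time (here: now)
./mqadmin resetOffsetByTime -n localhost:9876 -g ConsumerGroup -t TopicTest -s now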
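
The accumulation reported by consume progress is, for each queue, the gap between the newest offset on the Broker and the offset the consumer group has committed. The sketch below shows that arithmetic only. `QueueProgress` and `lag_by_topic` are hypothetical names and the offsets are made up. It does not parse real `mqadmin` output.

```python
from collections import defaultdict, namedtuple

# One row per message queue. Field names are hypothetical; they mirror the
# broker offset and consumer offset that consume-progress tooling reports.
QueueProgress = namedtuple(
    "QueueProgress", "topic queue_id broker_offset consumer_offset"
)


def lag_by_topic(rows):
    """Sum (broker_offset - consumer_offset) per topic, never below zero."""
    lag = defaultdict(int)
    for row in rows:
        lag[row.topic] += max(row.broker_offset - row.consumer_offset, 0)
    return dict(lag)


# Made-up offsets for illustration only.
rows = [
    QueueProgress("TopicTest", 0, 1500, 1500),
    QueueProgress("TopicTest", 1, 1520, 1400),
    QueueProgress("TopicTest", 2, 1480, 1475),
    QueueProgress("%RETRY%ConsumerGroup", 0, 10, 10),
]
print(lag_by_topic(rows))
```

Resetting the group with `-s now` moves every consumer offset up to the current Broker offset, so the gap drops to zero and the skipped messages are not consumed.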

Message Queries

# Query messages by Key
./mqadmin queryMsgByKey -n localhost:9876 -t TopicTest -k OrderID_123

# Query a message by MsgId
./mqadmin queryMsgById -n localhost:9876 -i msgId

# View the trace of a message
./mqadmin queryMsgTraceById -n localhost:9876 -i msgId

Log Management

Log Configuration

# Log level
logging.level = INFO

# Log directory
logging.path = /home/rocketmq/logs

# Number of days to retain logs
logging.retention.days = 7
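
A retention setting of 7 days means that rolled log files older than a week become candidates for deletion. The sketch below only shows how such a cutoff is computed. `expired_logs` is a hypothetical helper, the file names and timestamps are made up, and it does not describe how RocketMQ itself cleans up logs.

```python
from datetime import datetime, timedelta


def expired_logs(files, now, retention_days=7):
    """Return names of files last modified more than `retention_days` ago.

    `files` maps a file name to its last-modified datetime.
    """
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, mtime in files.items() if mtime < cutoff)


# Hypothetical rolled log files and modification times.
now = datetime(2024, 1, 10, 12, 0)
files = {
    "broker.log": datetime(2024, 1, 10, 11, 59),
    "broker.1.log": datetime(2024, 1, 5, 0, 0),
    "broker.2.log": datetime(2024, 1, 3, 11, 0),
    "broker.3.log": datetime(2023, 12, 30, 0, 0),
}
print(expired_logs(files, now))
```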

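Log Analysis

# Show error lines (case-insensitive)
grep -i error /home/rocketmq/logs/broker.log

# Count lines that report a failed send
grep -c "SEND_FAILED" /home/rocketmq/logs/broker.log

# List request costs in ascending order; the slowest requests appear last
grep "cost:" /home/rocketmq/logs/broker.log | awk -F'cost:' '{print $2}' | sort -n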
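
The same three checks can be combined into a single pass over the log when the results need to feed a script or report. This is a minimal sketch under the same assumption the commands above make, namely that slow-request lines contain `cost:` followed by a number. `summarize` is a hypothetical helper and the log lines are invented for illustration.

```python
import re

# Assumes the cost is an integer that directly follows "cost:".
COST_RE = re.compile(r"cost:\s*(\d+)")


def summarize(lines, top=3):
    """Count error and SEND_FAILED lines and collect the largest costs."""
    errors = 0
    send_failed = 0
    costs = []
    for line in lines:
        if "error" in line.lower():
            errors += 1
        if "SEND_FAILED" in line:
            send_failed += 1
        match = COST_RE.search(line)
        if match:
            costs.append(int(match.group(1)))
    costs.sort(reverse=True)
    return {"errors": errors, "send_failed": send_failed, "slowest": costs[:top]}


# Invented log lines; a real run would iterate over open("broker.log").
lines = [
    "2024-01-01 10:00:00 INFO  put message ok, cost: 3",
    "2024-01-01 10:00:01 WARN  put message slow, cost: 812",
    "2024-01-01 10:00:02 ERROR send result SEND_FAILED, cost: 45",
    "2024-01-01 10:00:03 INFO  heartbeat",
    "2024-01-01 10:00:04 WARN  pull message slow, cost: 1290",
]
print(summarize(lines))
```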

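Summary

Key points for monitoring and operations:

  • Metrics: Broker, JVM, message accumulation
  • Alerting rules: Broker down, high heap usage, message accumulation
  • Operations commands: cluster status, consumer management, message queries
  • Log management: log configuration, log analysis

The next chapter covers production practices.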