Chapter 7: Monitoring and Operations
Monitoring Metrics
Core Metrics
| Metric | Description | Alert threshold |
|---|---|---|
| broker_runtime | Broker uptime | - |
| broker_fast_failed_num | Number of fast-failed requests | > 100 |
| broker_put_num_today | Messages written today | - |
| broker_get_num_today | Messages read today | - |
| broker_commitlog_max_offset | Maximum CommitLog offset | - |
| broker_commitlog_min_offset | Minimum CommitLog offset | - |
JVM Metrics
| Metric | Description | Alert threshold |
|---|---|---|
| heap_used | Heap memory usage | > 80% |
| gc_pause_seconds | GC pause time | P99 > 100 ms |
| thread_count | Thread count | > 1000 |
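The thresholds in the two tables above can be expressed as a small check. The following is a minimal sketch, assuming the metric values have already been collected into a record; the `BrokerSample` type, its field names, and `breached_thresholds` are hypothetical and not part of RocketMQ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BrokerSample:
    """One snapshot of the metrics that have alert thresholds (hypothetical type)."""
    fast_failed_num: int          # broker_fast_failed_num
    heap_used_ratio: float        # heap_used as a fraction of max heap, 0.0 to 1.0
    gc_pause_p99_seconds: float   # P99 of gc_pause_seconds
    thread_count: int             # thread_count


def breached_thresholds(s: BrokerSample) -> List[str]:
    """Return the names of the metrics whose table threshold is exceeded."""
    breached = []
    if s.fast_failed_num > 100:
        breached.append("broker_fast_failed_num")
    if s.heap_used_ratio > 0.80:
        breached.append("heap_used")
    if s.gc_pause_p99_seconds > 0.100:   # 100 ms
        breached.append("gc_pause_seconds")
    if s.thread_count > 1000:
        breached.append("thread_count")
    return breached
```

In practice these comparisons usually live in Prometheus alert rules rather than application code; the sketch only makes the table's semantics explicit.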
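Prometheus Monitoring
Prometheus Configuration
# Note: 10911 is the Broker's default remoting port. Each target must point at
# the host and port that actually serves /metrics (for example 5557, the default
# port of rocketmq-exporter).
scrape_configs:
  - job_name: 'rocketmq'
    static_configs:
      - targets:
          - 'broker1:10911'
          - 'broker2:10911'
    metrics_path: /metrics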
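To see what Prometheus reads from a `/metrics` endpoint, it can help to parse the text exposition format by hand. The sketch below is a simplified parser, assuming one sample per line and ignoring `# HELP` and `# TYPE` comments; the function name `parse_metrics` and the label names in the sample text are illustrative assumptions, not actual exporter output.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse 'name{labels} value [timestamp]' lines into a dict keyed by series."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # The last '}' closes the label set; the value follows it.
            key, _, rest = line.rpartition("}")
            key += "}"
        else:
            key, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            samples[key] = float(fields[0])  # an optional timestamp is ignored
    return samples


# Hypothetical sample in Prometheus text exposition format.
SAMPLE = """\
# TYPE rocketmq_broker_put_num_today gauge
rocketmq_broker_put_num_today{broker="broker1"} 12345
rocketmq_consumer_lag{group="ConsumerGroup",topic="TopicTest"} 42
process_threads 87
"""

metrics = parse_metrics(SAMPLE)
```

Running a parser like this against the real endpoint is a quick way to confirm that the scrape target and `metrics_path` are correct before adding them to Prometheus.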
Grafana Dashboard
{
  "dashboard": {
    "title": "RocketMQ Dashboard",
    "panels": [
      {
        "title": "Broker TPS",
        "targets": [
          {
            "expr": "rate(rocketmq_broker_put_num_today[5m])"
          }
        ]
      },
      {
        "title": "Message Accumulation",
        "targets": [
          {
            "expr": "rocketmq_consumer_lag"
          }
        ]
      }
    ]
  }
}
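The "Broker TPS" panel relies on `rate()` to turn a growing count into a per-second value. The sketch below is a simplified model of that calculation, assuming timestamped samples of a counter; it handles counter resets (relevant here because a "today" count starts over each day) but omits the window-boundary extrapolation that Prometheus itself performs. The function name `simple_rate` is hypothetical.

```python
from typing import List, Optional, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Per-second increase over (timestamp_seconds, value) samples, oldest first."""
    if len(samples) < 2:
        return None  # at least two points are needed
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The value dropped, so the counter was reset; everything
            # counted since the reset is new.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# 1200 messages written over 120 seconds
tps = simple_rate([(0, 100), (60, 700), (120, 1300)])
```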
Alert Rules
groups:
  - name: rocketmq-alerts
    rules:
      - alert: RocketMQBrokerDown
        expr: up{job="rocketmq"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "RocketMQ Broker is down"
      - alert: RocketMQHighHeapUsage
        expr: jvm_heap_used_bytes / jvm_heap_max_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RocketMQ Broker high heap usage"
      - alert: RocketMQMessageAccumulation
        expr: rocketmq_consumer_lag > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RocketMQ message accumulation"
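Each rule above combines an expression with a `for:` duration, so the alert fires only after the condition has held continuously for that long. The following is a rough model of that behavior, assuming evenly spaced evaluations; `alert_states` is a hypothetical helper, not Prometheus code.

```python
from typing import List, Tuple


def alert_states(samples: List[Tuple[float, float]],
                 threshold: float,
                 for_seconds: float) -> List[str]:
    """Return 'inactive', 'pending' or 'firing' for each (timestamp, value) sample."""
    states = []
    active_since = None  # time at which the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any dip below the threshold restarts the timer
            states.append("inactive")
    return states


# rocketmq_consumer_lag evaluated once per minute against "> 10000" with "for: 5m"
lag = [(0, 5000), (60, 12000), (120, 9000), (180, 11000),
       (240, 11000), (300, 11000), (360, 11000), (420, 11000), (480, 11000)]
states = alert_states(lag, threshold=10000, for_seconds=300)
```

The short spike at 60 s never leaves the pending state, which is the purpose of `for:`: it filters out brief bursts of accumulation.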
Operations Commands
Cluster Status
# Show cluster information
./mqadmin clusterList -n localhost:9876
# List topics
./mqadmin topicList -n localhost:9876
# Show the routing information of a topic
./mqadmin topicRoute -n localhost:9876 -t TopicTest
# Show the statistics of a topic
./mqadmin topicStatus -n localhost:9876 -t TopicTest
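Consumer Management
# List all consumer groups with their progress (omit -g)
./mqadmin consumerProgress -n localhost:9876
# Show the consumption progress of one group
./mqadmin consumerProgress -n localhost:9876 -g ConsumerGroup
# Reset the consumption offset to the current time
./mqadmin resetOffsetByTime -n localhost:9876 -g ConsumerGroup -t TopicTest -s now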
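Consumption progress is reported per queue as a broker offset, a consumer offset, and their difference; the total difference is the message backlog. The sketch below illustrates the arithmetic, assuming the offsets are already available as dictionaries keyed by queue ID. The function name and the choice to treat a missing consumer offset as 0 are assumptions made for illustration.

```python
from typing import Dict, Tuple


def consumer_lag(broker_offsets: Dict[int, int],
                 consumer_offsets: Dict[int, int]) -> Tuple[Dict[int, int], int]:
    """Return (per-queue lag, total lag) for one consumer group on one topic."""
    per_queue = {}
    for queue_id, broker_offset in broker_offsets.items():
        # Assumption: a queue the group has never consumed counts from offset 0.
        consumed = consumer_offsets.get(queue_id, 0)
        per_queue[queue_id] = max(broker_offset - consumed, 0)
    return per_queue, sum(per_queue.values())


per_queue, total = consumer_lag(
    broker_offsets={0: 1500, 1: 1480, 2: 1510, 3: 1495},
    consumer_offsets={0: 1500, 1: 1400, 2: 1010, 3: 1495},
)
```

A lag concentrated in one queue, as in queue 2 here, often points at a single slow or stuck consumer instance rather than insufficient overall capacity.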
Message Queries
# Query messages by key
./mqadmin queryMsgByKey -n localhost:9876 -t TopicTest -k OrderID_123
# Query a message by message ID
./mqadmin queryMsgById -n localhost:9876 -i msgId
# View the message trace (message tracing must be enabled)
./mqadmin queryMsgTraceById -n localhost:9876 -i msgId
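When these commands are run from scripts, it can be convenient to assemble the argument list in one place. This is a minimal sketch, assuming `./mqadmin` is in the current directory and that every option is a single-letter flag followed by a value, as in the commands above; `mqadmin_args` and `run_mqadmin` are hypothetical helper names.

```python
import subprocess
from typing import List


def mqadmin_args(command: str, namesrv: str = "localhost:9876", **options: str) -> List[str]:
    """Build the argv list for an mqadmin sub-command."""
    args = ["./mqadmin", command, "-n", namesrv]
    for flag, value in options.items():
        args += [f"-{flag}", str(value)]
    return args


def run_mqadmin(command: str, **kwargs: str) -> str:
    """Run mqadmin and return its standard output; raises if the exit code is non-zero."""
    result = subprocess.run(mqadmin_args(command, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Same command line as the queryMsgByKey example above
args = mqadmin_args("queryMsgByKey", t="TopicTest", k="OrderID_123")
```

Passing the arguments as a list avoids shell quoting problems when keys or topic names contain special characters.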
Log Management
Log Configuration
# Log level
logging.level = INFO
# Log directory
logging.path = /home/rocketmq/logs
# Number of days to retain logs
logging.retention.days = 7
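A retention setting of this kind amounts to comparing each log file's modification time with a cutoff. The sketch below only selects the files that fall outside the retention window, assuming the file list and modification times are supplied by the caller; `expired_logs` is a hypothetical helper and does not delete anything.

```python
import time
from typing import List, Optional, Tuple


def expired_logs(files: List[Tuple[str, float]],
                 retention_days: int = 7,
                 now: Optional[float] = None) -> List[str]:
    """Return paths whose modification time is older than the retention window.

    files: (path, mtime_in_epoch_seconds) pairs.
    """
    current = time.time() if now is None else now
    cutoff = current - retention_days * 86400  # 86400 seconds per day
    return [path for path, mtime in files if mtime < cutoff]
```

The `(path, mtime)` pairs could be produced with `os.scandir` over the log directory, using `entry.stat().st_mtime` for each file.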
Log Analysis
# Show error log entries
grep -i error /home/rocketmq/logs/broker.log
# Count message send failures
grep "SEND_FAILED" /home/rocketmq/logs/broker.log | wc -l
# List request costs in ascending order to find slow requests
grep "cost:" /home/rocketmq/logs/broker.log | awk -F'cost:' '{print $2}' | sort -n
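The same three checks can be done in a single pass in Python when the results need further processing. This is a sketch under the assumption that slow-request lines contain `cost:` followed by an integer; the sample lines are invented for illustration and do not reproduce the actual Broker log format.

```python
import re
from typing import Dict, Iterable, List

COST_RE = re.compile(r"cost:\s*(\d+)")


def summarize_broker_log(lines: Iterable[str]) -> Dict[str, object]:
    """Count error lines and SEND_FAILED lines, and collect sorted cost values."""
    error_lines = 0
    send_failed = 0
    costs: List[int] = []
    for line in lines:
        if "error" in line.lower():      # same as grep -i error
            error_lines += 1
        if "SEND_FAILED" in line:        # same as grep "SEND_FAILED" | wc -l
            send_failed += 1
        match = COST_RE.search(line)
        if match:
            costs.append(int(match.group(1)))
    return {"error_lines": error_lines,
            "send_failed": send_failed,
            "costs_sorted": sorted(costs)}


# Hypothetical log lines
sample = [
    "2024-01-01 10:00:00 INFO request handled cost: 12",
    "2024-01-01 10:00:01 ERROR sendMessage SEND_FAILED topic=TopicTest",
    "2024-01-01 10:00:02 WARN slow request cost: 850",
    "2024-01-01 10:00:03 INFO request handled cost: 3",
]
summary = summarize_broker_log(sample)
```

For a real file, pass an open file object, for example `summarize_broker_log(open("/home/rocketmq/logs/broker.log", encoding="utf-8"))`.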
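Summary
Key points for monitoring and operations:
- Monitoring metrics: Broker, JVM, message accumulation
- Alert rules: Broker down, high heap usage, message accumulation
- Operations commands: cluster status, consumer management, message queries
- Log management: log configuration, log analysis
The next chapter covers production practices.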