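Chapter 5: Log Analysis in Practice
This chapter walks through practical scenarios that show how to analyze logs with the ELK Stack.
Scenario 1: Error Log Analysis
1. Finding Errors
Search for errors in Kibana Discover, for example with a KQL query such as level: "ERROR" (the same level field that the queries in this scenario filter on).
2. Error Trend Analysis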
GET logs-app/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  },
  "aggs": {
    "errors_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h"
      }
    },
    "errors_by_service": {
      "terms": {
        "field": "service.keyword",
        "size": 10
      }
    },
    "errors_by_type": {
      "terms": {
        "field": "error_type.keyword",
        "size": 10
      }
    }
  }
}
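The hourly buckets returned by errors_over_time can be scanned for spikes once the response is available as a Python dict. The following is a minimal sketch, not part of the original workflow: the function name find_error_spikes and the factor of 3 over the median are illustrative assumptions.

```python
from statistics import median


def find_error_spikes(response, factor=3.0):
    """Return (timestamp, count) for hourly buckets well above the median.

    `response` is the parsed JSON of the error trend query; `factor` is a
    hypothetical sensitivity setting, not an Elasticsearch parameter.
    """
    buckets = response["aggregations"]["errors_over_time"]["buckets"]
    counts = [b["doc_count"] for b in buckets]
    if not counts:
        return []
    # Use at least 1 as the baseline so quiet periods do not flag every bucket.
    baseline = max(median(counts), 1)
    return [
        (b["key_as_string"], b["doc_count"])
        for b in buckets
        if b["doc_count"] > factor * baseline
    ]
```

The median is used instead of the mean so that a single large spike does not raise the baseline it is compared against.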
3. Error Clustering
GET logs-app/_search
{
  "size": 0,
  "query": {
    "term": { "level": "ERROR" }
  },
  "aggs": {
    "error_patterns": {
      "terms": {
        "field": "message.keyword",
        "size": 20,
        "min_doc_count": 5
      }
    }
  }
}
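A terms aggregation on message.keyword only groups messages that are character-for-character identical, and with default dynamic mappings the keyword subfield skips values longer than 256 characters. Messages that differ only in an ID or a duration therefore land in separate buckets. One possible workaround, sketched here under the assumption that the raw messages have been fetched client-side, is to mask the variable parts before counting. The placeholder names and regular expressions are illustrative choices.

```python
import re
from collections import Counter

# Illustrative masks, applied in order: UUIDs, IPv4 addresses, then any digits.
_PATTERNS = [
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "<uuid>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\d+"), "<num>"),
]


def normalize(message):
    """Replace variable tokens so similar messages share one template."""
    for pattern, placeholder in _PATTERNS:
        message = pattern.sub(placeholder, message)
    return message


def cluster_messages(messages, min_count=1):
    """Count messages per template, most frequent first."""
    counts = Counter(normalize(m) for m in messages)
    return [(tpl, n) for tpl, n in counts.most_common() if n >= min_count]
```

The min_count argument plays the same role as min_doc_count in the query above.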
4. Correlation Analysis
GET logs-app/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "trace_id": "abc123" } }
      ]
    }
  },
  "sort": [
    { "@timestamp": "asc" }
  ]
}
Scenario 2: Performance Analysis
1. Identifying Slow Requests
GET logs-app/_search
{
  "query": {
    "range": {
      "duration_ms": { "gte": 1000 }
    }
  },
  "sort": [
    { "duration_ms": "desc" }
  ],
  "size": 100
}
2. Response Time Distribution
GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "duration_percentiles": {
      "percentiles": {
        "field": "duration_ms",
        "percents": [50, 75, 90, 95, 99]
      }
    },
    "duration_histogram": {
      "histogram": {
        "field": "duration_ms",
        "interval": 100
      }
    }
  }
}
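3. Comparing Service Performance
Each terms bucket already reports its request count as doc_count. The request_count metric below counts @timestamp values instead of _id, because aggregating on _id is discouraged and disabled by default in recent Elasticsearch versions.
GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "services": {
      "terms": {
        "field": "service.keyword",
        "size": 20
      },
      "aggs": {
        "avg_duration": {
          "avg": { "field": "duration_ms" }
        },
        "p99_duration": {
          "percentiles": {
            "field": "duration_ms",
            "percents": [99]
          }
        },
        "request_count": {
          "value_count": { "field": "@timestamp" }
        }
      }
    }
  }
}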
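To compare services side by side, the nested buckets can be flattened into rows and sorted by P99 latency. This is a small sketch that assumes the parsed response of the query above; rank_services_by_p99 is a hypothetical helper name, and the "99.0" key follows the default keyed format of the percentiles aggregation.

```python
def rank_services_by_p99(response):
    """Flatten per-service buckets into rows, slowest P99 first."""
    rows = []
    for bucket in response["aggregations"]["services"]["buckets"]:
        rows.append({
            "service": bucket["key"],
            # doc_count is the number of log documents in this bucket.
            "requests": bucket["doc_count"],
            "avg_ms": bucket["avg_duration"]["value"],
            "p99_ms": bucket["p99_duration"]["values"]["99.0"],
        })
    # A percentile can be null (None) when no document has the field.
    return sorted(rows, key=lambda r: r["p99_ms"] or 0.0, reverse=True)
```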
4. Performance Trends
GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m"
      },
      "aggs": {
        "p99": {
          "percentiles": {
            "field": "duration_ms",
            "percents": [99]
          }
        }
      }
    }
  }
}
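The 5-minute P99 series can be filtered for windows that exceed a latency limit. The sketch below assumes the parsed response of the trend query; slow_windows and the 1000 ms default are illustrative, with the default chosen to match the slow-request cutoff used earlier in this scenario.

```python
def slow_windows(response, limit_ms=1000.0):
    """Return (window_start, p99) for 5-minute windows above limit_ms."""
    result = []
    for bucket in response["aggregations"]["over_time"]["buckets"]:
        p99 = bucket["p99"]["values"].get("99.0")
        # Empty windows report a null percentile; skip them.
        if p99 is not None and p99 > limit_ms:
            result.append((bucket["key_as_string"], p99))
    return result
```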
Scenario 3: User Behavior Analysis
1. User Access Paths
GET logs-app/_search
{
  "query": {
    "term": { "user_id": "user-001" }
  },
  "sort": [
    { "@timestamp": "asc" }
  ],
  "_source": ["@timestamp", "action", "path", "duration_ms"]
}
2. Most Requested Endpoints
GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "top_paths": {
      "terms": {
        "field": "path.keyword",
        "size": 20
      },
      "aggs": {
        "avg_duration": {
          "avg": { "field": "duration_ms" }
        }
      }
    }
  }
}
3. User Activity
GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "active_users": {
      "cardinality": {
        "field": "user_id"
      }
    },
    "users_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h"
      },
      "aggs": {
        "unique_users": {
          "cardinality": {
            "field": "user_id"
          }
        }
      }
    }
  }
}
4. Geographic Distribution
GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "geo_distribution": {
      "terms": {
        "field": "geoip_country.keyword",
        "size": 20
      }
    }
  }
}
Scenario 4: Security Auditing
1. Failed Login Analysis
GET logs-app/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "action": "login" } },
        { "term": { "status": "failed" } }
      ]
    }
  },
  "aggs": {
    "by_ip": {
      "terms": {
        "field": "client_ip.keyword",
        "size": 20
      }
    },
    "by_user": {
      "terms": {
        "field": "username.keyword",
        "size": 20
      }
    }
  }
}
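The by_ip buckets can be filtered for sources with an unusually high number of failed logins, a common sign of brute-force attempts. This is a minimal sketch over the parsed response; the helper name suspicious_ips and the threshold of 10 failures are assumptions to tune for the environment.

```python
def suspicious_ips(response, threshold=10):
    """Return (ip, failures) for IPs with at least `threshold` failed logins."""
    buckets = response["aggregations"]["by_ip"]["buckets"]
    # Terms buckets arrive sorted by doc_count descending, so order is kept.
    return [
        (b["key"], b["doc_count"])
        for b in buckets
        if b["doc_count"] >= threshold
    ]
```

The same filter applies to the by_user buckets to spot accounts that are being targeted from many addresses.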
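2. Detecting Abnormal Access
A rate aggregation is only valid inside a date_histogram (or composite) aggregation, and it takes a unit rather than an interval. The query below therefore buckets each IP's requests per minute, and the one-hour range keeps the number of buckets bounded.
GET logs-app/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-1h" } }
  },
  "aggs": {
    "ips": {
      "terms": {
        "field": "client_ip.keyword",
        "size": 100,
        "order": { "_count": "desc" }
      },
      "aggs": {
        "per_minute": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "1m"
          },
          "aggs": {
            "request_rate": {
              "rate": { "unit": "minute" }
            }
          }
        }
      }
    }
  }
}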
3. Auditing Sensitive Operations
GET logs-app/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "action": "delete" } },
        { "term": { "action": "update_permissions" } },
        { "term": { "action": "export_data" } }
      ]
    }
  },
  "sort": [
    { "@timestamp": "desc" }
  ]
}
Scenario 5: Capacity Planning
1. Storage Growth Trend
2. Index Size Analysis
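Index sizes can be read from the cat indices API, for example GET _cat/indices/logs-*?format=json&bytes=b&h=index,store.size, which returns one JSON object per index with sizes as byte counts. The sketch below summarizes such a result; the logs-* pattern and the helper name summarize_indices are assumptions, and the input is taken to be the already-parsed JSON list.

```python
def summarize_indices(rows, top=5):
    """Total the store size and list the largest indices.

    `rows` is the parsed JSON list from the cat indices API, requested with
    bytes=b so that "store.size" is a plain number of bytes (as a string).
    """
    # store.size can be null, for example for closed indices.
    sized = [(row["index"], int(row["store.size"] or 0)) for row in rows]
    sized.sort(key=lambda item: item[1], reverse=True)
    return {
        "total_bytes": sum(size for _, size in sized),
        "largest": sized[:top],
    }
```

Running this on a schedule and storing total_bytes over time gives a simple storage growth series.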
3. Log Volume by Service
GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "by_service": {
      "terms": {
        "field": "service.keyword",
        "size": 20
      },
      "aggs": {
        "daily_volume": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "day"
          }
        }
      }
    }
  }
}
Log Analysis Best Practices
1. Establish a Baseline
// Record the normal state
PUT logs-baseline/_doc/1
{
  "service": "user-service",
  "normal_error_rate": 0.5,
  "normal_p99_latency": 200,
  "normal_qps": 1000
}
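2. Anomaly Detection
The query returns the number of ERROR documents (the doc_count of current_error_rate) and the total number of documents; dividing the first by the second gives the current error rate. The total counts @timestamp values rather than _id, because aggregating on _id is discouraged and disabled by default in recent Elasticsearch versions.
GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "current_error_rate": {
      "filter": {
        "term": { "level": "ERROR" }
      }
    },
    "total_requests": {
      "value_count": { "field": "@timestamp" }
    }
  }
}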
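The response holds two raw counts, so the rate itself has to be computed client-side. The sketch below does that and compares the result with the stored baseline. It assumes that normal_error_rate in the baseline document is a percentage; the helper names and the tolerance factor of 2 are illustrative.

```python
def error_rate_percent(response):
    """Compute the error rate in percent from the anomaly detection query."""
    aggs = response["aggregations"]
    errors = aggs["current_error_rate"]["doc_count"]
    total = aggs["total_requests"]["value"]
    # Avoid division by zero when the index holds no matching documents.
    return 100.0 * errors / total if total else 0.0


def exceeds_baseline(rate_percent, baseline_percent, tolerance=2.0):
    """True when the rate is more than `tolerance` times the baseline.

    Assumes both values are percentages; `tolerance` is a hypothetical knob.
    """
    return rate_percent > tolerance * baseline_percent
```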
3. Alert Thresholds
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| Error rate | < 1% | 1-5% | > 5% |
| P99 latency | < 500ms | 500-1000ms | > 1000ms |
| QPS drop | < 10% | 10-30% | > 30% |
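The thresholds in the table translate directly into a small classifier. This is one possible encoding, with hypothetical metric keys; boundary values (exactly 1% or exactly 5%, for instance) are treated as warnings, following the ranges in the table.

```python
# (warning_from, critical_above) per metric, taken from the table above.
# The dictionary keys are illustrative names, not fields in the logs.
THRESHOLDS = {
    "error_rate_pct": (1.0, 5.0),
    "p99_latency_ms": (500.0, 1000.0),
    "qps_drop_pct": (10.0, 30.0),
}


def classify(metric, value):
    """Map a metric value to "normal", "warning", or "critical"."""
    warning_from, critical_above = THRESHOLDS[metric]
    if value > critical_above:
        return "critical"
    if value >= warning_from:
        return "warning"
    return "normal"
```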
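Summary
This chapter covered the following log analysis scenarios:
- Error analysis: discovery, clustering, correlation
- Performance analysis: slow requests, distribution, trends
- User behavior: paths, most requested endpoints, activity
- Security auditing: failed logins, abnormal access
- Capacity planning: storage trends, log volume statistics
The next chapter covers performance optimization.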