
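Chapter 5: Log Analysis in Practice

This chapter walks through practical scenarios that show how to analyze logs with the ELK Stack.

Scenario 1: Error Log Analysis

1. Finding Errors

Search for errors in Kibana Discover: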

level: ERROR

2. Error Trend Analysis

GET logs-app/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  },
  "aggs": {
    "errors_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h"
      }
    },
    "errors_by_service": {
      "terms": {
        "field": "service.keyword",
        "size": 10
      }
    },
    "errors_by_type": {
      "terms": {
        "field": "error_type.keyword",
        "size": 10
      }
    }
  }
}
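
Once the response comes back, the hourly buckets can be scanned for the worst hour. The following is a minimal Python sketch, assuming the standard `date_histogram` response shape (`key_as_string`, `doc_count`); the helper name `peak_bucket` and the sample data are made up for illustration.

```python
def peak_bucket(response, agg_name="errors_over_time"):
    """Return (timestamp, count) of the busiest date_histogram bucket, or None."""
    buckets = response["aggregations"][agg_name]["buckets"]
    if not buckets:
        return None
    top = max(buckets, key=lambda b: b["doc_count"])
    return top["key_as_string"], top["doc_count"]


# Hand-written sample in the shape of a date_histogram response (not real data).
sample_response = {
    "aggregations": {
        "errors_over_time": {
            "buckets": [
                {"key_as_string": "2024-01-01T00:00:00.000Z", "doc_count": 3},
                {"key_as_string": "2024-01-01T01:00:00.000Z", "doc_count": 42},
                {"key_as_string": "2024-01-01T02:00:00.000Z", "doc_count": 7},
            ]
        }
    }
}

print(peak_bucket(sample_response))
```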

3. Error Clustering

GET logs-app/_search
{
  "size": 0,
  "query": {
    "term": { "level": "ERROR" }
  },
  "aggs": {
    "error_patterns": {
      "terms": {
        "field": "message.keyword",
        "size": 20,
        "min_doc_count": 5
      }
    }
  }
}
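
A `terms` aggregation on `message.keyword` only groups messages that are byte-for-byte identical, so messages that differ in an ID or IP address land in separate buckets. One possible workaround, sketched here in Python, is to normalize variable tokens into placeholders before counting. The placeholder names and regular expressions are assumptions for illustration, not part of the ELK Stack.

```python
import re
from collections import Counter

# Order matters: replace the most specific patterns first.
_PATTERNS = [
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "<uuid>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def to_template(message):
    """Replace UUIDs, IPv4 addresses and standalone numbers with placeholders."""
    for pattern, placeholder in _PATTERNS:
        message = pattern.sub(placeholder, message)
    return message


def cluster(messages, min_count=1):
    """Count messages per template, most frequent first."""
    counts = Counter(to_template(m) for m in messages)
    return [(tpl, n) for tpl, n in counts.most_common() if n >= min_count]


# Illustrative messages only.
messages = [
    "User 123 not found",
    "User 456 not found",
    "Connection to 10.0.0.5 refused",
]
print(cluster(messages, min_count=2))
```

In a pipeline, the same normalization could run at ingest time so that the template is stored in its own keyword field and aggregated directly.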

4. Correlation Analysis

GET logs-app/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "trace_id": "abc123" } }
      ]
    }
  },
  "sort": [
    { "@timestamp": "asc" }
  ]
}

Scenario 2: Performance Analysis

1. Identifying Slow Requests

GET logs-app/_search
{
  "query": {
    "range": {
      "duration_ms": { "gte": 1000 }
    }
  },
  "sort": [
    { "duration_ms": "desc" }
  ],
  "size": 100
}

2. Response Time Distribution

GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "duration_percentiles": {
      "percentiles": {
        "field": "duration_ms",
        "percents": [50, 75, 90, 95, 99]
      }
    },
    "duration_histogram": {
      "histogram": {
        "field": "duration_ms",
        "interval": 100
      }
    }
  }
}
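
To see why percentiles are tracked alongside averages, it helps to compute them by hand on a small sample. The sketch below uses the exact nearest-rank method in Python. Elasticsearch computes percentiles approximately, so results on real data may differ slightly. The sample durations are invented.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 98 fast requests and 2 very slow ones (illustrative numbers, in milliseconds).
durations = [100] * 98 + [5000, 6000]

mean = sum(durations) / len(durations)
print(mean)                       # 208.0
print(percentile(durations, 50))  # 100
print(percentile(durations, 99))  # 5000
```

The mean of 208 ms hides the fact that some requests take several seconds, while the 99th percentile exposes it.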

3. Comparing Service Performance

GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "services": {
      "terms": {
        "field": "service.keyword",
        "size": 20
      },
      "aggs": {
        "avg_duration": {
          "avg": { "field": "duration_ms" }
        },
        "p99_duration": {
          "percentiles": {
            "field": "duration_ms",
            "percents": [99]
          }
        },
        "request_count": {
          "value_count": { "field": "@timestamp" }
        }
      }
    }
  }
}
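
The nested response is easier to compare once it is flattened into one row per service. This Python sketch assumes the usual response shape for `terms`, `avg`, and `percentiles` aggregations, where the 99th percentile is keyed as `"99.0"`. It reads the request count from each bucket's `doc_count`. The function name and sample values are hypothetical.

```python
def service_rows(response):
    """Flatten the per-service buckets into rows sorted by p99 latency, slowest first."""
    rows = []
    for bucket in response["aggregations"]["services"]["buckets"]:
        rows.append({
            "service": bucket["key"],
            "requests": bucket["doc_count"],
            "avg_ms": bucket["avg_duration"]["value"],
            "p99_ms": bucket["p99_duration"]["values"]["99.0"],
        })
    # Treat a missing p99 (None) as 0 so sorting never fails.
    return sorted(rows, key=lambda r: r["p99_ms"] or 0, reverse=True)


# Hand-written sample response (not real measurements).
sample_response = {
    "aggregations": {
        "services": {
            "buckets": [
                {"key": "user-service", "doc_count": 1200,
                 "avg_duration": {"value": 85.0},
                 "p99_duration": {"values": {"99.0": 410.0}}},
                {"key": "order-service", "doc_count": 800,
                 "avg_duration": {"value": 120.0},
                 "p99_duration": {"values": {"99.0": 1500.0}}},
            ]
        }
    }
}

for row in service_rows(sample_response):
    print(row)
```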

4. Performance Trends

GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m"
      },
      "aggs": {
        "p99": {
          "percentiles": {
            "field": "duration_ms",
            "percents": [99]
          }
        }
      }
    }
  }
}

Scenario 3: User Behavior Analysis

1. User Access Paths

GET logs-app/_search
{
  "query": {
    "term": { "user_id": "user-001" }
  },
  "sort": [
    { "@timestamp": "asc" }
  ],
  "_source": ["@timestamp", "action", "path", "duration_ms"]
}

2. Most Requested Endpoints

GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "top_paths": {
      "terms": {
        "field": "path.keyword",
        "size": 20
      },
      "aggs": {
        "avg_duration": {
          "avg": { "field": "duration_ms" }
        }
      }
    }
  }
}

3. User Activity

GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "active_users": {
      "cardinality": {
        "field": "user_id"
      }
    },
    "users_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h"
      },
      "aggs": {
        "unique_users": {
          "cardinality": {
            "field": "user_id"
          }
        }
      }
    }
  }
}

4. Geographic Distribution

GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "geo_distribution": {
      "terms": {
        "field": "geoip_country.keyword",
        "size": 20
      }
    }
  }
}

Scenario 4: Security Auditing

1. Failed Login Analysis

GET logs-app/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "action": "login" } },
        { "term": { "status": "failed" } }
      ]
    }
  },
  "aggs": {
    "by_ip": {
      "terms": {
        "field": "client_ip.keyword",
        "size": 20
      }
    },
    "by_user": {
      "terms": {
        "field": "username.keyword",
        "size": 20
      }
    }
  }
}

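2. Detecting Abnormal Access

For each of the 100 busiest client IPs in the last hour, count requests per minute and report the busiest minute:

GET logs-app/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-1h" } }
  },
  "aggs": {
    "ips": {
      "terms": {
        "field": "client_ip.keyword",
        "size": 100,
        "order": { "_count": "desc" }
      },
      "aggs": {
        "requests_per_minute": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "1m"
          }
        },
        "peak_requests_per_minute": {
          "max_bucket": { "buckets_path": "requests_per_minute>_count" }
        }
      }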
    }
  }
}
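
A simple way to turn the per-IP buckets into candidates for review is to compare each IP's request count with the median across all returned IPs. The Python sketch below reads only `key` and `doc_count` from the `ips` buckets. The factor of 10 is an arbitrary assumption that would need tuning, and the sample uses documentation-reserved IP addresses.

```python
from statistics import median


def suspicious_ips(response, factor=10, agg_name="ips"):
    """Return IPs whose request count exceeds `factor` times the median count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    if not buckets:
        return []
    typical = median(b["doc_count"] for b in buckets)
    return [b["key"] for b in buckets if b["doc_count"] > factor * typical]


# Hand-written sample response (not real traffic).
sample_response = {
    "aggregations": {
        "ips": {
            "buckets": [
                {"key": "203.0.113.7", "doc_count": 50000},
                {"key": "198.51.100.2", "doc_count": 120},
                {"key": "198.51.100.3", "doc_count": 100},
                {"key": "198.51.100.4", "doc_count": 80},
            ]
        }
    }
}

print(suspicious_ips(sample_response))
```

The median is used instead of the mean because a single heavy client would otherwise inflate the reference value.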

3. Auditing Sensitive Operations

GET logs-app/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "action": "delete" } },
        { "term": { "action": "update_permissions" } },
        { "term": { "action": "export_data" } }
      ]
    }
  },
  "sort": [
    { "@timestamp": "desc" }
  ]
}

Scenario 5: Capacity Planning

1. Storage Growth Trends

GET _cat/indices?v&h=index,store.size,docs.count

GET logs-app/_stats

2. Index Size Analysis

GET _cat/indices/logs-*?v&h=index,store.size,docs.count&s=store.size:desc
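
If the total store size is recorded once a day (for example by adding `bytes=b` to the `_cat/indices` request so that sizes come back as plain byte counts), a rough linear projection can estimate how long the remaining disk will last. The Python sketch below assumes steady growth. The function name, the sample sizes, and the capacity figure are hypothetical.

```python
GIB = 1024 ** 3


def days_until_full(daily_sizes_bytes, capacity_bytes):
    """Estimate days until capacity is reached, assuming linear growth.

    daily_sizes_bytes: total store size sampled once per day, oldest first.
    Returns None when the data is not growing.
    """
    if len(daily_sizes_bytes) < 2:
        raise ValueError("need at least two daily samples")
    span_days = len(daily_sizes_bytes) - 1
    growth_per_day = (daily_sizes_bytes[-1] - daily_sizes_bytes[0]) / span_days
    if growth_per_day <= 0:
        return None
    remaining = capacity_bytes - daily_sizes_bytes[-1]
    return max(0.0, remaining / growth_per_day)


# Illustrative samples: 100, 110, 120, 130 GiB on four consecutive days.
sizes = [100 * GIB, 110 * GIB, 120 * GIB, 130 * GIB]
print(days_until_full(sizes, capacity_bytes=200 * GIB))  # 7.0
```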

3. Log Volume by Service

GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "by_service": {
      "terms": {
        "field": "service.keyword",
        "size": 20
      },
      "aggs": {
        "daily_volume": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "day"
          }
        }
      }
    }
  }
}

Log Analysis Best Practices

1. Establishing a Baseline

// Record the metrics observed under normal conditions
PUT logs-baseline/_doc/1
{
  "service": "user-service",
  "normal_error_rate": 0.5,
  "normal_p99_latency": 200,
  "normal_qps": 1000
}

2. Anomaly Detection

GET logs-app/_search
{
  "size": 0,
  "aggs": {
    "error_count": {
      "filter": {
        "term": { "level": "ERROR" }
      }
    },
    "total_requests": {
      "value_count": { "field": "@timestamp" }
    }
  }
}
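
The query returns two counts, the number of ERROR documents and the total number of documents, so the rate itself still has to be computed on the client. A minimal Python sketch follows. It assumes the baseline value `normal_error_rate` is stored as a percentage, and the tolerance multiplier of 2 is an arbitrary choice for illustration.

```python
def error_rate_percent(error_count, total_count):
    """Error rate as a percentage; 0.0 when there is no traffic."""
    if total_count == 0:
        return 0.0
    return 100.0 * error_count / total_count


def exceeds_baseline(rate_percent, baseline_percent, tolerance=2.0):
    """True when the current rate is more than `tolerance` times the baseline."""
    return rate_percent > baseline_percent * tolerance


# Illustrative counts: 30 errors out of 2000 requests, baseline 0.5%.
rate = error_rate_percent(30, 2000)
print(rate)                         # 1.5
print(exceeds_baseline(rate, 0.5))  # True
```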

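3. Alert Thresholds

| Metric      | Normal   | Warning     | Critical  |
|-------------|----------|-------------|-----------|
| Error rate  | < 1%     | 1-5%        | > 5%      |
| P99 latency | < 500 ms | 500-1000 ms | > 1000 ms |
| QPS drop    | < 10%    | 10-30%      | > 30%     |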
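
The thresholds above map directly onto a small lookup. The Python sketch below shows one way to encode them. The metric keys and severity labels are made-up names, and boundary values (exactly 1% or exactly 5%, for example) are treated as warnings.

```python
# (warning_from, critical_above) per metric, taken from the thresholds above.
THRESHOLDS = {
    "error_rate_pct": (1.0, 5.0),
    "p99_latency_ms": (500.0, 1000.0),
    "qps_drop_pct": (10.0, 30.0),
}


def severity(metric, value):
    """Classify a metric value as 'normal', 'warning' or 'critical'."""
    warning_from, critical_above = THRESHOLDS[metric]
    if value > critical_above:
        return "critical"
    if value >= warning_from:
        return "warning"
    return "normal"


print(severity("error_rate_pct", 0.5))    # normal
print(severity("p99_latency_ms", 750))    # warning
print(severity("qps_drop_pct", 45))       # critical
```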

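Summary

This chapter covered the following log analysis scenarios:

  • Error analysis: discovery, clustering, correlation
  • Performance analysis: slow requests, distribution, trends
  • User behavior: access paths, popular endpoints, activity
  • Security auditing: failed logins, abnormal access
  • Capacity planning: storage trends, log volume statistics

The next chapter covers performance optimization.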