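Chapter 8: Production Practices

This chapter summarizes best practices for running the ELK Stack in production, using Fluent Bit as the log collector.

Architecture Design

High-Availability Architecture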

┌─────────────────────────────────────────────────────────────┐
│                    Production Architecture                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                    Load Balancer                     │   │
│  └───────────────────────┬─────────────────────────────┘   │
│                          │                                 │
│  ┌───────────────────────┼─────────────────────────────┐   │
│  │              Fluent Bit DaemonSet                    │   │
│  │  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐       │   │
│  │  │Node1│  │Node2│  │Node3│  │Node4│  │Node5│       │   │
│  │  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘       │   │
│  └─────┼────────┼────────┼────────┼────────┼───────────┘   │
│        │        │        │        │        │               │
│        └────────┴────────┼────────┴────────┘               │
│                          ▼                                 │
│  ┌───────────────────────────────────────────────────────┐ │
│  │              Kafka (buffer layer, optional)            │ │
│  └───────────────────────────┬───────────────────────────┘ │
│                              │                             │
│  ┌───────────────────────────┼───────────────────────────┐ │
│  │              Elasticsearch Cluster                     │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐            │ │
│  │  │ Master-1 │  │ Master-2 │  │ Master-3 │            │ │
│  │  └──────────┘  └──────────┘  └──────────┘            │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐            │ │
│  │  │  Data-1  │  │  Data-2  │  │  Data-3  │            │ │
│  │  └──────────┘  └──────────┘  └──────────┘            │ │
│  └───────────────────────────────────────────────────────┘ │
│                              │                             │
│  ┌───────────────────────────┼───────────────────────────┐ │
│  │              Kibana (multiple instances)               │ │
│  │  ┌──────────┐  ┌──────────┐                          │ │
│  │  │ Kibana-1 │  │ Kibana-2 │                          │ │
│  │  └──────────┘  └──────────┘                          │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Capacity Planning

Log volume   Elasticsearch (nodes, memory)   Storage   Fluent Bit
10 GB/day    3 nodes, 16 GB                  500 GB    100 MB memory per node
100 GB/day   6 nodes, 32 GB                  5 TB      200 MB memory per node
1 TB/day     12 nodes, 64 GB                 50 TB     500 MB memory per node
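
The storage column can be sanity-checked with simple arithmetic. The sketch below is a rough estimate only: the retention period, replica count, and overhead factor are hypothetical parameters that the table does not state. For example, 25 days of retention with one replica is one combination that yields 500 GB at 10 GB/day.

```python
def estimate_storage_gb(daily_gb, retention_days, replicas=1, overhead=1.0):
    """Rough disk estimate: daily volume x retention x shard copies x overhead.

    All parameters besides daily_gb are assumptions, not values from the table.
    """
    copies = 1 + replicas                      # one primary plus its replicas
    return daily_gb * retention_days * copies * overhead


# Hypothetical inputs: 25 days of retention, 1 replica, no extra overhead
for daily in (10, 100, 1000):
    print(daily, "GB/day ->", estimate_storage_gb(daily, 25), "GB")
```

Real index size also depends on mappings, compression, and how much of the raw log is indexed, so measure a sample of your own data before sizing disks.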

Deployment Manifests

Elasticsearch

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
  namespace: logging
spec:
  version: 8.11.0
  nodeSets:
  - name: master
    count: 3
    config:
      node.roles: ["master"]
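    # ECK expects container resources inside podTemplate, not directly on the nodeSet
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 4Gi
              cpu: 2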
  - name: data-hot
    count: 3
    config:
      node.roles: ["data_hot", "data_content"]
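    # ECK expects container resources inside podTemplate, not directly on the nodeSet
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 32Gi
              cpu: 8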
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Ti
        storageClassName: fast-ssd
  - name: data-warm
    count: 2
    config:
      node.roles: ["data_warm"]
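    # ECK expects container resources inside podTemplate, not directly on the nodeSet
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 16Gi
              cpu: 4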
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Ti
        storageClassName: standard

Fluent Bit

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit   # must match spec.selector.matchLabels
    spec:
      priorityClassName: system-node-critical
      serviceAccountName: fluent-bit
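      tolerations:
      # Newer clusters taint control-plane nodes with the control-plane key
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule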
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.2
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Kibana

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
  namespace: logging
spec:
  version: 8.11.0
  count: 2
  elasticsearchRef:
    name: elasticsearch
  podTemplate:
    spec:
      # ECK expects container resources inside podTemplate, not directly under spec
      containers:
      - name: kibana
        resources:
          requests:
            memory: 2Gi
            cpu: 1
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  kibana.k8s.elastic.co/name: kibana
              topologyKey: kubernetes.io/hostname
Security Configuration

Network Isolation

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: elasticsearch-network-policy
  namespace: logging
spec:
  podSelector:
    matchLabels:
      common.k8s.elastic.co/type: elasticsearch
  policyTypes:
  - Ingress
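  ingress:
  # HTTP API (9200): pods in namespaces labeled name=logging, Fluent Bit, and Kibana.
  # The namespaceSelector only matches if the namespace carries the label name=logging.
  - from:
    - namespaceSelector:
        matchLabels:
          name: logging
    - podSelector:
        matchLabels:
          app: fluent-bit
    - podSelector:
        matchLabels:
          common.k8s.elastic.co/type: kibana
    ports:
    - protocol: TCP
      port: 9200
  # Transport (9300): Elasticsearch nodes must reach each other to form a cluster.
  - from:
    - podSelector:
        matchLabels:
          common.k8s.elastic.co/type: elasticsearch
    ports:
    - protocol: TCP
      port: 9300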

RBAC Configuration

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: logging

Backup and Restore

Snapshot Configuration

// Register the snapshot repository
PUT _snapshot/backup-repo
{
  "type": "s3",
  "settings": {
    "bucket": "elasticsearch-backups",
    "region": "us-east-1",
    "base_path": "snapshots"
  }
}

// Create a snapshot
PUT _snapshot/backup-repo/snapshot-2024-01-15
{
  "indices": "logs-*",
  "ignore_unavailable": true,
  "include_global_state": false
}

// Restore from the snapshot
POST _snapshot/backup-repo/snapshot-2024-01-15/_restore
{
  "indices": "logs-*",
  "ignore_unavailable": true
}
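
For scripted backups, the create-snapshot call above can be assembled programmatically with a date-based name. The following is a minimal sketch using only the Python standard library. The base URL reuses the `elasticsearch:9200` host from the other examples and is an assumption, as are the function name and its defaults. The request is built but not sent.

```python
import json
from datetime import date
from urllib.request import Request

ES_URL = "http://elasticsearch:9200"   # assumed endpoint; adjust scheme and auth for your cluster


def build_snapshot_request(day, repo="backup-repo", indices="logs-*"):
    """Build (but do not send) the PUT request that creates a dated snapshot."""
    name = f"snapshot-{day.isoformat()}"           # e.g. snapshot-2024-01-15
    body = {
        "indices": indices,
        "ignore_unavailable": True,
        "include_global_state": False,
    }
    return Request(
        f"{ES_URL}/_snapshot/{repo}/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_snapshot_request(date(2024, 1, 15))
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. In a secured cluster you would also need to add credentials and trust the cluster's CA certificate.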

Automated Snapshots

PUT _slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "backup-repo",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
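
In the policy above, the cron expression `0 30 1 * * ?` fires at 01:30 every day, and the retention block combines three rules: snapshots older than `expire_after` become eligible for deletion, at least `min_count` snapshots are kept even if expired, and at most `max_count` are kept even if not yet expired. The sketch below is a simplified model of how these rules interact, not Elasticsearch's actual implementation. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(taken_at, now, expire_after_days=30, min_count=5, max_count=50):
    """Simplified model of SLM retention; returns timestamps to delete, oldest first."""
    cutoff = now - timedelta(days=expire_after_days)
    remaining = len(taken_at)
    delete = []
    for ts in sorted(taken_at):                    # walk from oldest to newest
        expired = ts < cutoff
        over_max = remaining > max_count
        # Delete when over the cap, or when expired and still above the floor
        if over_max or (expired and remaining > min_count):
            delete.append(ts)
            remaining -= 1
    return delete


now = datetime(2024, 3, 1, 1, 30)
nightly = [now - timedelta(days=d) for d in range(60)]   # 60 nightly snapshots
print(len(snapshots_to_delete(nightly, now)))            # 29
```

With 60 nightly snapshots, the ten oldest go because of `max_count`, and the rest of those older than 30 days go because of `expire_after`.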

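Operations Runbook

Routine Checks

# These examples assume plain HTTP without authentication. With ECK defaults
# (TLS and authentication enabled), use https://elasticsearch-es-http:9200
# and pass credentials with -u.

# 1. Cluster health
curl -X GET "elasticsearch:9200/_cluster/health?pretty"

# 2. Node status
curl -X GET "elasticsearch:9200/_cat/nodes?v"

# 3. Index status (yellow indices only)
curl -X GET "elasticsearch:9200/_cat/indices?v&health=yellow"

# 4. Disk usage
curl -X GET "elasticsearch:9200/_cat/allocation?v"

# 5. Fluent Bit status
kubectl logs -n logging -l app=fluent-bit --tail=100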
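
The first check is easy to automate. The sketch below takes the parsed JSON from `_cluster/health` and returns a list of findings. It is a minimal illustration, and the function name is hypothetical. The expected node count of 8 comes from the manifests in this chapter (3 master, 3 hot, 2 warm).

```python
def health_problems(health, expected_nodes):
    """Return human-readable findings from a _cluster/health response dict."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"only {nodes} of {expected_nodes} nodes joined")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} unassigned shards")
    return problems


# Sample response fragment for illustration only
sample = {"status": "yellow", "number_of_nodes": 8, "unassigned_shards": 3}
for finding in health_problems(sample, expected_nodes=8):
    print(finding)
```

An empty list means the cluster passed all three checks. The same function can feed an alerting hook or a cron job.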

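Troubleshooting

Cluster status RED:

# List unassigned shards (_cat/shards has no state filter, so filter with grep)
curl -s "elasticsearch:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

# Ask Elasticsearch why the first unassigned shard it finds cannot be allocated
curl -X GET "elasticsearch:9200/_cluster/allocation/explain?pretty"

# Last resort: force-allocate a stale primary copy. This accepts data loss,
# so use it only when no good copy of the shard can be recovered.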
curl -X POST "elasticsearch:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "logs-app",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}'
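
Before forcing an allocation, it helps to see why shards are unassigned and whether any of them are primaries, since a RED status means at least one primary is missing. The sketch below groups the output of `_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason`. The function name and the sample rows are illustrative assumptions.

```python
from collections import Counter


def unassigned_by_reason(shards):
    """Count UNASSIGNED shards per reason and list the unassigned primaries."""
    reasons = Counter()
    primaries = []
    for row in shards:
        if row.get("state") != "UNASSIGNED":
            continue
        reasons[row.get("unassigned.reason") or "UNKNOWN"] += 1
        if row.get("prirep") == "p":               # "p" marks a primary shard
            primaries.append((row["index"], row["shard"]))
    return reasons, primaries


# Illustrative rows in the shape returned by the cat API with format=json
rows = [
    {"index": "logs-app", "shard": "0", "prirep": "p", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-app", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-app", "shard": "1", "prirep": "p", "state": "STARTED", "unassigned.reason": None},
]
reasons, primaries = unassigned_by_reason(rows)
print(dict(reasons), primaries)
```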

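Disk full:

# Delete old indices
curl -X DELETE "elasticsearch:9200/logs-app-2024.01.01"

# Or update the ILM policy (run in Kibana Dev Tools). Note that PUT replaces the
# whole policy, so include every phase you want to keep, not just delete.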
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
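
When freeing disk space by hand, a small helper can pick out which date-suffixed indices are past a cutoff. This sketch assumes index names follow the `logs-app-YYYY.MM.DD` pattern used in the delete example. The function name and defaults are hypothetical. Note that ILM measures `min_age` from rollover or index creation, not from the name, so this is only an approximation of what the policy would delete.

```python
from datetime import date, datetime


def expired_indices(names, today, max_age_days=7, prefix="logs-app-"):
    """Return index names whose date suffix (YYYY.MM.DD) is older than max_age_days."""
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue                               # skip names without a parseable date
        if (today - day).days > max_age_days:
            expired.append(name)
    return sorted(expired)


names = ["logs-app-2024.01.01", "logs-app-2024.01.10", "logs-app-latest", "other-2023.01.01"]
print(expired_indices(names, today=date(2024, 1, 12)))
```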

JVM OOM:

# Check heap usage
curl -X GET "elasticsearch:9200/_nodes/stats/jvm?pretty"

# Reduce the replica count (fewer shard copies to hold in memory, at the cost of redundancy)
PUT logs-app/_settings
{
  "number_of_replicas": 0
}
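
The node stats response reports `jvm.mem.heap_used_percent` for each node, which makes it straightforward to flag nodes under heap pressure before they run out of memory. The sketch below is illustrative. The 85% threshold and the function name are assumptions, not values from this chapter.

```python
def nodes_over_heap_threshold(stats, threshold_percent=85):
    """Return (node name, heap_used_percent) pairs at or above the threshold."""
    hot = []
    for node in stats.get("nodes", {}).values():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used >= threshold_percent:
            hot.append((node["name"], used))
    # Highest heap usage first
    return sorted(hot, key=lambda item: item[1], reverse=True)


# Trimmed sample in the shape of a _nodes/stats/jvm response (node IDs are made up)
sample_stats = {
    "nodes": {
        "id-a": {"name": "data-1", "jvm": {"mem": {"heap_used_percent": 91}}},
        "id-b": {"name": "data-2", "jvm": {"mem": {"heap_used_percent": 62}}},
        "id-c": {"name": "data-3", "jvm": {"mem": {"heap_used_percent": 87}}},
    }
}
print(nodes_over_heap_threshold(sample_stats))
```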

Performance Benchmarks

Indexing Performance

Configuration             Throughput
Single node, default      5K docs/s
3 nodes, tuned            50K docs/s
6 nodes, bulk indexing    200K docs/s
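
Bulk indexing sends many documents in one request to the `_bulk` endpoint as newline-delimited JSON: an action line followed by the document source, with a trailing newline after the last line. The sketch below builds such a body. The function name and the `logs-app` default index are illustrative assumptions.

```python
import json


def build_bulk_body(docs, index="logs-app"):
    """Serialize documents into the NDJSON body expected by POST /_bulk."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))   # action line
        lines.append(json.dumps(doc))                            # document source
    # Every line, including the last, must end with a newline
    return "".join(line + "\n" for line in lines)


body = build_bulk_body([
    {"level": "info", "message": "service started"},
    {"level": "error", "message": "connection refused"},
])
print(body, end="")
```

Send the body with the `Content-Type: application/x-ndjson` header. Batch size is a tuning parameter, so test a few sizes against your own cluster.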

Search Performance

Configuration             Throughput
Single shard              1K QPS
10 shards                 10K QPS
10 shards + caching       50K QPS

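Best Practices Summary

1. Architecture design

  • Multi-node high availability
  • Tiered storage (hot/warm/cold)
  • Buffer layer (Kafka)

2. Performance tuning

  • Sensible shard counts
  • Bulk indexing
  • Query caching

3. Security hardening

  • Network isolation
  • RBAC permissions
  • Data encryption

4. Operational safeguards

  • Automated backups
  • Monitoring and alerting
  • Incident response plans

5. Cost control

  • ILM lifecycle policies
  • Cold data archiving
  • Resource limits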

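Summary

Key points for running the ELK Stack in production:

  • Architecture design: high availability, tiered storage
  • Deployment configuration: resource planning, security hardening
  • Backup and restore: snapshots, SLM
  • Operations runbook: routine checks, troubleshooting
  • Performance benchmarks: indexing and search performance

After completing this tutorial, you should be able to deploy and manage an ELK Stack logging system in production.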