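Chapter 8: Production Practices

This chapter summarizes best practices for running the ELK Stack in production.

Architecture Design

High-Availability Architecture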

┌─────────────────────────────────────────────────────────────┐
│                    Production Architecture                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                    Load Balancer                     │   │
│  └───────────────────────┬─────────────────────────────┘   │
│                          │                                 │
│  ┌───────────────────────┼─────────────────────────────┐   │
│  │              Fluent Bit DaemonSet                    │   │
│  │  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐       │   │
│  │  │Node1│  │Node2│  │Node3│  │Node4│  │Node5│       │   │
│  │  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘       │   │
│  └─────┼────────┼────────┼────────┼────────┼───────────┘   │
│        │        │        │        │        │               │
│        └────────┴────────┼────────┴────────┘               │
│                          ▼                                 │
│  ┌───────────────────────────────────────────────────────┐ │
│  │              Kafka (buffer layer, optional)            │ │
│  └───────────────────────────┬───────────────────────────┘ │
│                              │                             │
│  ┌───────────────────────────┼───────────────────────────┐ │
│  │              Elasticsearch Cluster                     │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐            │ │
│  │  │ Master-1 │  │ Master-2 │  │ Master-3 │            │ │
│  │  └──────────┘  └──────────┘  └──────────┘            │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐            │ │
│  │  │  Data-1  │  │  Data-2  │  │  Data-3  │            │ │
│  │  └──────────┘  └──────────┘  └──────────┘            │ │
│  └───────────────────────────────────────────────────────┘ │
│                              │                             │
│  ┌───────────────────────────┼───────────────────────────┐ │
│  │              Kibana (multiple instances)               │ │
│  │  ┌──────────┐  ┌──────────┐                          │ │
│  │  │ Kibana-1 │  │ Kibana-2 │                          │ │
│  │  └──────────┘  └──────────┘                          │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Capacity Planning

Log volume	Elasticsearch	Storage	Fluent Bit
10 GB/day	3 nodes, 16 GB	500 GB	100 MB memory per node
100 GB/day	6 nodes, 32 GB	5 TB	200 MB memory per node
1 TB/day	12 nodes, 64 GB	50 TB	500 MB memory per node
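Each storage figure in the table is 50 times the daily log volume. One way to arrive at that ratio is sketched below, assuming 25 days of retention and one replica per shard; both values are assumptions for illustration and do not come from the table, and `overhead` is a hypothetical multiplier for index overhead and free-space headroom.

```python
def estimate_storage_gb(daily_gb: float, retention_days: int,
                        replicas: int = 1, overhead: float = 1.0) -> float:
    """Rough disk estimate: each day of logs is stored once per shard copy.

    overhead is a hypothetical multiplier; 1.0 means raw volume only.
    """
    copies = 1 + replicas  # one primary plus its replicas
    return daily_gb * retention_days * copies * overhead


# 10 GB/day kept for 25 days with one replica matches the 500 GB row.
print(estimate_storage_gb(10, 25))    # 500.0
print(estimate_storage_gb(100, 25))   # 5000.0, i.e. about 5 TB
```

Plugging in your own retention period and replica count gives a starting point that should then be checked against real index sizes.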

Deployment Manifests

Elasticsearch

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
  namespace: logging
spec:
  version: 8.11.0
  nodeSets:
  - name: master
    count: 3
    config:
      node.roles: ["master"]
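    podTemplate:
      spec:
        containers:
        - name: elasticsearch   # ECK takes container resources via podTemplate
          resources:
            requests:
              memory: 4Gi
              cpu: 2
            limits:
              memory: 4Gi       # JVM heap is sized from the memory limit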
  - name: data-hot
    count: 3
    config:
      node.roles: ["data_hot", "data_content"]
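    podTemplate:
      spec:
        containers:
        - name: elasticsearch   # ECK takes container resources via podTemplate
          resources:
            requests:
              memory: 32Gi
              cpu: 8
            limits:
              memory: 32Gi      # JVM heap is sized from the memory limit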
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Ti
        storageClassName: fast-ssd
  - name: data-warm
    count: 2
    config:
      node.roles: ["data_warm"]
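    podTemplate:
      spec:
        containers:
        - name: elasticsearch   # ECK takes container resources via podTemplate
          resources:
            requests:
              memory: 16Gi
              cpu: 4
            limits:
              memory: 16Gi      # JVM heap is sized from the memory limit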
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Ti
        storageClassName: standard
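Before applying the manifest it helps to check that the Kubernetes cluster can actually schedule it. The sketch below totals the memory and CPU requests of the three nodeSets; the numbers are copied from the manifest above, and the tuple layout is just an illustrative choice.

```python
# (name, pod count, memory request in GiB, CPU request in cores),
# copied from the nodeSets in the manifest above.
NODE_SETS = [
    ("master", 3, 4, 2),
    ("data-hot", 3, 32, 8),
    ("data-warm", 2, 16, 4),
]


def total_requests(node_sets):
    """Sum memory (GiB) and CPU (cores) requested across all pods."""
    memory = sum(count * mem for _, count, mem, _ in node_sets)
    cpu = sum(count * cores for _, count, _, cores in node_sets)
    return memory, cpu


memory_gib, cpu_cores = total_requests(NODE_SETS)
print(memory_gib, cpu_cores)  # 140 38
```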

Fluent Bit

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit   # must match spec.selector.matchLabels
    spec:
      priorityClassName: system-node-critical
      serviceAccountName: fluent-bit
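      tolerations:
      # Newer clusters taint control-plane nodes with "control-plane",
      # older ones with "master"; tolerate both.
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule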
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.2
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Kibana

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
  namespace: logging
spec:
  version: 8.11.0
  count: 2
  elasticsearchRef:
    name: elasticsearch
  podTemplate:
    spec:
      containers:
      - name: kibana   # ECK takes container resources via podTemplate
        resources:
          requests:
            memory: 2Gi
            cpu: 1
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  kibana.k8s.elastic.co/name: kibana
              topologyKey: kubernetes.io/hostname

Security Configuration

Network Isolation

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: elasticsearch-network-policy
  namespace: logging
spec:
  podSelector:
    matchLabels:
      common.k8s.elastic.co/type: elasticsearch
  policyTypes:
  - Ingress
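  ingress:
  # HTTP API (9200): clients in the logging namespace, Fluent Bit, and Kibana.
  - from:
    - namespaceSelector:
        matchLabels:
          # Set automatically on every namespace; no manual labeling needed.
          kubernetes.io/metadata.name: logging
    - podSelector:
        matchLabels:
          app: fluent-bit
    - podSelector:
        matchLabels:
          common.k8s.elastic.co/type: kibana
    ports:
    - protocol: TCP
      port: 9200
  # Transport (9300): Elasticsearch nodes must reach each other to form a cluster.
  - from:
    - podSelector:
        matchLabels:
          common.k8s.elastic.co/type: elasticsearch
    ports:
    - protocol: TCP
      port: 9300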

RBAC Configuration

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: logging

Backup and Restore

Snapshot Configuration

// Register the snapshot repository
PUT _snapshot/backup-repo
{
  "type": "s3",
  "settings": {
    "bucket": "elasticsearch-backups",
    "region": "us-east-1",
    "base_path": "snapshots"
  }
}

// Create a snapshot
PUT _snapshot/backup-repo/snapshot-2024-01-15
{
  "indices": "logs-*",
  "ignore_unavailable": true,
  "include_global_state": false
}

// Restore a snapshot (an open index with the same name must be closed or deleted first)
POST _snapshot/backup-repo/snapshot-2024-01-15/_restore
{
  "indices": "logs-*",
  "ignore_unavailable": true
}
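The snapshot name above is hard-coded to one date. As a minimal sketch of how a script might derive the name from the current day, the helper below builds the method, path, and JSON body for the create-snapshot call without sending it; `snapshot_request` is a hypothetical helper name, and the body mirrors the console example above.

```python
import json
from datetime import date


def snapshot_request(repo: str, day: date, indices: str = "logs-*"):
    """Return (method, path, body) for a date-stamped snapshot request."""
    name = f"snapshot-{day.isoformat()}"  # e.g. snapshot-2024-01-15
    body = {
        "indices": indices,
        "ignore_unavailable": True,
        "include_global_state": False,
    }
    return "PUT", f"/_snapshot/{repo}/{name}", json.dumps(body)


method, path, body = snapshot_request("backup-repo", date(2024, 1, 15))
print(method, path)  # PUT /_snapshot/backup-repo/snapshot-2024-01-15
```

Keeping request construction separate from the HTTP call makes the naming logic easy to test before it touches a cluster.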

Automated Snapshots

PUT _slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "backup-repo",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
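The three retention settings interact: `expire_after` marks snapshots as expired, `min_count` keeps the newest snapshots even when they have expired, and `max_count` caps the total even when nothing has expired. The function below is a simplified model of that interaction, written for illustration only; it is not Elasticsearch's implementation and ignores details such as failed snapshots.

```python
def snapshots_to_delete(ages_days, expire_after=30, min_count=5, max_count=50):
    """Return the ages (in days) of snapshots this simplified model would delete."""
    deleted = []
    # Walk from newest to oldest, so i is the number of newer snapshots.
    for i, age in enumerate(sorted(ages_days)):
        if i < min_count:
            continue                 # always keep the newest min_count
        if i >= max_count or age > expire_after:
            deleted.append(age)      # over the cap, or expired
    return deleted


print(snapshots_to_delete([1, 2, 3, 40, 50, 60, 70]))  # [60, 70]
```

In the example, the 40- and 50-day-old snapshots survive even though they are past 30 days, because they are among the five newest.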

Operations Runbook

Routine Checks

# 1. Cluster health
curl -X GET "elasticsearch:9200/_cluster/health?pretty"

# 2. Node status
curl -X GET "elasticsearch:9200/_cat/nodes?v"

# 3. Index status (yellow indices only)
curl -X GET "elasticsearch:9200/_cat/indices?v&health=yellow"

# 4. Disk usage
curl -X GET "elasticsearch:9200/_cat/allocation?v"

# 5. Fluent Bit status
kubectl logs -n logging -l app=fluent-bit --tail=100
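The first check is easier to alert on once the JSON is reduced to a single line. The sketch below interprets a parsed `_cluster/health` response; the field names are those returned by the API, while the `OK`/`WARN`/`CRIT` wording is an arbitrary convention chosen for this example.

```python
def summarize_health(health: dict) -> str:
    """Turn a parsed _cluster/health response into a one-line status."""
    status = health["status"]
    unassigned = health.get("unassigned_shards", 0)
    nodes = health.get("number_of_nodes", 0)
    if status == "green":
        return f"OK: {nodes} nodes, all shards assigned"
    if status == "yellow":
        # Yellow: every primary is assigned, but some replicas are not.
        return f"WARN: {unassigned} replica shard(s) unassigned"
    # Red: at least one primary shard is unassigned.
    return f"CRIT: status {status}, {unassigned} shard(s) unassigned"


sample = {"status": "yellow", "number_of_nodes": 8, "unassigned_shards": 3}
print(summarize_health(sample))  # WARN: 3 replica shard(s) unassigned
```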

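Troubleshooting

Cluster status RED:

# List unassigned shards (_cat/shards has no state filter, so filter with grep)
curl -s "elasticsearch:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

# Manual allocation (last resort: allocate_stale_primary promotes a stale shard copy and can lose data)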
curl -X POST "elasticsearch:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "logs-app",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}'
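Since a RED status is caused by unassigned primaries, it is worth isolating them before issuing any reroute command. A minimal sketch, assuming the input is the parsed output of `_cat/shards?format=json` (where every value, including the shard number, is a string):

```python
def unassigned_primaries(shards):
    """Return (index, shard number) for every unassigned primary shard."""
    return [
        (s["index"], int(s["shard"]))
        for s in shards
        if s["state"] == "UNASSIGNED" and s["prirep"] == "p"
    ]


rows = [
    {"index": "logs-app", "shard": "0", "prirep": "p", "state": "UNASSIGNED"},
    {"index": "logs-app", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
    {"index": "logs-app", "shard": "1", "prirep": "p", "state": "STARTED"},
]
print(unassigned_primaries(rows))  # [('logs-app', 0)]
```

Unassigned replicas are left out on purpose: they make the cluster yellow, not red, and usually recover on their own once nodes return.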

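Disk full:

# Delete old indices
curl -X DELETE "elasticsearch:9200/logs-app-2024.01.01"

# Or shorten retention in the ILM policy (Kibana Dev Tools console syntax; PUT replaces the whole policy, so include every phase you still need)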
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
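When deleting by hand, the candidates can be picked from the date suffix in the index name. The sketch below assumes daily indices named like `logs-app-2024.01.01`, as in the delete command above; the seven-day cutoff mirrors the policy's `min_age`, and `expired_indices` is a hypothetical helper name.

```python
from datetime import date, datetime


def expired_indices(names, today, max_age_days=7, prefix="logs-app-"):
    """Return index names whose date suffix is more than max_age_days old."""
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue  # not one of our daily indices
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if (today - day).days > max_age_days:
            expired.append(name)
    return expired


names = ["logs-app-2024.01.01", "logs-app-2024.01.10", "logs-app-latest"]
print(expired_indices(names, today=date(2024, 1, 15)))  # ['logs-app-2024.01.01']
```

Anything that does not parse as a dated index is skipped rather than deleted, which is the safer default for a cleanup script.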

JVM OOM:

# Check heap usage
curl -X GET "elasticsearch:9200/_nodes/stats/jvm?pretty"

# Reduce the shard count by dropping replicas (temporary relief: this removes redundancy)
PUT logs-app/_settings
{
  "number_of_replicas": 0
}
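The node stats response is long, so a small filter helps to spot the nodes under heap pressure. This sketch reads `heap_used_percent` from a parsed `_nodes/stats/jvm` response; the 85% threshold is an assumption picked for illustration, not a value from this chapter.

```python
def nodes_over_heap_threshold(stats: dict, threshold: int = 85):
    """Return (node name, heap %) for nodes at or above threshold, worst first."""
    hot = []
    for node in stats["nodes"].values():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used >= threshold:
            hot.append((node["name"], used))
    return sorted(hot, key=lambda item: item[1], reverse=True)


sample = {
    "nodes": {
        "a1": {"name": "data-hot-0", "jvm": {"mem": {"heap_used_percent": 91}}},
        "b2": {"name": "data-hot-1", "jvm": {"mem": {"heap_used_percent": 62}}},
        "c3": {"name": "data-warm-0", "jvm": {"mem": {"heap_used_percent": 88}}},
    }
}
print(nodes_over_heap_threshold(sample))  # [('data-hot-0', 91), ('data-warm-0', 88)]
```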

Performance Benchmarks

Indexing Performance

Configuration	Throughput
Single node, defaults	5K docs/s
3 nodes, tuned	50K docs/s
6 nodes, bulk indexing	200K docs/s
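The last row relies on bulk indexing, where many documents travel in one `_bulk` request instead of one request each. The sketch below shows how such request bodies are laid out as newline-delimited JSON; the batch size of 1000 is an arbitrary starting point, and the code only builds the bodies without sending them.

```python
import json


def bulk_bodies(docs, index, batch_size=1000):
    """Yield newline-delimited _bulk request bodies of at most batch_size docs."""
    action = json.dumps({"index": {"_index": index}})
    for start in range(0, len(docs), batch_size):
        lines = []
        for doc in docs[start:start + batch_size]:
            lines.append(action)           # action line
            lines.append(json.dumps(doc))  # document source line
        yield "\n".join(lines) + "\n"      # _bulk requires a trailing newline


docs = [{"message": f"event {i}"} for i in range(5)]
bodies = list(bulk_bodies(docs, "logs-app", batch_size=2))
print(len(bodies))  # 3
```

Each body would be sent as a `POST` to `/_bulk` with the `Content-Type: application/x-ndjson` header.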

Search Performance

Configuration	Throughput
Single shard	1K QPS
10 shards	10K QPS
10 shards + caching	50K QPS

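Best Practices Summary

1. Architecture Design

Multi-node high availability
Tiered storage (hot/warm/cold)
Buffer layer (Kafka)

2. Performance Tuning

Appropriate shard count
Bulk indexing
Query caching

3. Security Hardening

Network isolation
RBAC permissions
Data encryption

4. Operational Safeguards

Automated backups
Monitoring and alerting
Incident runbooks

5. Cost Control

ILM lifecycle management
Cold data archiving
Resource limits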

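Summary

Key points for running the ELK Stack in production:

Architecture design: high availability, tiered storage
Deployment configuration: resource planning, security hardening
Backup and restore: snapshots, SLM
Operations runbook: routine checks, troubleshooting
Performance benchmarks: indexing and search performance

After completing this tutorial, you should be able to deploy and manage an ELK Stack logging system in production.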