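Chapter 8: Production Practices
This chapter summarizes best practices for running the ELK Stack in production, with Fluent Bit as the log collector.
Architecture Design
High-Availability Architecture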
┌─────────────────────────────────────────────────────────────┐
│                   Production Architecture                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                     Load Balancer                     │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              │                              │
│  ┌───────────────────────────┴───────────────────────────┐  │
│  │                 Fluent Bit DaemonSet                  │  │
│  │      ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐      │  │
│  │      │Node1│  │Node2│  │Node3│  │Node4│  │Node5│      │  │
│  │      └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘      │  │
│  └─────────┼────────┼────────┼────────┼────────┼─────────┘  │
│            │        │        │        │        │            │
│            └────────┴────────┼────────┴────────┘            │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │            Kafka (buffer layer, optional)             │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              │                              │
│  ┌───────────────────────────┴───────────────────────────┐  │
│  │                 Elasticsearch Cluster                 │  │
│  │       ┌──────────┐  ┌──────────┐  ┌──────────┐        │  │
│  │       │ Master-1 │  │ Master-2 │  │ Master-3 │        │  │
│  │       └──────────┘  └──────────┘  └──────────┘        │  │
│  │       ┌──────────┐  ┌──────────┐  ┌──────────┐        │  │
│  │       │  Data-1  │  │  Data-2  │  │  Data-3  │        │  │
│  │       └──────────┘  └──────────┘  └──────────┘        │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              │                              │
│  ┌───────────────────────────┴───────────────────────────┐  │
│  │              Kibana (multiple instances)              │  │
│  │              ┌──────────┐  ┌──────────┐               │  │
│  │              │ Kibana-1 │  │ Kibana-2 │               │  │
│  │              └──────────┘  └──────────┘               │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Capacity Planning
| Daily log volume | Elasticsearch (nodes, memory) | Storage | Fluent Bit |
|---|---|---|---|
| 10 GB/day | 3 nodes, 16 GB | 500 GB | 100 MB memory per node |
| 100 GB/day | 6 nodes, 32 GB | 5 TB | 200 MB memory per node |
| 1 TB/day | 12 nodes, 64 GB | 50 TB | 500 MB memory per node |
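The storage column can be approximated with simple arithmetic. The sketch below is one way to do it; the retention period (20 days), replica count (1), and 25% headroom are assumptions chosen for illustration, not values stated in this chapter, although together they happen to reproduce the storage figures in the table.

```python
def required_storage_gb(daily_gb: float, retention_days: int,
                        replicas: int = 1, overhead: float = 1.25) -> float:
    """Estimate total cluster storage for a given daily log volume.

    Everything except daily_gb is an assumption for illustration:
    retention_days -- how long indices are kept before deletion
    replicas       -- replica copies per primary shard
    overhead       -- headroom for indexing overhead and free-disk watermarks
    """
    copies = 1 + replicas  # one primary plus its replicas
    return daily_gb * retention_days * copies * overhead


for daily in (10, 100, 1000):
    total = required_storage_gb(daily, retention_days=20)
    print(f"{daily} GB/day -> {total:,.0f} GB")
```

Changing the retention period or replica count scales the result linearly, so it is worth rerunning the estimate with the values your own policies use.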
Deployment Manifests
Elasticsearch
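apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
  namespace: logging
spec:
  version: 8.11.0
  nodeSets:
  # Three dedicated master-eligible nodes, so a quorum survives the loss of one
  - name: master
    count: 3
    config:
      node.roles: ["master"]
    # ECK reads container resources from podTemplate, not from a nodeSet-level field
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 4Gi
              cpu: 2
            limits:
              memory: 4Gi   # ECK derives the JVM heap size from the memory limit
  # Hot tier: recent, actively written indices on fast storage
  - name: data-hot
    count: 3
    config:
      node.roles: ["data_hot", "data_content"]
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 32Gi
              cpu: 8
            limits:
              memory: 32Gi
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Ti
        storageClassName: fast-ssd
  # Warm tier: older, rarely queried indices on cheaper storage
  - name: data-warm
    count: 2
    config:
      node.roles: ["data_warm"]
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 16Gi
              cpu: 4
            limits:
              memory: 16Gi
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Ti
        storageClassName: standard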
Fluent Bit
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit   # must match spec.selector, or the API server rejects the DaemonSet
    spec:
      priorityClassName: system-node-critical
      serviceAccountName: fluent-bit
      tolerations:
      # Also run on control-plane nodes; newer clusters use the control-plane taint key
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.2
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        - name: varlog
          mountPath: /var/log
          readOnly: true
        # Docker runtime log directory; containerd keeps pod logs under /var/log/pods
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
Kibana
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
  namespace: logging
spec:
  version: 8.11.0
  count: 2
  elasticsearchRef:
    name: elasticsearch
  podTemplate:
    spec:
      # ECK reads container resources from podTemplate, not from a top-level spec field
      containers:
      - name: kibana
        resources:
          requests:
            memory: 2Gi
            cpu: 1
      # Prefer placing the two Kibana instances on different nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  kibana.k8s.elastic.co/name: kibana
              topologyKey: kubernetes.io/hostname
Security Configuration
Network Isolation
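apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: elasticsearch-network-policy
  namespace: logging
spec:
  podSelector:
    matchLabels:
      common.k8s.elastic.co/type: elasticsearch
  policyTypes:
  - Ingress
  ingress:
  # HTTP API (9200). Entries under "from" are ORed, so the namespace entry
  # alone already admits every pod in the logging namespace.
  # If the ECK operator runs in another namespace, allow it here as well.
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: logging   # set automatically on Kubernetes 1.21+
    - podSelector:
        matchLabels:
          app: fluent-bit
    - podSelector:
        matchLabels:
          common.k8s.elastic.co/type: kibana
    ports:
    - protocol: TCP
      port: 9200
  # Transport (9300): Elasticsearch nodes must reach each other to form a cluster
  - from:
    - podSelector:
        matchLabels:
          common.k8s.elastic.co/type: elasticsearch
    ports:
    - protocol: TCP
      port: 9300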
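How the `from` entries of a NetworkPolicy combine is easy to misread, so here is a small model of the matching rules. It is a simplified sketch rather than the Kubernetes implementation: it only handles `matchLabels` (no `matchExpressions` or `ipBlock`), and the `Pod` class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    namespace: str
    labels: dict = field(default_factory=dict)
    namespace_labels: dict = field(default_factory=dict)


def matches(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every selector key/value must be present."""
    return all(labels.get(key) == value for key, value in selector.items())


def peer_allows(peer: dict, source: Pod, policy_namespace: str) -> bool:
    """One `from` entry: selectors inside a single entry are ANDed."""
    ns_selector = peer.get("namespaceSelector")
    pod_selector = peer.get("podSelector")
    if ns_selector is not None:
        if not matches(ns_selector, source.namespace_labels):
            return False
    elif source.namespace != policy_namespace:
        # A podSelector-only entry is scoped to the policy's own namespace.
        return False
    return pod_selector is None or matches(pod_selector, source.labels)


def rule_allows(peers: list, source: Pod, policy_namespace: str) -> bool:
    """Separate `from` entries are ORed."""
    return any(peer_allows(peer, source, policy_namespace) for peer in peers)


PEERS = [
    {"podSelector": {"app": "fluent-bit"}},
    {"podSelector": {"common.k8s.elastic.co/type": "kibana"}},
]
collector = Pod("logging", {"app": "fluent-bit"})
outsider = Pod("default", {"app": "fluent-bit"})
print(rule_allows(PEERS, collector, "logging"))  # True
print(rule_allows(PEERS, outsider, "logging"))   # False
```

Adding a separate `namespaceSelector` entry to the list widens access to every pod in the matching namespaces, regardless of the pod selectors next to it.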
RBAC Configuration
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: logging
Backup and Restore
Snapshot Configuration
// Register the snapshot repository
PUT _snapshot/backup-repo
{
  "type": "s3",
  "settings": {
    "bucket": "elasticsearch-backups",
    "region": "us-east-1",
    "base_path": "snapshots"
  }
}
// Create a snapshot
PUT _snapshot/backup-repo/snapshot-2024-01-15
{
  "indices": "logs-*",
  "ignore_unavailable": true,
  "include_global_state": false
}
// Restore the snapshot (an open index with the same name must first be closed or deleted, or the restore must rename it)
POST _snapshot/backup-repo/snapshot-2024-01-15/_restore
{
  "indices": "logs-*",
  "ignore_unavailable": true
}
Automated Snapshots
// Runs daily at 01:30 (cron schedules are evaluated in UTC)
PUT _slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "backup-repo",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
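The three retention settings interact: `min_count` protects the newest snapshots even after they expire, and `max_count` removes the oldest ones even before they expire. The sketch below models that interaction under simplifying assumptions (every snapshot succeeded, and only timestamps matter); the function name is hypothetical and this is not the SLM implementation.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(taken_at, now, expire_after=timedelta(days=30),
                        min_count=5, max_count=50):
    """Return the snapshot timestamps a retention run would remove."""
    newest_first = sorted(taken_at, reverse=True)
    doomed = []
    for rank, timestamp in enumerate(newest_first):
        if rank < min_count:
            continue  # always keep the newest min_count snapshots
        if rank >= max_count or now - timestamp > expire_after:
            doomed.append(timestamp)
    return doomed


now = datetime(2024, 3, 1, 1, 30)
nightly = [now - timedelta(days=age) for age in range(60)]
print(len(snapshots_to_delete(nightly, now)), "of", len(nightly), "would be deleted")
```

With nightly snapshots, `expire_after` is the setting that normally decides; `max_count` only matters when snapshots are taken more often than once a day.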
Operations Runbook
Daily Checks
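# Note: with ECK defaults the HTTP service is named <cluster>-es-http (here elasticsearch-es-http)
# and requires HTTPS plus credentials; adjust the host, scheme, and curl auth flags accordingly.
# 1. Cluster health
curl -X GET "elasticsearch:9200/_cluster/health?pretty"
# 2. Node status
curl -X GET "elasticsearch:9200/_cat/nodes?v"
# 3. Indices in yellow state
curl -X GET "elasticsearch:9200/_cat/indices?v&health=yellow"
# 4. Disk usage per node
curl -X GET "elasticsearch:9200/_cat/allocation?v"
# 5. Fluent Bit status
kubectl logs -n logging -l app=fluent-bit --tail=100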
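The health check lends itself to automation. The sketch below turns a `_cluster/health` response into a list of problems; the function names and the expected node count are assumptions, while the response fields (`status`, `number_of_nodes`, `unassigned_shards`) are the ones Elasticsearch returns.

```python
import json
import urllib.request


def fetch_health(base_url: str) -> dict:
    """GET <base_url>/_cluster/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def health_problems(health: dict, expected_nodes: int) -> list:
    """Return human-readable findings; an empty list means all checks passed."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"only {nodes} of {expected_nodes} nodes have joined")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} shards are unassigned")
    return problems


# Example with a canned response; in a real check, pass fetch_health(...) instead.
sample = {"status": "yellow", "number_of_nodes": 7, "unassigned_shards": 4}
for problem in health_problems(sample, expected_nodes=8):
    print("WARN:", problem)
```

Keeping the evaluation separate from the HTTP call makes the rules testable without a cluster. A secured cluster would also need TLS and authentication, which this sketch leaves out.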
Troubleshooting
Cluster RED:
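# List unassigned shards with the reason (_cat/shards has no state filter, so filter client-side)
curl -s -X GET "elasticsearch:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
# Ask the cluster why a shard is unassigned before forcing anything
curl -X GET "elasticsearch:9200/_cluster/allocation/explain?pretty"
# Last resort: promote a stale copy to primary. Writes missing from that copy are lost.
curl -X POST "elasticsearch:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "logs-app",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}'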
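When a cluster is red, the shards that matter are the unassigned primaries; unassigned replicas only make it yellow. The sketch below filters them out of `_cat/shards` output requested as JSON (`format=json&h=index,shard,prirep,state,unassigned.reason`). The function name is hypothetical, and the assumption that the JSON keys mirror the requested column names should be checked against your version.

```python
def unassigned_primaries(rows: list) -> list:
    """Return (index, shard, reason) for every unassigned primary shard."""
    return [
        (row["index"], int(row["shard"]), row.get("unassigned.reason"))
        for row in rows
        if row.get("state") == "UNASSIGNED" and row.get("prirep") == "p"
    ]


# Canned rows shaped like the JSON form of _cat/shards (all values are strings).
rows = [
    {"index": "logs-app", "shard": "0", "prirep": "p",
     "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-app", "shard": "0", "prirep": "r",
     "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-app", "shard": "1", "prirep": "p",
     "state": "STARTED", "unassigned.reason": None},
]
for index, shard, reason in unassigned_primaries(rows):
    print(f"{index}[{shard}] has no primary: {reason}")
```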
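Disk full:
# Delete old indices
curl -X DELETE "elasticsearch:9200/logs-app-2024.01.01"
# Or update the ILM policy (Kibana Console syntax). PUT replaces the whole policy,
# so include any other phases the existing policy defines.
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}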
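Before deleting by hand, it helps to list exactly which daily indices are past the retention window. The sketch below assumes index names end in a `YYYY.MM.DD` date, as in `logs-app-2024.01.01`; the function name, the prefix default, and the 7-day cutoff are illustrative. ILM measures `min_age` from rollover or index creation rather than from the name, so this is only an approximation of what the policy would remove.

```python
from datetime import date, datetime


def expired_indices(names, today: date, max_age_days: int = 7,
                    prefix: str = "logs-app-") -> list:
    """Return index names whose date suffix is at least max_age_days old."""
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date; leave the index alone
        if (today - day).days >= max_age_days:
            expired.append(name)
    return sorted(expired)


names = [
    "logs-app-2024.01.01",
    "logs-app-2024.01.08",
    "logs-app-2024.01.10",
    "logs-app-latest",
    "metrics-2024.01.01",
]
print(expired_indices(names, today=date(2024, 1, 15)))
```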
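JVM OOM:
# Check heap usage
curl -X GET "elasticsearch:9200/_nodes/stats/jvm?pretty"
# Reduce the number of shard copies by dropping replicas (temporary: this removes redundancy until replicas are restored)
PUT logs-app/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}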
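The node stats response is verbose, so a small filter makes the heap check quicker to read. The sketch below pulls `heap_used_percent` for each node from a `_nodes/stats/jvm` response and reports nodes above a threshold. The 85% threshold and the function name are assumptions for illustration.

```python
def nodes_over_heap_threshold(stats: dict, threshold_pct: int = 85) -> list:
    """Return (node name, heap used %) pairs at or above the threshold, worst first."""
    flagged = []
    for node_id, node in stats.get("nodes", {}).items():
        used_pct = node["jvm"]["mem"]["heap_used_percent"]
        if used_pct >= threshold_pct:
            flagged.append((node.get("name", node_id), used_pct))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Canned response trimmed to the fields used here.
stats = {
    "nodes": {
        "a1": {"name": "data-hot-0", "jvm": {"mem": {"heap_used_percent": 91}}},
        "b2": {"name": "data-hot-1", "jvm": {"mem": {"heap_used_percent": 62}}},
        "c3": {"name": "data-warm-0", "jvm": {"mem": {"heap_used_percent": 87}}},
    }
}
for name, used_pct in nodes_over_heap_threshold(stats):
    print(f"{name}: heap {used_pct}% used")
```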
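Performance Benchmarks
Write Performance
| Configuration | Throughput (approx.) |
|---|---|
| Single node, default settings | 5K docs/s |
| 3 nodes, tuned | 50K docs/s |
| 6 nodes, bulk indexing | 200K docs/s |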
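Bulk indexing sends many documents in a single `_bulk` request as newline-delimited JSON: one action line followed by one source line per document, with a trailing newline. The sketch below builds such a body and splits documents into batches; the helper names and the batch size are hypothetical, and sending the request (with `Content-Type: application/x-ndjson`) is left out.

```python
import json


def bulk_body(index: str, docs: list) -> str:
    """Build an NDJSON body for POST /_bulk that indexes docs into one index."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # source line
    return "\n".join(lines) + "\n"  # the body must end with a newline


def batches(docs: list, size: int):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


docs = [{"level": "info", "msg": f"event {n}"} for n in range(5)]
for batch in batches(docs, size=2):
    body = bulk_body("logs-app", batch)
    print(f"{len(batch)} docs, {len(body)} bytes")
```

Batch size is a tuning knob: larger batches cut per-request overhead but increase memory pressure on the cluster, so it is usually found by measurement.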
Search Performance
| Configuration | Throughput (approx.) |
|---|---|
| Single shard | 1K QPS |
| 10 shards | 10K QPS |
| 10 shards + caching | 50K QPS |
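Best Practices Summary
1. Architecture Design
- Multi-node high availability
- Tiered storage (hot/warm/cold)
- Buffer layer (Kafka)
2. Performance Optimization
- Appropriate shard count
- Bulk indexing
- Query caching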
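"Appropriate shard count" can be made concrete with a size target. The sketch below assumes one index per day and a target of roughly 30 GB per primary shard, which falls inside the commonly cited 10 to 50 GB range. Both the target and the function name are illustrative assumptions.

```python
import math


def primary_shards(daily_gb: float, target_shard_gb: float = 30.0) -> int:
    """Primary shards for a daily index so each stays near the target size."""
    return max(1, math.ceil(daily_gb / target_shard_gb))


for daily in (10, 100, 1000):
    print(f"{daily} GB/day -> {primary_shards(daily)} primary shards per daily index")
```

For small volumes this yields a single shard per day, which is usually a sign that weekly or size-based rollover would suit the workload better than daily indices.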
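3. Security Hardening
- Network isolation
- RBAC permissions
- Data encryption
4. Operational Safeguards
- Automated backups
- Monitoring and alerting
- Incident response plans
5. Cost Control
- ILM lifecycle policies
- Archiving cold data
- Resource limits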
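Summary
Key points for running the ELK Stack in production:
- Architecture design: high availability, tiered storage
- Deployment configuration: resource planning, security hardening
- Backup and restore: snapshots, SLM
- Operations runbook: daily checks, troubleshooting
- Performance benchmarks: write and search performance
After completing this tutorial, you should be able to deploy and manage an ELK Stack logging system in production.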