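Chapter 6: Performance Optimization
This chapter covers performance optimization strategies for a distributed tracing system.
Sampling Optimization
Adaptive Sampling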
# Jaeger adaptive sampling configuration
sampling:
  type: adaptive
  strategies:
    - service: api-gateway
      type: probabilistic
      param: 0.5
    - service: user-service
      type: probabilistic
      param: 0.3
    - service: order-service
      type: probabilistic
      param: 0.1
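The idea behind adaptive sampling is to adjust each service's probability so that the number of sampled traces stays near a target. The sketch below illustrates that calculation only; it is not Jaeger's implementation, and the function name, the target of 100 traces per second, and the traffic figures are all assumptions made for the example.

```python
def adaptive_rate(observed_per_sec, target_per_sec, min_rate=0.001, max_rate=1.0):
    """Return the sampling probability that yields roughly target_per_sec sampled traces."""
    if observed_per_sec <= 0:
        return max_rate  # no traffic observed: sample everything
    rate = target_per_sec / observed_per_sec
    # Clamp so that a service is never sampled above 100% or below the floor
    return max(min_rate, min(max_rate, rate))


# Hypothetical request rates (requests per second) for the three services
observed = {"api-gateway": 200.0, "user-service": 1000.0, "order-service": 50.0}
rates = {svc: adaptive_rate(rps, target_per_sec=100.0) for svc, rps in observed.items()}
```

Busy services end up with a low probability and quiet services are sampled in full, which is the behavior that fixed per-service rates can only approximate.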
Smart Sampling
import random

class SmartSampler:
    def __init__(self, base_rate=0.1):
        self.base_rate = base_rate
        self.error_rate = 1.0  # failed requests are sampled at 100%
        self.slow_threshold = 1000  # slow-request threshold, in the same unit as span['duration']

    def should_sample(self, span):
        # Always sample failed requests
        if span.get('error'):
            return True
        # Always sample slow requests
        if span['duration'] > self.slow_threshold:
            return True
        # Always sample critical operations
        if span['operation_name'] in ('login', 'payment'):
            return True
        # Sample everything else at the base rate
        return random.random() < self.base_rate
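A purely random decision can keep some spans of a trace and drop others. One common way around this is to derive the decision from the trace ID, so every span of the same trace gets the same answer. The following is a minimal sketch of that idea; the function name and the choice of SHA-256 are assumptions for illustration, not part of any tracing library.

```python
import hashlib


def sample_by_trace_id(trace_id: str, rate: float) -> bool:
    """Deterministic probabilistic decision: same trace ID, same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes to an integer in [0, 2**64)
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the trace if its bucket falls in the lowest `rate` fraction of the range
    return bucket < int(rate * 2**64)


decision_a = sample_by_trace_id("4bf92f3577b34da6", 0.1)
decision_b = sample_by_trace_id("4bf92f3577b34da6", 0.1)
```

Here `decision_a` and `decision_b` are always equal, whichever service evaluates them.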
Rate-Limited Sampling
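# Assumes a RateLimiter class whose acquire() raises when the limit is exceeded;
# check that the ratelimit module you use actually provides it.
from ratelimit import RateLimiter

class RateLimitSampler:
    def __init__(self, rate_per_second=100):
        self.limiter = RateLimiter(rate_per_second)

    def should_sample(self):
        try:
            self.limiter.acquire()
            return True
        except Exception:  # limit exceeded: skip this span instead of failing
            return False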
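Rate-limited sampling can also be sketched with a token bucket built on the standard library alone. The class below is an illustrative assumption, not the limiter used above: the name `TokenBucketSampler` and the injectable `clock` parameter are made up for the example, and the code is not thread-safe.

```python
import time


class TokenBucketSampler:
    def __init__(self, rate_per_second=100, clock=time.monotonic):
        self.rate = float(rate_per_second)      # tokens added per second
        self.capacity = float(rate_per_second)  # at most one second of burst
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def should_sample(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token for this sampled span
            return True
        return False
```

With `rate_per_second=2`, two calls in a row return `True`, the third returns `False`, and one more token becomes available after half a second.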
Storage Optimization
Index Optimization
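// Elasticsearch index settings
// Note: number_of_shards is fixed when an index is created, and a wildcard target
// cannot be used to create an index, so in practice these settings belong in the
// index template that matches jaeger-span-*.
PUT jaeger-span-*
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "refresh_interval": "30s",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "30s"
  }
}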
Data Retention Policy
# Jaeger data retention
esIndexCleaner:
  enabled: true
  numberOfDays: 7
  schedule: "0 0 * * *"
# Tiered storage (index rollover)
esRollover:
  enabled: true
  schedule: "0 0 * * *"
  conditions:
    max_age: "1d"
    max_docs: 10000000
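To make the effect of `numberOfDays: 7` concrete, the sketch below picks out daily indices older than the retention window. It illustrates the retention rule only and is not the index cleaner's actual code: the function name, the `jaeger-span-YYYY-MM-DD` naming scheme, and the exact boundary handling are assumptions.

```python
from datetime import date, datetime, timedelta


def expired_indices(index_names, today, number_of_days=7, prefix="jaeger-span-"):
    """Return the daily indices whose date is older than the retention window."""
    cutoff = today - timedelta(days=number_of_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # belongs to another index family
        try:
            day = datetime.strptime(name[len(prefix):], "%Y-%m-%d").date()
        except ValueError:
            continue  # not a date-suffixed index (for example a rollover index)
        if day < cutoff:
            expired.append(name)
    return expired


names = [
    "jaeger-span-2024-01-01",
    "jaeger-span-2024-01-03",
    "jaeger-span-2024-01-09",
    "jaeger-service-2024-01-01",
    "jaeger-span-000001",
]
to_delete = expired_indices(names, today=date(2024, 1, 10))
```

Only `jaeger-span-2024-01-01` is selected here; the index dated exactly seven days back is kept under this sketch's boundary rule.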
Hot/Cold Data Separation
# ILM policy
PUT _ilm/policy/jaeger-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_size": "50gb"
          }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          },
          "shrink": {
            "number_of_shards": 1
          }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
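The phase timings in this policy can be read as a simple lookup: an index stays hot until it is 3 days past rollover, is warm until day 7, and is then deleted. The helper below is only a sketch of that reading; the function name is hypothetical, and real ILM checks conditions on a polling interval, so transitions are not instantaneous.

```python
from datetime import timedelta

# (phase, min_age) pairs taken from the policy, latest phase first
PHASES = [
    ("delete", timedelta(days=7)),
    ("warm", timedelta(days=3)),
    ("hot", timedelta(0)),
]


def ilm_phase(age_since_rollover):
    """Return the phase an index of the given age would be in."""
    for name, min_age in PHASES:
        if age_since_rollover >= min_age:
            return name
    return "hot"  # negative ages are treated as hot


phase_day_2 = ilm_phase(timedelta(days=2))
phase_day_5 = ilm_phase(timedelta(days=5))
phase_day_8 = ilm_phase(timedelta(days=8))
```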
Collector Optimization
Batch Processing
# OpenTelemetry Collector configuration
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048
  memory_limiter:
    check_interval: 1s  # memory_limiter needs a non-zero check interval
    limit_mib: 512
    spike_limit_mib: 128
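Two pieces of arithmetic sit behind these values. The memory limiter starts refusing data at a soft limit equal to `limit_mib - spike_limit_mib`, and the batch processor splits any batch larger than `send_batch_max_size`. The helpers below are a sketch of those two rules with made-up function names, not Collector code.

```python
def memory_limits(limit_mib, spike_limit_mib):
    """Return the soft and hard limits implied by a memory_limiter config."""
    if not 0 < spike_limit_mib < limit_mib:
        raise ValueError("spike_limit_mib must be positive and below limit_mib")
    return {"soft_mib": limit_mib - spike_limit_mib, "hard_mib": limit_mib}


def split_batch(spans, send_batch_max_size):
    """Split an oversized batch into chunks of at most send_batch_max_size spans."""
    return [spans[i:i + send_batch_max_size]
            for i in range(0, len(spans), send_batch_max_size)]


limits = memory_limits(limit_mib=512, spike_limit_mib=128)
chunk_sizes = [len(chunk) for chunk in split_batch(list(range(2500)), 2048)]
```

With the values above, the soft limit works out to 384 MiB, and a burst of 2500 spans leaves the processor as one batch of 2048 and one of 452.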
Parallel Processing
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger]
  telemetry:
    metrics:
      level: detailed
Cache Optimization
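from functools import lru_cache

class SpanCache:
    def __init__(self, service_registry, maxsize=10000):
        # service_registry: mapping of service_id -> service name (assumed to be passed in)
        self.service_registry = service_registry
        # Wrap the lookup per instance so that maxsize is honored and clear() works
        self._lookup = lru_cache(maxsize=maxsize)(self.service_registry.get)

    def get_service_name(self, service_id):
        return self._lookup(service_id)

    def clear(self):
        self._lookup.cache_clear()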
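Whether such a cache pays off can be checked with the `cache_info()` counters that `functools.lru_cache` exposes. The snippet below is a small sketch; the registry contents and lookup sequence are made up for the example.

```python
from functools import lru_cache

# Hypothetical registry: service_id -> service name
registry = {"svc-1": "api-gateway", "svc-2": "user-service"}
lookup = lru_cache(maxsize=10000)(registry.get)

for service_id in ["svc-1", "svc-2", "svc-1", "svc-1"]:
    lookup(service_id)

info = lookup.cache_info()  # named tuple: hits, misses, maxsize, currsize
hit_rate = info.hits / (info.hits + info.misses)
```

Note that `lru_cache` also caches a `None` result for unknown IDs, so a service registered later stays invisible until the cache is cleared.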
Agent Optimization
Memory Optimization
Batch Sending
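import time

class BatchSpanExporter:
    def __init__(self, exporter, max_batch_size=100, max_wait_time=5):
        self.exporter = exporter  # downstream object with an export(list_of_spans) method (assumed)
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time  # seconds
        self.buffer = []
        self.last_export = time.monotonic()

    def export(self, span):
        self.buffer.append(span)
        # Flush when the batch is full or the wait time has elapsed.
        # The time check only runs when a new span arrives.
        if (len(self.buffer) >= self.max_batch_size
                or time.monotonic() - self.last_export > self.max_wait_time):
            self._flush()

    def _flush(self):
        if self.buffer:
            self.exporter.export(self.buffer)
            self.buffer = []
        self.last_export = time.monotonic()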
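A batching buffer can grow without bound if the backend stalls. One way to cap agent memory is a fixed-size buffer that evicts the oldest spans and counts what it dropped. This is a sketch under that assumption; the class name and the drop-oldest policy are illustrative choices, not taken from any agent implementation.

```python
from collections import deque


class BoundedSpanBuffer:
    def __init__(self, max_spans=1000):
        self.spans = deque(maxlen=max_spans)  # deque discards the oldest item when full
        self.dropped = 0

    def add(self, span):
        if len(self.spans) == self.spans.maxlen:
            self.dropped += 1  # the oldest span is about to be evicted
        self.spans.append(span)

    def drain(self):
        """Return all buffered spans and empty the buffer."""
        batch = list(self.spans)
        self.spans.clear()
        return batch


buf = BoundedSpanBuffer(max_spans=3)
for span_id in range(5):
    buf.add(span_id)
batch = buf.drain()
```

After five spans are added to a buffer of three, `batch` holds the three newest spans and `buf.dropped` is 2, a number worth exporting as a metric.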
Query Optimization
Index Design
// Mapping tuned for query performance
PUT jaeger-span
{
  "mappings": {
    "properties": {
      "traceID": { "type": "keyword" },
      "spanID": { "type": "keyword" },
      "operationName": {
        "type": "keyword",
        "ignore_above": 256
      },
      "startTime": { "type": "date" },
      "duration": { "type": "long" },
      "tags": {
        "type": "nested",
        "properties": {
          "key": { "type": "keyword" },
          "value": { "type": "keyword" }
        }
      }
    }
  }
}
Query Optimization
// Use filter context instead of query context (no relevance scoring, results can be cached)
GET jaeger-span-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "process.serviceName": "user-service" } },
        { "range": { "startTime": { "gte": "now-1h" } } }
      ]
    }
  },
  "size": 100
}
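When such queries are issued from application code, building the body in one place keeps every clause in filter context. The helper below is a sketch that only constructs the request body; the function name and parameters are assumptions, and sending it to Elasticsearch is left to whichever client is in use.

```python
import json


def build_span_query(service_name, lookback="1h", size=100):
    """Build a filter-only search body for spans of one service."""
    return {
        "query": {
            "bool": {
                # Filter clauses skip scoring, so Elasticsearch can cache them
                "filter": [
                    {"term": {"process.serviceName": service_name}},
                    {"range": {"startTime": {"gte": f"now-{lookback}"}}},
                ]
            }
        },
        "size": size,
    }


body = build_span_query("user-service")
body_json = json.dumps(body)
```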
Caching Query Results
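from cachetools import TTLCache

class TraceQueryCache:
    def __init__(self, query_trace, maxsize=1000, ttl=60):
        self.query_trace = query_trace  # callable: trace_id -> trace (backend lookup, assumed to be passed in)
        self.cache = TTLCache(maxsize=maxsize, ttl=ttl)  # ttl in seconds

    def get_trace(self, trace_id):
        # Serve from the cache while the entry is still fresh
        try:
            return self.cache[trace_id]
        except KeyError:
            pass
        # Cache miss or expired entry: query the backend and store the result
        trace = self.query_trace(trace_id)
        self.cache[trace_id] = trace
        return trace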
Performance Monitoring
Key Metrics
| Metric | Description | Alert threshold |
|---|---|---|
| span_ingestion_rate | Span ingestion rate | > 100K/s |
| span_latency | Span processing latency | P99 > 100ms |
| storage_size | Storage usage | > 80% |
| query_latency | Query latency | P99 > 1s |
| error_rate | Error rate | > 1% |
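The P99 and error-rate thresholds in the table can be checked against raw samples as follows. This is a sketch that uses the nearest-rank percentile definition; the function names and the sample data are made up, and a monitoring system would normally compute these values from histograms.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def breached(span_latencies_ms, query_latencies_s, errors, total):
    """Return the names of the metrics that exceed the thresholds in the table."""
    alerts = []
    if percentile(span_latencies_ms, 99) > 100:  # P99 > 100ms
        alerts.append("span_latency")
    if percentile(query_latencies_s, 99) > 1:    # P99 > 1s
        alerts.append("query_latency")
    if total and errors / total > 0.01:          # > 1%
        alerts.append("error_rate")
    return alerts


alerts = breached(
    span_latencies_ms=[10] * 98 + [500] * 2,
    query_latencies_s=[0.2] * 100,
    errors=2,
    total=100,
)
```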
Monitoring Configuration
# Prometheus alerting rules
groups:
  - name: tracing-alerts
    rules:
      - alert: HighSpanIngestionRate
        # sum() aggregates across all series so the threshold applies to the total rate
        expr: sum(rate(jaeger_collector_spans_received_total[5m])) > 100000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High span ingestion rate"
      - alert: HighQueryLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(jaeger_query_latency_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High query latency"
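Summary
Key points for performance optimization:
- Sampling: adaptive, smart, and rate-limited sampling
- Storage: index tuning, retention, and tiering
- Collector: batching, parallelism, and caching
- Agent: memory and batched sending
- Queries: index design, filters, and result caching
The next chapter covers alerting and monitoring.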