# Chapter 7: Production Deployment
## Docker Deployment

### Dockerfile

```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install vllm

EXPOSE 8000

CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "Qwen/Qwen2-7B-Instruct", \
     "--host", "0.0.0.0", \
     "--port", "8000"]
```
### Docker Compose

```yaml
version: '3.8'
services:
  vllm:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MODEL_NAME=Qwen/Qwen2-7B-Instruct
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      python3 -m vllm.entrypoints.openai.api_server
      --model Qwen/Qwen2-7B-Instruct
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.9
      --max-model-len 4096
```

Note that `command` replaces the image's `CMD` entirely, so it must spell out the full server invocation rather than just the extra flags.
## Kubernetes Deployment

### Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - Qwen/Qwen2-7B-Instruct
            - --tensor-parallel-size
            - "2"
          resources:
            limits:
              nvidia.com/gpu: 2
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
```
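The Deployment above references a PersistentVolumeClaim named `model-cache-pvc` that is not defined in this chapter. A minimal sketch (the requested size is an assumption — adjust it and the storage class to your cluster):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi   # assumed size: a 7B model in fp16 is ~15 GiB, leave headroom
```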
### Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
  type: LoadBalancer
```
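Once the Service is reachable, deployment scripts can gate traffic on the server's `/health` endpoint (exposed by vLLM's OpenAI-compatible server on the same port). A small client-side helper — `check_health` is a hypothetical name, not part of vLLM — using only the standard library:

```python
import json
import urllib.request


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if base_url/health responds with {"status": "healthy"}."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            body = json.loads(resp.read().decode())
        return body.get("status") == "healthy"
    except (OSError, ValueError):
        # Connection refused, timeout, or malformed response all count as unhealthy
        return False
```

A rollout script can poll this in a loop and only switch traffic once it returns `True`.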
## Monitoring and Alerting

### Prometheus Configuration

vLLM's OpenAI server exposes Prometheus metrics at `/metrics` on the serving port, so the default `metrics_path` works:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm-service:8000']
```
### Alert Rules

```yaml
groups:
  - name: vllm-alerts
    rules:
      - alert: HighGPUUsage
        expr: vllm:gpu_cache_usage_perc > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU cache usage is too high"
      - alert: HighLatency
        expr: vllm:time_to_first_token_seconds > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Time to first token is too high"
```
## Log Management

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Raise the verbosity of vLLM's loggers via the standard logging hierarchy
logging.getLogger("vllm").setLevel(logging.DEBUG)
```
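If logs are shipped to an aggregator (ELK, Loki, and the like), one-record-per-JSON-line output is easier to parse than the plain format above. A minimal sketch using only the standard `logging` module — `JsonFormatter` is our own illustrative class, not a vLLM API:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for log aggregators."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "name": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("vllm").addHandler(handler)
```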
## Health Check

```python
from fastapi import FastAPI
from vllm.engine.arg_utils import EngineArgs
from vllm.engine.llm_engine import LLMEngine

app = FastAPI()
engine = None


@app.on_event("startup")
async def startup():
    global engine
    args = EngineArgs(model="Qwen/Qwen2-7B-Instruct")
    engine = LLMEngine.from_engine_args(args)


@app.get("/health")
async def health():
    if engine is None:
        return {"status": "unhealthy"}
    return {"status": "healthy"}
```
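The `/health` endpoint is a natural target for Kubernetes probes. A sketch of probe entries for the Deployment's `vllm` container — the delay values are assumptions, since large models can take minutes to load:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 180  # assumed: leave time for model loading
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60   # assumed
  periodSeconds: 5
```

On newer clusters a `startupProbe` is a cleaner way to cover the long model-loading window than a large `initialDelaySeconds`.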
## Summary

This chapter covered deploying vLLM in production: Docker and Docker Compose, Kubernetes, monitoring and alerting, log management, and health checks. With that, the vLLM tutorial is complete.