
Chapter 7: Production Deployment

Docker Deployment

Dockerfile

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install vllm

EXPOSE 8000

CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "Qwen/Qwen2-7B-Instruct", \
     "--host", "0.0.0.0", \
     "--port", "8000"]

Docker Compose

version: '3.8'

services:
  vllm:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MODEL_NAME=Qwen/Qwen2-7B-Instruct
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # `command:` replaces the Dockerfile CMD entirely, so the full
    # invocation must be repeated here, not just the extra flags.
    command: >
      python3 -m vllm.entrypoints.openai.api_server
      --model Qwen/Qwen2-7B-Instruct
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.9
      --max-model-len 4096

Kubernetes Deployment

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - Qwen/Qwen2-7B-Instruct
        - --tensor-parallel-size
        - "2"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 2
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
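The vLLM OpenAI-compatible server exposes a /health endpoint, which Kubernetes can use for readiness and liveness probes. A sketch of what could be added under the container spec above — the delay values are rough guesses, since model download and load time varies with model size and disk speed:

```yaml
        # Probes for the vllm container; tune the delays to your model.
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
```

Without a readiness probe, the Service may route traffic to a pod that is still loading model weights.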

Service

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
  type: LoadBalancer

Monitoring and Alerting

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm-service:8000']
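vLLM serves its metrics in the Prometheus text exposition format at /metrics on the API port. To make the format concrete, here is a minimal standard-library sketch that parses a scraped payload; it handles only unlabeled samples (real output also contains labeled series), and the sample values are invented:

```python
# Minimal parser for the Prometheus text exposition format.
# The sample payload mimics what vLLM serves at /metrics; values are made up.
sample = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage (0-1).
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.42
vllm:num_requests_running 3.0
"""

def parse_metrics(payload: str) -> dict[str, float]:
    """Map metric name -> value, skipping comment and blank lines.

    Labeled series (name{label="x"} value) are not handled here.
    """
    metrics = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

metrics = parse_metrics(sample)
print(metrics["vllm:gpu_cache_usage_perc"])  # 0.42
```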

Alert Rules

groups:
  - name: vllm-alerts
    rules:
    - alert: HighGPUCacheUsage
      expr: vllm:gpu_cache_usage_perc > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU KV-cache usage is too high"

    - alert: HighLatency
      # time_to_first_token_seconds is a histogram; alert on a quantile
      # over the _bucket series rather than comparing the raw metric.
      expr: histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m])) > 5
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "p95 time to first token is too high"
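Prometheus estimates a quantile from cumulative histogram buckets by finding the bucket where the target rank falls and linearly interpolating inside it. A sketch of that computation in plain Python — the bucket bounds and counts are invented for illustration, not real vLLM data:

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative Prometheus-style buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with (math.inf, total_count). Linear interpolation within
    the bucket that crosses the target rank, as Prometheus does.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # Fraction of the way through this bucket's observations.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Invented TTFT distribution: 60 requests <=0.5s, 90 <=1s, 99 <=5s, 100 total.
buckets = [(0.5, 60), (1.0, 90), (5.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # ~3.22
```

This is why the alert must reference `vllm:time_to_first_token_seconds_bucket`: the estimate needs the per-bucket cumulative counts, not a single summary number.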

Logging

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# vLLM's loggers live under the "vllm" namespace; raise them to DEBUG here
# (setting the VLLM_LOGGING_LEVEL environment variable achieves the same).
logging.getLogger("vllm").setLevel(logging.DEBUG)
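For production log aggregation, plain-text lines are often replaced with one JSON object per line. A standard-library sketch of that idea — the field names here are arbitrary choices, not a vLLM convention, and the StringIO stream stands in for stdout or a file:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "logger": record.name,
            "level": record.levelname,
            "msg": record.getMessage(),
        })

stream = io.StringIO()  # stand-in for stdout/file in this sketch
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("vllm.demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger

logger.info("engine started")
record = json.loads(stream.getvalue())
print(record["msg"])  # engine started
```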

Health Check

from fastapi import FastAPI, Response
from vllm.engine.arg_utils import EngineArgs
from vllm.engine.llm_engine import LLMEngine

app = FastAPI()
engine = None

@app.on_event("startup")
async def startup():
    global engine
    args = EngineArgs(model="Qwen/Qwen2-7B-Instruct")
    engine = LLMEngine.from_engine_args(args)

@app.get("/health")
async def health(response: Response):
    # Report 503 until the engine has finished loading, so load
    # balancers and probes treat the instance as not ready.
    if engine is None:
        response.status_code = 503
        return {"status": "unhealthy"}
    return {"status": "healthy"}

Summary

This chapter covered deploying vLLM in production. With this, the vLLM tutorial is complete.