# Chapter 6: Distributed Inference

## Distributed Architecture
```text
┌─────────────────────────────────────────┐
│              Client Request             │
└─────────────────┬───────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│               API Server                │
└─────────────────┬───────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│               vLLM Engine               │
├──────────────┬──────────────────────────┤
│    GPU 0     │          GPU 1           │
│  (Shard 0)   │        (Shard 1)         │
└──────────────┴──────────────────────────┘
```
## Tensor Parallelism

### Single Node, Multiple GPUs
```bash
# Tensor parallelism across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-72B-Instruct \
    --tensor-parallel-size 2

# Tensor parallelism across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-72B-Instruct \
    --tensor-parallel-size 4

# Tensor parallelism across 8 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-72B-Instruct \
    --tensor-parallel-size 8
```
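As a rough sizing check, tensor parallelism divides the weight memory evenly across GPUs. The numbers below are back-of-envelope estimates (fp16 weights only, ignoring KV cache, activations, and overhead):

```python
def weights_per_gpu_gb(n_params: float, tp_size: int, bytes_per_param: int = 2) -> float:
    """Approximate per-GPU weight memory (GB) under tensor parallelism, fp16."""
    return n_params * bytes_per_param / tp_size / 1e9

# Qwen2-72B has roughly 72e9 parameters
for tp in (2, 4, 8):
    print(f"tp={tp}: ~{weights_per_gpu_gb(72e9, tp):.0f} GB of weights per GPU")
```

This is why the 72B model needs `--tensor-parallel-size 4` or more on 40 GB-class GPUs: at tp=2 each shard still holds about 72 GB of weights before any KV cache is allocated.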
### Python Code
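Tensor parallelism is also available from the vLLM Python API. A minimal offline-inference sketch, assuming a machine with at least 2 CUDA devices (the 7B model is used here so the sharded weights fit comfortably):

```python
from vllm import LLM, SamplingParams

# Shard the model across 2 GPUs via tensor parallelism.
llm = LLM(model="Qwen/Qwen2-7B-Instruct", tensor_parallel_size=2)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Briefly introduce vLLM."], params)
print(outputs[0].outputs[0].text)
```

The `tensor_parallel_size` keyword mirrors the `--tensor-parallel-size` CLI flag; everything else about sampling and generation is unchanged from single-GPU use.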
## Pipeline Parallelism
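Pipeline parallelism splits the model by layers into sequential stages, and it composes with tensor parallelism: the engine needs `tensor_parallel_size × pipeline_parallel_size` GPUs in total. The sketch below illustrates the idea with an even contiguous split; this is an illustrative assumption, not vLLM's exact partitioning logic:

```python
def split_layers(num_layers: int, pp_size: int) -> list[range]:
    """Evenly assign contiguous layer ranges to pipeline stages."""
    base, extra = divmod(num_layers, pp_size)
    stages, start = [], 0
    for stage in range(pp_size):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages

# Qwen2-72B has 80 transformer layers; pp=2 gives two stages of 40 layers,
# and tp=8 x pp=2 requires 16 GPUs in total.
print(split_layers(80, 2))
print(8 * 2, "GPUs for tp=8, pp=2")
```

In vLLM this is enabled with `--pipeline-parallel-size` on the CLI or `pipeline_parallel_size` in `EngineArgs`, as shown in the distributed configuration below.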
## Multi-Node Deployment

### Ray Cluster
```bash
# Start the Ray cluster
# Node 1 (head)
ray start --head --port=6379

# Node 2 (worker)
ray start --address=<head-node-ip>:6379

# Launch vLLM on the head node; it attaches to the running Ray cluster
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-72B-Instruct \
    --tensor-parallel-size 8 \
    --distributed-executor-backend ray
```
### Distributed Configuration
```python
from vllm import LLM, EngineArgs

# tp=8 x pp=2 shards the model across 16 GPUs in total
args = EngineArgs(
    model="Qwen/Qwen2-72B-Instruct",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
    distributed_executor_backend="ray",
)
llm = LLM(**vars(args))
```
## Load Balancing

### Nginx Configuration
```nginx
upstream vllm_backends {
    least_conn;
    server 192.168.1.1:8000;
    server 192.168.1.2:8000;
    server 192.168.1.3:8000;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://vllm_backends;
        proxy_set_header Host $host;
    }
}
```
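The `least_conn` directive routes each new request to the backend with the fewest active connections, which suits LLM serving because request durations vary widely. A toy model of the policy (illustrative only, not Nginx's implementation):

```python
def pick_backend(active: dict[str, int]) -> str:
    """Return the backend with the fewest active connections (least_conn)."""
    return min(active, key=active.get)

# Simulated in-flight connection counts per backend
active = {"192.168.1.1:8000": 3, "192.168.1.2:8000": 1, "192.168.1.3:8000": 2}
backend = pick_backend(active)
print(backend)          # the least-loaded backend is chosen
active[backend] += 1    # the new request now counts against it
```

Plain round-robin would send long-running generation requests to an already busy replica; counting in-flight connections avoids that.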
### High-Availability Deployment
```yaml
# docker-compose.yml
version: '3.8'

services:
  vllm-1:
    image: vllm/vllm-openai:latest
    command: --model Qwen/Qwen2-7B-Instruct --tensor-parallel-size 2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]

  vllm-2:
    image: vllm/vllm-openai:latest
    command: --model Qwen/Qwen2-7B-Instruct --tensor-parallel-size 2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - vllm-1
      - vllm-2
```
## Summary

This chapter covered distributed inference with vLLM: tensor and pipeline parallelism, multi-node deployment on a Ray cluster, and load balancing across replicas. The next chapter turns to production deployment.