Docker-first inference stack for chat + embeddings, with one API gateway and multiple model backends.
- Unified FastAPI gateway: model_inference/model_inference.py
- Backend clients:
  - model_inference/llm_client.py
  - model_inference/embedding_client.py
  - model_inference/models.py
- Docker build files:
  - deployment/model_inference/Dockerfile.api
  - deployment/model_inference/Dockerfile.vllm.cuda
  - deployment/model_inference/Dockerfile.vllm.rocm
  - deployment/model_inference/Dockerfile.llama.cuda
  - deployment/model_inference/Dockerfile.llama.rocm
- Compose orchestration: docker-compose.yml
- vLLM routing with both CUDA and ROCm profiles is supported.
- CUDA memory constraints/defaults are tuned for an RTX 4090 Laptop GPU (16 GB VRAM).
- ROCm compose defaults are optimized for WSL2 passthrough (/dev/dxg + WSL-mounted ROCm bridge libs).
- ROCm GPU-utilization defaults are tuned for an RX 9070 XT (16 GB VRAM).
- For native Linux ROCm, passthrough must be adjusted to /dev/kfd and /dev/dri.
- llama.cpp needs further hardening/fixes (especially CUDA build/link behavior) before relying on it in production.
- Primary development/runtime profile is WSL2.
- CUDA profile tuning target GPU: NVIDIA RTX 4090 Laptop GPU (16 GB VRAM).
- ROCm profile tuning target GPU: AMD Radeon RX 9070 XT (16 GB VRAM).
- Native Linux ROCm passthrough alternatives are intentionally kept as commented guidance in docker-compose.yml (and related inference container definitions), so you can switch from the WSL2 /dev/dxg mapping to /dev/kfd + /dev/dri.

- Chat LLM: Qwen/Qwen2.5-7B-Instruct-AWQ
- Embeddings:
  - intfloat/multilingual-e5-large-instruct
  - Qwen/Qwen3-Embedding-4B
Prerequisites:
- Docker + Docker Compose
- NVIDIA setup for the cuda profile, or ROCm setup for the rocm profile
- ROCm note: current vLLM compose mappings are WSL2-first. For native Linux, switch device passthrough in docker-compose.yml as annotated in comments.
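Before choosing between the WSL2 and native Linux mappings, it can help to check which device nodes the host actually exposes. The following is a minimal sketch, not part of this repository: the function name detect_rocm_passthrough and the two constants are hypothetical, and it only inspects the paths named above (/dev/dxg for WSL2, /dev/kfd and /dev/dri for native Linux).

```python
import os
from typing import Callable, List

# Device paths taken from the notes above; the constant names are illustrative.
WSL2_DEVICES = ["/dev/dxg"]
NATIVE_LINUX_DEVICES = ["/dev/kfd", "/dev/dri"]


def detect_rocm_passthrough(
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return the device paths a ROCm container would need on this host.

    WSL2 is checked first because it is the default profile here.
    An empty list means neither set of device nodes was found.
    """
    if all(exists(path) for path in WSL2_DEVICES):
        return list(WSL2_DEVICES)
    if all(exists(path) for path in NATIVE_LINUX_DEVICES):
        return list(NATIVE_LINUX_DEVICES)
    return []


if __name__ == "__main__":
    devices = detect_rocm_passthrough()
    print(devices if devices else "no ROCm passthrough devices found")
```

If the result is the native Linux pair, the commented alternatives in docker-compose.yml are the mappings to enable; the sketch does not edit the compose file itself.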
Start API + CUDA vLLM services:
docker compose --profile cuda up --build -d \
model_inference \
vllm_llm_cuda_qwen2-5_7b_instruct_awq \
vllm_embed_cuda_intfloat_multilingual_e5_large_instruct

Optional CUDA Qwen embedding service:
docker compose --profile cuda up --build -d vllm_embed_cuda_qwen3_embedding_4b

Start API + ROCm vLLM services:
docker compose --profile rocm up --build -d \
model_inference \
vllm_llm_rocm_qwen2-5_7b_instruct_awq \
vllm_embed_rocm_intfloat_multilingual_e5_large_instruct

Endpoints:
- GET /health
- POST /chat
- POST /chat/stream
- POST /embeddings
- POST /mcp/call
- GET /events
API host port: http://localhost:8080
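Because the services are started in detached mode, the gateway may not be reachable immediately. The sketch below polls GET /health on the host port above until it answers. It is illustrative only: the helper names probe_health and wait_until_healthy are hypothetical, and treating HTTP 200 as "ready" is an assumption, since the response format of /health is not described here.

```python
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8080/health"


def probe_health(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if GET url answers with HTTP 200 (assumed readiness signal)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses: connection refused,
        # timeouts, and non-2xx responses all count as "not ready yet".
        return False


def wait_until_healthy(
    probe: Callable[[], bool] = probe_health,
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds in between."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_healthy() with no arguments waits up to roughly a minute; the probe and sleep parameters are injectable so the loop can be exercised without a running gateway.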
Health:
curl -s http://localhost:8080/health

Chat (explicit model + route):
curl -s -X POST http://localhost:8080/chat \
-H "Content-Type: application/json" \
-d '{
"message": "what is the capital of oregon usa?",
"session_id": null,
"system": null,
"temperature": 0.2,
"max_tokens": 512,
"model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
"backend": "vllm",
"framework": "cuda"
}'

Embeddings:
curl -s -X POST http://localhost:8080/embeddings \
-H "Content-Type: application/json" \
-d '{
"texts": ["hello world"],
"model": "Qwen/Qwen3-Embedding-4B",
"normalize": false,
"prefix": "query",
"backend": "vllm",
"framework": "rocm"
}'

- model in /chat is optional. If omitted on the vLLM route, it defaults to Qwen/Qwen2.5-7B-Instruct-AWQ.
- Qwen embedding requests are routed to the 7001 embedding endpoint when available.
- For llama services, place GGUF files in deployment/model_inference/compatible_weights/.
- Keep the ROCm comments in docker-compose.yml as-is: they document WSL2 defaults and the native Linux passthrough alternative.
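The curl examples above can also be expressed from Python using only the standard library. This is a sketch under stated assumptions rather than a client shipped with the project: the helper names (chat_request, embeddings_request, send) are hypothetical, the payload fields mirror the curl bodies, and send assumes the gateway replies with JSON, which is not confirmed here. Leaving model unset omits the field, so the gateway's documented vLLM default applies.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

BASE_URL = "http://localhost:8080"


def _post_json(path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request against the gateway."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_request(
    message: str,
    model: Optional[str] = None,
    backend: str = "vllm",
    framework: str = "cuda",
    temperature: float = 0.2,
    max_tokens: int = 512,
) -> urllib.request.Request:
    payload: Dict[str, Any] = {
        "message": message,
        "session_id": None,
        "system": None,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "backend": backend,
        "framework": framework,
    }
    if model is not None:
        # Omitted by default so the gateway picks its own default model.
        payload["model"] = model
    return _post_json("/chat", payload)


def embeddings_request(
    texts: List[str],
    model: str,
    backend: str = "vllm",
    framework: str = "rocm",
    normalize: bool = False,
    prefix: str = "query",
) -> urllib.request.Request:
    payload = {
        "texts": texts,
        "model": model,
        "normalize": normalize,
        "prefix": prefix,
        "backend": backend,
        "framework": framework,
    }
    return _post_json("/embeddings", payload)


def send(req: urllib.request.Request, timeout: float = 60.0) -> Any:
    """Send a prepared request; assumes the response body is JSON."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, send(chat_request("what is the capital of oregon usa?")) would issue the same call as the chat curl example, minus the explicit model. Building the request separately from sending it keeps the payload easy to inspect before anything goes over the network.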
