Sunbelt Computer Software

ModelServer

Docker-first inference stack for chat + embeddings, with one API gateway and multiple model backends.

What is included

Unified FastAPI gateway: model_inference/model_inference.py
Backend clients:
- model_inference/llm_client.py
- model_inference/embedding_client.py
- model_inference/models.py
Docker build files:
- deployment/model_inference/Dockerfile.api
- deployment/model_inference/Dockerfile.vllm.cuda
- deployment/model_inference/Dockerfile.vllm.rocm
- deployment/model_inference/Dockerfile.llama.cuda
- deployment/model_inference/Dockerfile.llama.rocm
Compose orchestration: docker-compose.yml

Current status

vLLM routing with both CUDA and ROCm profiles is supported.
CUDA memory constraints/defaults are tuned for an RTX 4090 Laptop GPU (16 GB VRAM).
ROCm compose defaults are optimized for WSL2 passthrough (/dev/dxg + WSL-mounted ROCm bridge libs).
ROCm GPU-utilization defaults are tuned for an RX 9070 XT (16 GB VRAM).
For native Linux ROCm, passthrough must be adjusted to /dev/kfd and /dev/dri.
llama.cpp needs further hardening/fixes (especially CUDA build/link behavior) before relying on it in production.

Target hardware and platform

Primary development/runtime profile is WSL2.
CUDA profile tuning target GPU: NVIDIA RTX 4090 Laptop GPU (16 GB VRAM).
ROCm profile tuning target GPU: AMD Radeon RX 9070 XT (16 GB VRAM).
Native Linux ROCm passthrough alternatives are intentionally kept as commented guidance in docker-compose.yml (and related inference container definitions), so you can switch from WSL2 /dev/dxg mapping to /dev/kfd + /dev/dri.

Models

Chat LLM: Qwen/Qwen2.5-7B-Instruct-AWQ
Embeddings:
- intfloat/multilingual-e5-large-instruct
- Qwen/Qwen3-Embedding-4B

Run

Prerequisites:

Docker + Docker Compose
NVIDIA setup for cuda profile, or ROCm setup for rocm profile
ROCm note: current vLLM compose mappings are WSL2-first. For native Linux, switch device passthrough in docker-compose.yml as annotated in comments.

Start API + CUDA vLLM services:

docker compose --profile cuda up --build -d \
  model_inference \
  vllm_llm_cuda_qwen2-5_7b_instruct_awq \
  vllm_embed_cuda_intfloat_multilingual_e5_large_instruct

Optional CUDA Qwen embedding service:

docker compose --profile cuda up --build -d vllm_embed_cuda_qwen3_embedding_4b

Start API + ROCm vLLM services:

docker compose --profile rocm up --build -d \
  model_inference \
  vllm_llm_rocm_qwen2-5_7b_instruct_awq \
  vllm_embed_rocm_intfloat_multilingual_e5_large_instruct

API endpoints

GET /health
POST /chat
POST /chat/stream
POST /embeddings
POST /mcp/call
GET /events

API host port: http://localhost:8080

Sample API usage

Health:

curl -s http://localhost:8080/health

Chat (explicit model + route):

curl -s -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "what is the capital of oregon usa?",
    "session_id": null,
    "system": null,
    "temperature": 0.2,
    "max_tokens": 512,
    "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "backend": "vllm",
    "framework": "cuda"
  }'

Embeddings:

curl -s -X POST http://localhost:8080/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["hello world"],
    "model": "Qwen/Qwen3-Embedding-4B",
    "normalize": false,
    "prefix": "query",
    "backend": "vllm",
    "framework": "rocm"
  }'

Notes

model in /chat is optional. If omitted on vLLM route, it defaults to Qwen/Qwen2.5-7B-Instruct-AWQ.
Qwen embedding requests are routed to the 7001 embedding endpoint when available.
For llama services, place GGUF files in deployment/model_inference/compatible_weights/.
Keep the ROCm comments in docker-compose.yml as-is: they document WSL2 defaults and the native Linux passthrough alternative.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
deployment/model_inference		deployment/model_inference
model_inference		model_inference
README.md		README.md
docker-compose.yml		docker-compose.yml

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ModelServer

What is included

Current status

Target hardware and platform

Models

Run

API endpoints

Sample API usage

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Sunbelt Computer Software

PL/B Language Development and Support

Folders and files

Latest commit

History

Repository files navigation

ModelServer

What is included

Current status

Target hardware and platform

Models

Run

API endpoints

Sample API usage

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages