A production-grade model serving layer that wraps Ollama behind an OpenAI-compatible API with request batching, SSE streaming, backpressure, graceful degradation, and Prometheus metrics.
```
Client (curl / OpenAI SDK)
               │
               ▼
┌─────────────────────────────────────┐
│  Caddy (reverse proxy, auto-TLS)    │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│           FastAPI Server            │
│  ├─ /v1/chat/completions            │  ← OpenAI-compatible endpoint
│  ├─ /health, /ready                 │  ← Health probes
│  ├─ /metrics                        │  ← Prometheus scrape
│              │                      │
│      Bounded Request Queue          │
│  ┌────────────────────────┐         │
│  │ max_size=50, 503 when  │         │  ← Backpressure via Retry-After
│  │ full                   │         │
│  └─────────┬──────────────┘         │
│            │                        │
│      Batch Dispatcher               │
│  ┌─────────▼──────────────┐         │
│  │ Collect for 100ms or   │         │  ← Naive batching (time/size)
│  │ 8 requests             │         │
│  └─────────┬──────────────┘         │
│            │                        │
│    InferenceBackend Protocol        │
│  ┌─────────▼──────────────┐         │
│  │     OllamaBackend      │         │  ← HTTP client to Ollama
│  └────────────────────────┘         │
└─────────────────────────────────────┘
       │
       ▼
┌──────────────┐    ┌───────────────┐
│    Ollama    │    │ Grafana Alloy │  → Grafana Cloud
│  (llama3.2)  │    │  (telemetry)  │    (metrics + logs)
└──────────────┘    └───────────────┘
```
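The batch dispatcher in the diagram collects requests for up to 100 ms or until 8 have arrived, then sends them to the backend concurrently. A minimal sketch of that time/size pattern (an illustration assuming an `asyncio.Queue` and a per-request handler; the real `queue/batcher.py` differs in detail):

```python
import asyncio
import time


async def dispatch_batches(
    queue: asyncio.Queue,
    handle_request,
    max_size: int = 8,
    max_wait_ms: int = 100,
) -> None:
    """Collect requests until max_size or max_wait_ms elapses, then dispatch together."""
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        deadline = time.monotonic() + max_wait_ms / 1000
        while len(batch) < max_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        # Send the whole batch to the backend concurrently
        await asyncio.gather(*(handle_request(req) for req in batch))
```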
- OpenAI-compatible API — `/v1/chat/completions` with streaming and non-streaming modes (see the SDK sketch after this list)
- SSE streaming — Token-by-token delivery in OpenAI chunk format
- Request batching — Naive batching with configurable time/size thresholds, concurrent dispatch
- Backpressure — Bounded request queue; returns 503 + `Retry-After` when full
- Prometheus metrics — 11 custom inference metrics: TTFT, tokens/sec, queue depth, error rates, latency histograms
- Observability — Grafana Alloy ships metrics and logs to Grafana Cloud
- Graceful shutdown — SIGTERM → reject new requests → drain in-flight → close backend
- API key auth — Optional Bearer token authentication
- Model warm-up — Dummy request on startup to pre-load the model into memory
- Structured logging — JSON logs via structlog with per-request `request_id`, model, and stream context
- Health checks — `/health` (liveness) and `/ready` (readiness with backend + queue status)
- Error handling — Proper HTTP status codes (401, 404, 502, 503, 504) with structured error responses
- Backend abstraction — `InferenceBackend` Protocol enables swapping Ollama for other backends
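Because the endpoint follows the OpenAI schema, the official Python SDK can point at the server directly. A sketch assuming the `openai` package and the default local port (the `api_key` value is arbitrary unless `API_KEY` is set on the server):

```python
from openai import OpenAI

# Point the official SDK at the local server; the key is only checked if API_KEY is set.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(resp.choices[0].message.content)

# Streaming works the same way via stream=True
for chunk in client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
```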
- Docker and Docker Compose
- ~4 GB disk space for the llama3.2 model
```bash
# Start the stack (API + Ollama + Caddy)
docker compose up -d

# Pull a model (first time only)
docker exec model-serving-ollama ollama pull llama3.2

# Wait for readiness
curl http://localhost:8000/ready
```

Non-streaming:
```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }' | python -m json.tool
```

Streaming:
```bash
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

Benchmarks run against the production server (Hetzner CAX21, ARM64, 8 GB RAM) running llama3.2 (3B). Regenerate with `scripts/experiments.py`.
| Metric | Value |
|---|---|
| Token generation rate | 7.2–7.5 tok/s (consistent across runs) |
| Sustained throughput (concurrency=5, 2 min) | 56 requests, 990 tokens, 7.6 tok/s |
| Baseline latency (single request) | 2.0–2.9s |
| Concurrency | Avg Latency | Success Rate |
|---|---|---|
| 1 | 7.4s | 100% |
| 5 | 10.0s | 100% |
| 10 | 18.7s | 100% |
| 20 | 42.9s | 100% |
| 30 | 20.9s | 27% (22 timeouts) |
| 50 | 43.7s | 16% (42 timeouts) |
| 60 | 39.2s | 20% (10 rejected, 38 timeouts) |
At concurrency 30+, requests start timing out (60s limit) because Ollama processes sequentially. At 60, the queue (max=50) rejects the overflow with 503.
100 simultaneous requests against a queue of 50:
- 50 rejected instantly with 503 (avg response time: 0.87s)
- 50 accepted into the queue — all timed out (504) because Ollama can't process 50 requests in 60s
The queue protects the system: rejected clients get an instant response and can retry, instead of waiting indefinitely.
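A well-behaved client can treat that 503 as a backoff signal. A minimal retry sketch using `requests` (illustrative; the endpoint and `Retry-After` header come from the sections above):

```python
import time

import requests


def chat_with_retry(
    payload: dict,
    url: str = "http://localhost:8000/v1/chat/completions",
    attempts: int = 5,
) -> dict:
    """Retry on 503 (queue full), honoring the Retry-After header the server sends."""
    for _ in range(attempts):
        resp = requests.post(url, json=payload, timeout=90)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.json()
        # Back off for the interval the server suggests (default to 1s if absent)
        time.sleep(float(resp.headers.get("Retry-After", 1)))
    raise RuntimeError("gave up after repeated 503s")
```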
| Mode | Sequential Latency | Concurrent(5) Latency | TTFT |
|---|---|---|---|
| Non-streaming | 2.85s | 12.29s | — |
| Streaming | 3.03s | 8.35s | 0.85s (seq) / 5.99s (conc) |
Streaming is faster under concurrency because it bypasses the batch dispatcher and goes directly to the backend.
| Condition | Mean TTFT | Min | Max |
|---|---|---|---|
| No contention (sequential) | 0.87s | 0.53s | 0.93s |
| Concurrency=5 | 9.02s | 1.08s | 10.83s |
TTFT degrades ~10x under contention because requests queue behind each other at Ollama.
| Prompt | Avg Latency |
|---|---|
| Short (5 tokens) | 2.0s |
| Long (~50 tokens) | 4.9s |
| 5-turn conversation | 5.9s |
| 10-turn conversation | 7.8s |
All settings via environment variables:
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `llama3.2` | Default model name |
| `OLLAMA_URL` | `http://ollama:11434` | Ollama backend URL |
| `MAX_QUEUE_SIZE` | `50` | Max concurrent requests before 503 |
| `BATCH_WAIT_MS` | `100` | Max time to wait for a batch to fill |
| `BATCH_MAX_SIZE` | `8` | Max requests per batch |
| `REQUEST_TIMEOUT_SECONDS` | `60` | Per-request timeout |
| `SHUTDOWN_GRACE_SECONDS` | `30` | Drain timeout on shutdown |
| `API_KEY` | (none) | Bearer token for auth (disabled if unset) |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8000` | Server port |
| `LOG_LEVEL` | `info` | Log level |
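For example, a local run with a larger queue and a longer per-request timeout (same pattern as the development command later in this README; the values are illustrative):

```bash
OLLAMA_URL=http://localhost:11434 MAX_QUEUE_SIZE=100 REQUEST_TIMEOUT_SECONDS=120 \
  uv run uvicorn model_serving_api.main:app --reload
```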
OpenAI-compatible chat completion. Supports `stream: true` for SSE.

Request body:

```json
{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello"}],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 100,
  "stop": ["\n"],
  "stream": false
}
```

Error responses:
| Status | Code | Cause |
|---|---|---|
| 401 | `invalid_api_key` | Missing or invalid API key |
| 404 | `model_not_found` | Model not available in Ollama |
| 422 | — | Invalid request body |
| 502 | `backend_error` | Ollama connection failure |
| 503 | `backend_overloaded` | Ollama overloaded |
| 503 | `queue_full` | Request queue at capacity |
| 503 | `shutting_down` | Server is draining for shutdown |
| 504 | `request_timeout` | Backend response timeout |
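With `stream: true`, the response is an SSE stream of OpenAI-style chunks, roughly of this shape (illustrative values, not captured server output):

```
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","model":"llama3.2","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","model":"llama3.2","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```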
Liveness probe. Always returns 200.
Readiness probe. Returns 200 if backend is healthy and queue has capacity, 503 otherwise.
Prometheus metrics endpoint. Custom inference metrics:
| Metric | Type | Description |
|---|---|---|
| `request_latency_seconds` | Histogram | Total request latency |
| `time_to_first_token_seconds` | Histogram | Time to first content token (streaming) |
| `tokens_per_second` | Histogram | Token generation speed |
| `tokens_generated_total` | Counter | Total tokens generated |
| `active_requests` | Gauge | Currently processing requests |
| `request_queue_depth` | Gauge | Current queue depth |
| `model_loaded` | Gauge | Whether a model is loaded (0/1) |
| `inference_errors_total` | Counter | Errors by type (backend_error, model_not_found, overload, queue_full, auth_failure) |
| `requests_total` | Counter | Requests by model and HTTP status |
| `queue_wait_seconds` | Histogram | Time from arrival to queue slot |
| `backend_latency_seconds` | Histogram | Backend-reported processing time |
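These are standard `prometheus_client` metric types. A sketch of how a few of them are typically declared and updated (label names here are assumptions; the real definitions live in `metrics/prometheus.py`):

```python
from prometheus_client import Counter, Gauge, Histogram

# Declarations mirroring the table above; label sets are illustrative.
REQUEST_LATENCY = Histogram("request_latency_seconds", "Total request latency", ["model"])
TTFT = Histogram("time_to_first_token_seconds", "Time to first content token (streaming)", ["model"])
TOKENS_GENERATED = Counter("tokens_generated_total", "Total tokens generated", ["model"])
QUEUE_DEPTH = Gauge("request_queue_depth", "Current queue depth")

# Typical updates on the request path:
QUEUE_DEPTH.set(3)                                      # gauge: current value
TOKENS_GENERATED.labels(model="llama3.2").inc(42)       # counter: monotonically increasing
REQUEST_LATENCY.labels(model="llama3.2").observe(2.85)  # histogram: one observation
TTFT.labels(model="llama3.2").observe(0.85)
```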
The production stack runs on a Hetzner CAX21 (ARM64, 8GB RAM) with 4 containers:
| Container | Purpose |
|---|---|
| `model-serving-caddy` | Reverse proxy, auto-TLS via Let's Encrypt |
| `model-serving-api` | FastAPI application |
| `model-serving-ollama` | Ollama with the llama3.2 model |
| `model-serving-alloy` | Grafana Alloy — ships metrics and logs to Grafana Cloud |
```bash
# Deploy (from local machine)
./deploy/deploy.sh

# Or manually
ssh user@server "cd /opt/model-serving-api && docker compose pull && docker compose up -d"
```

The systemd service (`model-serving.service`) ensures the stack auto-starts on reboot.
```bash
# Install dependencies
uv sync --extra dev

# Run tests (87 tests)
uv run pytest tests/ -v

# Lint
uv run ruff check src/ tests/

# Run locally (requires Ollama running on localhost:11434)
OLLAMA_URL=http://localhost:11434 uv run uvicorn model_serving_api.main:app --reload
```

```
src/model_serving_api/
├── main.py              # FastAPI app, lifespan, warm-up, graceful shutdown
├── config.py            # Environment variable settings
├── api/
│   ├── completions.py   # /v1/chat/completions endpoint
│   └── health.py        # /health, /ready endpoints
├── backends/
│   ├── protocol.py      # InferenceBackend Protocol + data types
│   └── ollama.py        # OllamaBackend implementation
├── queue/
│   ├── manager.py       # Bounded request queue + backpressure
│   └── batcher.py       # Naive batch dispatcher
├── metrics/
│   └── prometheus.py    # Custom inference metrics
├── streaming/
│   └── sse.py           # SSE chunk formatting
└── middleware/
    └── logging.py       # Request ID + structured logging

deploy/
├── Caddyfile            # Caddy reverse proxy config
├── alloy/
│   └── config.alloy     # Grafana Alloy telemetry config
└── grafana/
    └── dashboard.json   # Pre-built Grafana dashboard (14 panels)

scripts/
├── load_test.py         # Quick load test script
└── experiments.py       # Comprehensive experiment suite (8 experiments)
```
| Decision | Choice | Why |
|---|---|---|
| Backend as Protocol | `InferenceBackend` with `generate()`, `stream()`, `health()` | Enables swapping Ollama for llama-cpp-python or other backends without touching serving logic |
| Bounded queue | Counter-based with 503 + `Retry-After` | Backpressure prevents cascading failures; simpler than `asyncio.Queue` since asyncio is single-threaded |
| Naive batching | Time/size threshold dispatch via `asyncio.gather` | Demonstrates the pattern; continuous batching requires direct model access |
| Custom Prometheus metrics | TTFT, tokens/sec, queue depth, error counters | Inference-specific metrics are more valuable than generic HTTP metrics |
| SSE streaming | `sse-starlette` with OpenAI chunk format | Industry standard; every client library works with it |
| Graceful shutdown | Lifespan teardown with drain polling | Uvicorn handles SIGTERM; the app drains in-flight requests before closing the backend |
| Caddy for TLS | Auto-certificate via Let's Encrypt | Zero-config HTTPS with automatic renewal |
| Grafana Alloy | Single agent for metrics + logs | Replaces both a Prometheus server and a log shipper; Grafana Cloud free tier for storage |
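For reference, the backend seam in the first row is roughly a `typing.Protocol` of this shape (a sketch with placeholder data types; the real signatures live in `backends/protocol.py`):

```python
from typing import AsyncIterator, Protocol


# Placeholder request/response types; the real data classes live in backends/protocol.py.
class ChatRequest: ...
class ChatResponse: ...
class ChatChunk: ...


class InferenceBackend(Protocol):
    """Seam that lets the serving layer swap Ollama for another engine."""

    async def generate(self, request: ChatRequest) -> ChatResponse:
        """Run a non-streaming completion and return the full response."""
        ...

    def stream(self, request: ChatRequest) -> AsyncIterator[ChatChunk]:
        """Yield chunks as tokens arrive (implemented as an async generator)."""
        ...

    async def health(self) -> bool:
        """Report whether the backend is reachable."""
        ...
```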