A production-grade model serving layer that wraps Ollama behind an OpenAI-compatible API with request batching, SSE streaming, backpressure, graceful degradation, and Prometheus metrics.
```
Client (curl / OpenAI SDK)
               │
               ▼
┌─────────────────────────────────────┐
│  Caddy (reverse proxy, auto-TLS)    │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│           FastAPI Server            │
│  ├─ /v1/chat/completions            │  ← OpenAI-compatible endpoint
│  ├─ /health, /ready                 │  ← Health probes
│  ├─ /metrics                        │  ← Prometheus scrape
│              │                      │
│      Bounded Request Queue          │
│  ┌────────────────────────┐         │
│  │ max_size=50, 503 when  │         │  ← Backpressure via Retry-After
│  │ full                   │         │
│  └─────────┬──────────────┘         │
│            │                        │
│      Batch Dispatcher               │
│  ┌─────────▼──────────────┐         │
│  │ Collect for 100ms or   │         │  ← Naive batching (time/size)
│  │ 8 requests             │         │
│  └─────────┬──────────────┘         │
│            │                        │
│    InferenceBackend Protocol        │
│  ┌─────────▼──────────────┐         │
│  │     OllamaBackend      │         │  ← HTTP client to Ollama
│  └────────────────────────┘         │
└─────────────────────────────────────┘
       │
       ▼
┌──────────────┐    ┌───────────────┐
│    Ollama    │    │ Grafana Alloy │  → Grafana Cloud
│  (llama3.2)  │    │  (telemetry)  │    (metrics + logs)
└──────────────┘    └───────────────┘
```
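The batch dispatcher in the diagram collects requests for up to 100 ms or until 8 have arrived, then sends them to the backend concurrently. A minimal sketch of that time/size pattern (an illustration assuming an `asyncio.Queue` and a per-request handler; the real `queue/batcher.py` differs in detail):

```python
import asyncio
import time


async def dispatch_batches(
    queue: asyncio.Queue,
    handle_request,
    max_size: int = 8,
    max_wait_ms: int = 100,
) -> None:
    """Collect requests until max_size or max_wait_ms elapses, then dispatch together."""
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        deadline = time.monotonic() + max_wait_ms / 1000
        while len(batch) < max_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        # Send the whole batch to the backend concurrently
        await asyncio.gather(*(handle_request(req) for req in batch))
```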
- OpenAI-compatible API — `/v1/chat/completions` with streaming and non-streaming modes (see the SDK sketch after this list)
- SSE streaming — Token-by-token delivery in OpenAI chunk format
- Request batching — Naive batching with configurable time/size thresholds, concurrent dispatch
- Backpressure — Bounded request queue; returns 503 + `Retry-After` when full
- Prometheus metrics — 11 custom inference metrics: TTFT, tokens/sec, queue depth, error rates, latency histograms
- Observability — Grafana Alloy ships metrics and logs to Grafana Cloud
- Graceful shutdown — SIGTERM → reject new requests → drain in-flight → close backend
- API key auth — Optional Bearer token authentication
- Model warm-up — Dummy request on startup to pre-load the model into memory
- Structured logging — JSON logs via structlog with per-request `request_id`, model, and stream context
- Health checks — `/health` (liveness) and `/ready` (readiness with backend + queue status)
- Error handling — Proper HTTP status codes (401, 404, 502, 503, 504) with structured error responses
- Backend abstraction — `InferenceBackend` Protocol enables swapping Ollama for other backends
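Because the endpoint follows the OpenAI schema, the official Python SDK can point at the server directly. A sketch assuming the `openai` package and the default local port (the `api_key` value is arbitrary unless `API_KEY` is set on the server):

```python
from openai import OpenAI

# Point the official SDK at the local server; the key is only checked if API_KEY is set.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(resp.choices[0].message.content)

# Streaming works the same way via stream=True
for chunk in client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
```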
- Docker and Docker Compose
- ~4 GB disk space for the llama3.2 model
```bash
# Start the stack (API + Ollama + Caddy)
docker compose up -d

# Pull a model (first time only)
docker exec model-serving-ollama ollama pull llama3.2

# Wait for readiness
curl http://localhost:8000/ready
```

Non-streaming:
```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }' | python -m json.tool
```

Streaming:
```bash
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

Benchmarks run against the production server (Hetzner CAX21, ARM64, 8 GB RAM) running llama3.2 (3B). Regenerate with `scripts/experiments.py`.
| Metric | Value |
|---|---|
| Token generation rate | 7.2–7.5 tok/s (consistent across runs) |
| Sustained throughput (concurrency=5, 2 min) | 56 requests, 990 tokens, 7.6 tok/s |
| Baseline latency (single request) | 2.0–2.9s |
| Concurrency | Avg Latency | Success Rate |
|---|---|---|
| 1 | 7.4s | 100% |
| 5 | 10.0s | 100% |
| 10 | 18.7s | 100% |
| 20 | 42.9s | 100% |
| 30 | 20.9s | 27% (22 timeouts) |
| 50 | 43.7s | 16% (42 timeouts) |
| 60 | 39.2s | 20% (10 rejected, 38 timeouts) |
At concurrency 30+, requests start timing out (60s limit) because Ollama processes sequentially. At 60, the queue (max=50) rejects the overflow with 503.
100 simultaneous requests against a queue of 50:
- 50 rejected instantly with 503 (avg response time: 0.87s)
- 50 accepted into the queue — all timed out (504) because Ollama can't process 50 requests in 60s
The queue protects the system: rejected clients get an instant response and can retry, instead of waiting indefinitely.
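A well-behaved client can treat that 503 as a backoff signal. A minimal retry sketch using `requests` (illustrative; the endpoint and `Retry-After` header come from the sections above):

```python
import time

import requests


def chat_with_retry(
    payload: dict,
    url: str = "http://localhost:8000/v1/chat/completions",
    attempts: int = 5,
) -> dict:
    """Retry on 503 (queue full), honoring the Retry-After header the server sends."""
    for _ in range(attempts):
        resp = requests.post(url, json=payload, timeout=90)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.json()
        # Back off for the interval the server suggests (default to 1s if absent)
        time.sleep(float(resp.headers.get("Retry-After", 1)))
    raise RuntimeError("gave up after repeated 503s")
```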
| Mode | Sequential Latency | Concurrent(5) Latency | TTFT |
|---|---|---|---|
| Non-streaming | 2.85s | 12.29s | — |
| Streaming | 3.03s | 8.35s | 0.85s (seq) / 5.99s (conc) |
Streaming is faster under concurrency because it bypasses the batch dispatcher and goes directly to the backend.
| Condition | Mean TTFT | Min | Max |
|---|---|---|---|
| No contention (sequential) | 0.87s | 0.53s | 0.93s |
| Concurrency=5 | 9.02s | 1.08s | 10.83s |
TTFT degrades ~10x under contention because requests queue behind each other at Ollama.
| Prompt | Avg Latency |
|---|---|
| Short (5 tokens) | 2.0s |
| Long (~50 tokens) | 4.9s |
| 5-turn conversation | 5.9s |
| 10-turn conversation | 7.8s |
All settings via environment variables:
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `llama3.2` | Default model name |
| `OLLAMA_URL` | `http://ollama:11434` | Ollama backend URL |
| `MAX_QUEUE_SIZE` | `50` | Max concurrent requests before 503 |
| `BATCH_WAIT_MS` | `100` | Max time to wait for a batch to fill |
| `BATCH_MAX_SIZE` | `8` | Max requests per batch |
| `REQUEST_TIMEOUT_SECONDS` | `60` | Per-request timeout |
| `SHUTDOWN_GRACE_SECONDS` | `30` | Drain timeout on shutdown |
| `API_KEY` | (none) | Bearer token for auth (disabled if unset) |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8000` | Server port |
| `LOG_LEVEL` | `info` | Log level |
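For example, a local run with a larger queue and a longer per-request timeout (same pattern as the development command later in this README; the values are illustrative):

```bash
OLLAMA_URL=http://localhost:11434 MAX_QUEUE_SIZE=100 REQUEST_TIMEOUT_SECONDS=120 \
  uv run uvicorn model_serving_api.main:app --reload
```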
OpenAI-compatible chat completion. Supports `stream: true` for SSE.

Request body:

```json
{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello"}],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 100,
  "stop": ["\n"],
  "stream": false
}
```

Error responses:
| Status | Code | Cause |
|---|---|---|
| 401 | `invalid_api_key` | Missing or invalid API key |
| 404 | `model_not_found` | Model not available in Ollama |
| 422 | — | Invalid request body |
| 502 | `backend_error` | Ollama connection failure |
| 503 | `backend_overloaded` | Ollama overloaded |
| 503 | `queue_full` | Request queue at capacity |
| 503 | `shutting_down` | Server is draining for shutdown |
| 504 | `request_timeout` | Backend response timeout |
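With `stream: true`, the response is an SSE stream of OpenAI-style chunks, roughly of this shape (illustrative values, not captured server output):

```
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","model":"llama3.2","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","model":"llama3.2","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```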
Liveness probe. Always returns 200.
Readiness probe. Returns 200 if backend is healthy and queue has capacity, 503 otherwise.
Prometheus metrics endpoint. Custom inference metrics:
| Metric | Type | Description |
|---|---|---|
| `request_latency_seconds` | Histogram | Total request latency |
| `time_to_first_token_seconds` | Histogram | Time to first content token (streaming) |
| `tokens_per_second` | Histogram | Token generation speed |
| `tokens_generated_total` | Counter | Total tokens generated |
| `active_requests` | Gauge | Currently processing requests |
| `request_queue_depth` | Gauge | Current queue depth |
| `model_loaded` | Gauge | Whether a model is loaded (0/1) |
| `inference_errors_total` | Counter | Errors by type (backend_error, model_not_found, overload, queue_full, auth_failure) |
| `requests_total` | Counter | Requests by model and HTTP status |
| `queue_wait_seconds` | Histogram | Time from arrival to queue slot |
| `backend_latency_seconds` | Histogram | Backend-reported processing time |
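These are standard `prometheus_client` metric types. A sketch of how a few of them are typically declared and updated (label names here are assumptions; the real definitions live in `metrics/prometheus.py`):

```python
from prometheus_client import Counter, Gauge, Histogram

# Declarations mirroring the table above; label sets are illustrative.
REQUEST_LATENCY = Histogram("request_latency_seconds", "Total request latency", ["model"])
TTFT = Histogram("time_to_first_token_seconds", "Time to first content token (streaming)", ["model"])
TOKENS_GENERATED = Counter("tokens_generated_total", "Total tokens generated", ["model"])
QUEUE_DEPTH = Gauge("request_queue_depth", "Current queue depth")

# Typical updates on the request path:
QUEUE_DEPTH.set(3)                                      # gauge: current value
TOKENS_GENERATED.labels(model="llama3.2").inc(42)       # counter: monotonically increasing
REQUEST_LATENCY.labels(model="llama3.2").observe(2.85)  # histogram: one observation
TTFT.labels(model="llama3.2").observe(0.85)
```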
The production stack runs on a Hetzner CAX21 (ARM64, 8GB RAM) with 4 containers:
| Container | Purpose |
|---|---|
| `model-serving-caddy` | Reverse proxy, auto-TLS via Let's Encrypt |
| `model-serving-api` | FastAPI application |
| `model-serving-ollama` | Ollama with the llama3.2 model |
| `model-serving-alloy` | Grafana Alloy — ships metrics and logs to Grafana Cloud |
```bash
# Deploy (from local machine)
./deploy/deploy.sh

# Or manually
ssh user@server "cd /opt/model-serving-api && docker compose pull && docker compose up -d"
```

The systemd service (`model-serving.service`) ensures the stack auto-starts on reboot.
```bash
# Install dependencies
uv sync --extra dev

# Run tests (87 tests)
uv run pytest tests/ -v

# Lint
uv run ruff check src/ tests/

# Run locally (requires Ollama running on localhost:11434)
OLLAMA_URL=http://localhost:11434 uv run uvicorn model_serving_api.main:app --reload
```

```
src/model_serving_api/
├── main.py              # FastAPI app, lifespan, warm-up, graceful shutdown
├── config.py            # Environment variable settings
├── api/
│   ├── completions.py   # /v1/chat/completions endpoint
│   └── health.py        # /health, /ready endpoints
├── backends/
│   ├── protocol.py      # InferenceBackend Protocol + data types
│   └── ollama.py        # OllamaBackend implementation
├── queue/
│   ├── manager.py       # Bounded request queue + backpressure
│   └── batcher.py       # Naive batch dispatcher
├── metrics/
│   └── prometheus.py    # Custom inference metrics
├── streaming/
│   └── sse.py           # SSE chunk formatting
└── middleware/
    └── logging.py       # Request ID + structured logging

deploy/
├── Caddyfile            # Caddy reverse proxy config
├── alloy/
│   └── config.alloy     # Grafana Alloy telemetry config
└── grafana/
    └── dashboard.json   # Pre-built Grafana dashboard (14 panels)

scripts/
├── load_test.py         # Quick load test script
└── experiments.py       # Comprehensive experiment suite (8 experiments)
```
| Decision | Choice | Why |
|---|---|---|
| Backend as Protocol | `InferenceBackend` with `generate()`, `stream()`, `health()` | Enables swapping Ollama for llama-cpp-python or other backends without touching serving logic |
| Bounded queue | Counter-based with 503 + `Retry-After` | Backpressure prevents cascading failures; simpler than `asyncio.Queue` since asyncio is single-threaded |
| Naive batching | Time/size threshold dispatch via `asyncio.gather` | Demonstrates the pattern; continuous batching requires direct model access |
| Custom Prometheus metrics | TTFT, tokens/sec, queue depth, error counters | Inference-specific metrics are more valuable than generic HTTP metrics |
| SSE streaming | `sse-starlette` with OpenAI chunk format | Industry standard; every client library works with it |
| Graceful shutdown | Lifespan teardown with drain polling | Uvicorn handles SIGTERM; the app drains in-flight requests before closing the backend |
| Caddy for TLS | Auto-certificate via Let's Encrypt | Zero-config HTTPS with automatic renewal |
| Grafana Alloy | Single agent for metrics + logs | Replaces both a Prometheus server and a log shipper; Grafana Cloud free tier for storage |
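For reference, the backend seam in the first row is roughly a `typing.Protocol` of this shape (a sketch with placeholder data types; the real signatures live in `backends/protocol.py`):

```python
from typing import AsyncIterator, Protocol


# Placeholder request/response types; the real data classes live in backends/protocol.py.
class ChatRequest: ...
class ChatResponse: ...
class ChatChunk: ...


class InferenceBackend(Protocol):
    """Seam that lets the serving layer swap Ollama for another engine."""

    async def generate(self, request: ChatRequest) -> ChatResponse:
        """Run a non-streaming completion and return the full response."""
        ...

    def stream(self, request: ChatRequest) -> AsyncIterator[ChatChunk]:
        """Yield chunks as tokens arrive (implemented as an async generator)."""
        ...

    async def health(self) -> bool:
        """Report whether the backend is reachable."""
        ...
```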