Model Serving API

A production-grade model serving layer that wraps Ollama behind an OpenAI-compatible API with request batching, SSE streaming, backpressure, graceful degradation, and Prometheus metrics.

Architecture

Client (curl / OpenAI SDK)
    │
    ▼
┌─────────────────────────────────────┐
│  Caddy (reverse proxy, auto-TLS)   │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│  FastAPI Server                     │
│  ├─ /v1/chat/completions           │  ← OpenAI-compatible endpoint
│  ├─ /health, /ready                │  ← Health probes
│  ├─ /metrics                       │  ← Prometheus scrape
│  │                                  │
│  Bounded Request Queue              │
│  ┌────────────────────────┐        │
│  │ max_size=50, 503 when  │        │  ← Backpressure via Retry-After
│  │ full                    │        │
│  └─────────┬──────────────┘        │
│            │                        │
│  Batch Dispatcher                   │
│  ┌─────────▼──────────────┐        │
│  │ Collect for 100ms or   │        │  ← Naive batching (time/size)
│  │ 8 requests              │        │
│  └─────────┬──────────────┘        │
│            │                        │
│  InferenceBackend Protocol          │
│  ┌─────────▼──────────────┐        │
│  │ OllamaBackend          │        │  ← HTTP client to Ollama
│  └────────────────────────┘        │
└─────────────────────────────────────┘
    │
    ▼
┌──────────────┐     ┌───────────────┐
│  Ollama      │     │  Grafana Alloy│ → Grafana Cloud
│  (llama3.2)  │     │  (telemetry)  │   (metrics + logs)
└──────────────┘     └───────────────┘

Features

  • OpenAI-compatible API — /v1/chat/completions with streaming and non-streaming modes
  • SSE streaming — Token-by-token delivery in OpenAI chunk format
  • Request batching — Naive batching with configurable time/size thresholds, concurrent dispatch
  • Backpressure — Bounded request queue; returns 503 + Retry-After when full
  • Prometheus metrics — 11 custom inference metrics: TTFT, tokens/sec, queue depth, error rates, latency histograms
  • Observability — Grafana Alloy ships metrics and logs to Grafana Cloud
  • Graceful shutdown — SIGTERM → reject new requests → drain in-flight → close backend
  • API key auth — Optional Bearer token authentication
  • Model warm-up — Dummy request on startup to pre-load the model into memory
  • Structured logging — JSON logs via structlog with per-request request_id, model, and stream context
  • Health checks — /health (liveness) and /ready (readiness with backend + queue status)
  • Error handling — Proper HTTP status codes (401, 404, 502, 503, 504) with structured error responses
  • Backend abstraction — InferenceBackend Protocol enables swapping Ollama for other backends

Quickstart

Prerequisites

  • Docker and Docker Compose
  • ~4 GB disk space for the llama3.2 model

Run

# Start the stack (API + Ollama + Caddy)
docker compose up -d

# Pull a model (first time only)
docker exec model-serving-ollama ollama pull llama3.2

# Wait for readiness
curl http://localhost:8000/ready

Try it

Non-streaming:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }' | python -m json.tool

Streaming:

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
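
The same endpoint works with the official OpenAI Python SDK by overriding its base URL. A minimal sketch, assuming the server is reachable at localhost:8000 and API_KEY is unset (the SDK still requires a placeholder key):

from openai import OpenAI

# Point the SDK at this server instead of api.openai.com.
# api_key is a placeholder; pass the real key if API_KEY is set on the server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(resp.choices[0].message.content)

# Streaming: iterate over chunks as they arrive
for chunk in client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")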

Benchmarks

Benchmarks were run against the production server (Hetzner CAX21, ARM64, 8GB RAM) running llama3.2 (3B). Regenerate with scripts/experiments.py.

Throughput

| Metric | Value |
|---|---|
| Token generation rate | 7.2–7.5 tok/s (consistent across runs) |
| Sustained throughput (concurrency=5, 2 min) | 56 requests, 990 tokens, 7.6 tok/s |
| Baseline latency (single request) | 2.0–2.9s |

Latency vs Concurrency

| Concurrency | Avg Latency | Success Rate |
|---|---|---|
| 1 | 7.4s | 100% |
| 5 | 10.0s | 100% |
| 10 | 18.7s | 100% |
| 20 | 42.9s | 100% |
| 30 | 20.9s | 27% (22 timeouts) |
| 50 | 43.7s | 16% (42 timeouts) |
| 60 | 39.2s | 20% (10 rejected, 38 timeouts) |

At concurrency 30+, requests start timing out (60s limit) because Ollama processes requests sequentially. At 60, the queue (max=50) rejects the overflow with 503.

Backpressure

100 simultaneous requests against a queue of 50:

  • 50 rejected instantly with 503 (avg response time: 0.87s)
  • 50 accepted into the queue — all timed out (504) because Ollama can't process 50 requests in 60s

The queue protects the system: rejected clients get an instant response and can retry, instead of waiting indefinitely.
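
Clients can honor that signal with a simple backoff loop. A minimal sketch (not part of the repo), assuming the 503 responses carry the Retry-After header described above:

import time
import requests

def chat_with_retry(payload, url="http://localhost:8000/v1/chat/completions", attempts=5):
    """POST a chat completion, backing off whenever the queue is full (503 + Retry-After)."""
    for _ in range(attempts):
        resp = requests.post(url, json=payload, timeout=90)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.json()
        # Queue full or shutting down: wait the advertised interval (default 1s), then retry
        time.sleep(float(resp.headers.get("Retry-After", 1)))
    raise RuntimeError("gave up after repeated 503s")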

Streaming vs Non-Streaming

| Mode | Sequential Latency | Concurrent (5) Latency | TTFT |
|---|---|---|---|
| Non-streaming | 2.85s | 12.29s | n/a |
| Streaming | 3.03s | 8.35s | 0.85s (seq) / 5.99s (conc) |

Streaming is faster under concurrency because it bypasses the batch dispatcher and goes directly to the backend.

Time to First Token

| Condition | Mean TTFT | Min | Max |
|---|---|---|---|
| No contention (sequential) | 0.87s | 0.53s | 0.93s |
| Concurrency=5 | 9.02s | 1.08s | 10.83s |

TTFT degrades ~10x under contention because requests queue behind each other at Ollama.

Prompt Length Impact

| Prompt | Avg Latency |
|---|---|
| Short (5 tokens) | 2.0s |
| Long (~50 tokens) | 4.9s |
| 5-turn conversation | 5.9s |
| 10-turn conversation | 7.8s |

Configuration

All settings are configured via environment variables:

| Variable | Default | Description |
|---|---|---|
| MODEL_NAME | llama3.2 | Default model name |
| OLLAMA_URL | http://ollama:11434 | Ollama backend URL |
| MAX_QUEUE_SIZE | 50 | Max concurrent requests before 503 |
| BATCH_WAIT_MS | 100 | Max time to wait for a batch to fill |
| BATCH_MAX_SIZE | 8 | Max requests per batch |
| REQUEST_TIMEOUT_SECONDS | 60 | Per-request timeout |
| SHUTDOWN_GRACE_SECONDS | 30 | Drain timeout on shutdown |
| API_KEY | (none) | Bearer token for auth (disabled if unset) |
| HOST | 0.0.0.0 | Server bind address |
| PORT | 8000 | Server port |
| LOG_LEVEL | info | Log level |
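
For example, to run the server locally with a smaller queue, a longer per-request timeout, and API key auth enabled (values here are illustrative):

MAX_QUEUE_SIZE=20 \
REQUEST_TIMEOUT_SECONDS=120 \
API_KEY=changeme \
OLLAMA_URL=http://localhost:11434 \
uv run uvicorn model_serving_api.main:app

# With API_KEY set, clients must send the matching Bearer token
curl -s http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer changeme" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hi"}]}'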

API Reference

POST /v1/chat/completions

OpenAI-compatible chat completion. Supports stream: true for SSE.

Request body:

{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello"}],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 100,
  "stop": ["\n"],
  "stream": false
}
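
With "stream": true, the response arrives as SSE events in the OpenAI chunk format. A representative, abridged stream (field values are illustrative):

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1700000000,"model":"llama3.2","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1700000000,"model":"llama3.2","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1700000000,"model":"llama3.2","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]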

Error responses:

| Status | Code | Cause |
|---|---|---|
| 401 | invalid_api_key | Missing or invalid API key |
| 404 | model_not_found | Model not available in Ollama |
| 422 | | Invalid request body |
| 502 | backend_error | Ollama connection failure |
| 503 | backend_overloaded | Ollama overloaded |
| 503 | queue_full | Request queue at capacity |
| 503 | shutting_down | Server is draining for shutdown |
| 504 | request_timeout | Backend response timeout |

GET /health

Liveness probe. Always returns 200.

GET /ready

Readiness probe. Returns 200 if backend is healthy and queue has capacity, 503 otherwise.

GET /metrics

Prometheus metrics endpoint. Custom inference metrics:

| Metric | Type | Description |
|---|---|---|
| request_latency_seconds | Histogram | Total request latency |
| time_to_first_token_seconds | Histogram | Time to first content token (streaming) |
| tokens_per_second | Histogram | Token generation speed |
| tokens_generated_total | Counter | Total tokens generated |
| active_requests | Gauge | Currently processing requests |
| request_queue_depth | Gauge | Current queue depth |
| model_loaded | Gauge | Whether a model is loaded (0/1) |
| inference_errors_total | Counter | Errors by type (backend_error, model_not_found, overload, queue_full, auth_failure) |
| requests_total | Counter | Requests by model and HTTP status |
| queue_wait_seconds | Histogram | Time from arrival to queue slot |
| backend_latency_seconds | Histogram | Backend-reported processing time |
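
These can be queried from Grafana or Prometheus in the usual way. Example PromQL, assuming the standard prometheus_client naming in which histograms expose *_bucket series:

# p95 time to first token over the last 5 minutes
histogram_quantile(0.95, sum(rate(time_to_first_token_seconds_bucket[5m])) by (le))

# overall request rate
sum(rate(requests_total[5m]))

# current queue depth
request_queue_depth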

Deployment

The production stack runs on a Hetzner CAX21 (ARM64, 8GB RAM) with 4 containers:

| Container | Purpose |
|---|---|
| model-serving-caddy | Reverse proxy, auto-TLS via Let's Encrypt |
| model-serving-api | FastAPI application |
| model-serving-ollama | Ollama with the llama3.2 model |
| model-serving-alloy | Grafana Alloy — ships metrics and logs to Grafana Cloud |

# Deploy (from local machine)
./deploy/deploy.sh

# Or manually
ssh user@server "cd /opt/model-serving-api && docker compose pull && docker compose up -d"

The systemd service (model-serving.service) ensures the stack auto-starts on reboot.

Development

# Install dependencies
uv sync --extra dev

# Run tests (87 tests)
uv run pytest tests/ -v

# Lint
uv run ruff check src/ tests/

# Run locally (requires Ollama running on localhost:11434)
OLLAMA_URL=http://localhost:11434 uv run uvicorn model_serving_api.main:app --reload

Project Structure

src/model_serving_api/
├── main.py              # FastAPI app, lifespan, warm-up, graceful shutdown
├── config.py            # Environment variable settings
├── api/
│   ├── completions.py   # /v1/chat/completions endpoint
│   └── health.py        # /health, /ready endpoints
├── backends/
│   ├── protocol.py      # InferenceBackend Protocol + data types
│   └── ollama.py        # OllamaBackend implementation
├── queue/
│   ├── manager.py       # Bounded request queue + backpressure
│   └── batcher.py       # Naive batch dispatcher
├── metrics/
│   └── prometheus.py    # Custom inference metrics
├── streaming/
│   └── sse.py           # SSE chunk formatting
└── middleware/
    └── logging.py       # Request ID + structured logging

deploy/
├── Caddyfile            # Caddy reverse proxy config
├── alloy/
│   └── config.alloy     # Grafana Alloy telemetry config
└── grafana/
    └── dashboard.json   # Pre-built Grafana dashboard (14 panels)

scripts/
├── load_test.py         # Quick load test script
└── experiments.py       # Comprehensive experiment suite (8 experiments)

Design Decisions

| Decision | Choice | Why |
|---|---|---|
| Backend as Protocol | InferenceBackend with generate(), stream(), health() | Enables swapping Ollama for llama-cpp-python or other backends without touching serving logic |
| Bounded queue | Counter-based with 503 + Retry-After | Backpressure prevents cascading failures; simpler than asyncio.Queue since asyncio is single-threaded |
| Naive batching | Time/size threshold dispatch via asyncio.gather | Demonstrates the pattern; continuous batching requires direct model access |
| Custom Prometheus metrics | TTFT, tokens/sec, queue depth, error counters | Inference-specific metrics are more valuable than generic HTTP metrics |
| SSE streaming | sse-starlette with OpenAI chunk format | Industry standard; every client library works with it |
| Graceful shutdown | Lifespan teardown with drain polling | Uvicorn handles SIGTERM; app drains in-flight requests before closing backend |
| Caddy for TLS | Auto-certificate via Let's Encrypt | Zero-config HTTPS with automatic renewal |
| Grafana Alloy | Single agent for metrics + logs | Replaces both Prometheus server and log shipper; Grafana Cloud free tier for storage |
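
For reference, the backend seam from the first row could look roughly like the sketch below. This is a simplified illustration, not the code in backends/protocol.py; the real request/response types replace the dict placeholders:

from typing import AsyncIterator, Protocol


class InferenceBackend(Protocol):
    """Structural interface a backend must satisfy (sketch; see backends/protocol.py)."""

    async def generate(self, request: dict) -> dict:
        """Run a non-streaming completion and return the full response."""
        ...

    def stream(self, request: dict) -> AsyncIterator[dict]:
        """Yield response chunks as tokens arrive (typically an async generator)."""
        ...

    async def health(self) -> bool:
        """Report whether the backend is reachable and ready to serve."""
        ...

Any class with matching methods, such as OllamaBackend or a future llama-cpp-python backend, satisfies the Protocol without inheriting from it.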
