Production-oriented inference stack for 4-class medical image classification using ResNet50 + NVIDIA Triton + FastAPI, containerized with Docker Compose and validated with k6 load testing.
- Project Overview
- Why This Project Matters
- System Architecture
- Architecture Diagram (ASCII)
- Technology Stack
- Model Details
- Model Serving with Triton
- API Design
- Performance & Load Testing
- Worker Optimization Study
- Throughput Plateau Analysis
- Cold Start Behavior
- Scalability Strategy
- Deployment Instructions (Step-by-step)
- Hardware Specifications
- Key Engineering Decisions
- Future Improvements
- License
This repository demonstrates a practical, production-style ML inference system for medical image classification. The service accepts an image upload, runs preprocessing in FastAPI, sends tensors to NVIDIA Triton for GPU inference, then returns a predicted class with confidence and request latency.
Core objectives:
- Serve a deep learning model behind a stable HTTP API.
- Separate API responsibilities from inference execution.
- Use GPU acceleration and dynamic batching for throughput.
- Measure behavior under concurrent load and identify limits.
Many ML projects stop at model training and do not show how to deploy or operate inference systems under realistic traffic. This project focuses on the serving side:
- For ML engineers: demonstrates ONNX export and Triton deployment.
- For backend engineers: demonstrates API boundary design, validation flow, and latency tracking.
- For DevOps/platform engineers: demonstrates containerized GPU services and scaling boundaries.
- For hiring managers/recruiters: shows end-to-end execution from model packaging to load-tested API delivery.
The system intentionally splits responsibilities:
- FastAPI layer (`app/`)
  - Handles multipart uploads (`/predict`) and health checks (`/health`).
  - Runs image preprocessing and result postprocessing.
  - Measures end-to-end per-request latency.
  - Exposes a Prometheus metrics endpoint via instrumentation.
- Triton Inference Server (`model_repository/`)
  - Hosts the ONNX model with a GPU instance group.
  - Handles inference requests over HTTP.
  - Applies a dynamic batching policy for better GPU efficiency.
- Docker Compose (`docker-compose.yml`)
  - Boots the API and Triton as isolated services.
  - Mounts the model repository into Triton.
  - Exposes service ports for inference and observability.
Request lifecycle:
- Client sends an image to `POST /predict`.
- FastAPI preprocesses the image into a normalized NCHW FP32 tensor.
- The API sends the tensor to the Triton model `resnet_medical`.
- Triton returns logits.
- The API computes a softmax-like confidence, maps the class label, and returns a JSON response.
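A minimal sketch of this lifecycle inside the API layer is shown below. It is illustrative rather than the actual `app/` implementation: the Triton tensor names (`input`/`output`) and the handler structure are assumptions, and the real code may organize preprocessing and the Triton client differently.

```python
# Illustrative sketch only -- not the actual app/ implementation. Tensor names
# ("input"/"output") are assumptions; the real handler may differ. It mirrors
# the documented flow: upload -> preprocess -> Triton call -> postprocess.
import io
import time

import numpy as np
import tritonclient.http as httpclient
from fastapi import FastAPI, File, UploadFile
from PIL import Image

CLASSES = ["Glioma", "Meningioma", "Pituitary", "No Tumor"]

app = FastAPI()
triton = httpclient.InferenceServerClient(url="triton:8000")  # Triton HTTP port


@app.get("/health")
def health() -> dict:
    return {"status": "ok"}


@app.post("/predict")
async def predict(image: UploadFile = File(...)) -> dict:
    start = time.perf_counter()

    # Preprocess: RGB -> 224x224 -> [0,1] -> channel-first, batch of 1 (NCHW FP32)
    pil = Image.open(io.BytesIO(await image.read())).convert("RGB").resize((224, 224))
    tensor = np.transpose(np.asarray(pil, np.float32) / 255.0, (2, 0, 1))[None, ...]

    # Inference: forward the tensor to the resnet_medical model on Triton
    infer_input = httpclient.InferInput("input", list(tensor.shape), "FP32")
    infer_input.set_data_from_numpy(tensor)
    logits = triton.infer("resnet_medical", [infer_input]).as_numpy("output")[0]

    # Postprocess: softmax confidence and class label
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx = int(probs.argmax())

    return {
        "prediction": CLASSES[idx],
        "confidence": round(float(probs[idx]), 4),
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
    }
```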
+------------------------------+
| Client(s) |
| (web app, script, k6 load) |
+--------------+---------------+
|
| HTTP multipart/form-data
v
+-----------------------------------------------------------------------+
| FastAPI Service (API) |
|------------------------------------------------------------------------|
| /predict: parse upload -> preprocess -> Triton call -> postprocess |
| /health : liveness check |
| /metrics: Prometheus instrumentation |
+-------------------------------+----------------------------------------+
|
| HTTP infer request (FP32 tensor)
v
+-----------------------------------------------------------------------+
| NVIDIA Triton Inference Server |
|------------------------------------------------------------------------|
| Model: resnet_medical (ONNX Runtime backend) |
| Dynamic batching: preferred [4, 8, 16], queue delay 1000us |
| Instance group: GPU |
+-------------------------------+----------------------------------------+
|
v
+--------------+
| RTX 4060 GPU |
+--------------+
| Layer | Technology | Purpose |
|---|---|---|
| API | FastAPI + Uvicorn | Async HTTP serving and worker concurrency |
| Inference | NVIDIA Triton Inference Server | High-performance model serving |
| Model Runtime | ONNX Runtime (Triton backend) | Executes ONNX graph on GPU |
| Model | ResNet50 (4-class head) | Medical image classification |
| Containerization | Docker + Docker Compose | Reproducible multi-service deployment |
| Performance Testing | k6 | Load generation and latency/throughput measurement |
| Monitoring | Prometheus FastAPI Instrumentator | /metrics endpoint exposure |
- Base architecture: ResNet50
- Task type: Multi-class image classification
- Output classes (4):
- Glioma
- Meningioma
- Pituitary
- No Tumor
- Input tensor shape: `[batch, 3, 224, 224]` (FP32)
- Preprocessing pipeline: RGB conversion, resize to `224x224`, normalization to `[0,1]`, channel-first transform
- Export format: ONNX (`opset_version=17`) with dynamic batch axis

The model export script (`export_model.py`) configures the final classification head for 4 classes and exports an ONNX artifact for Triton consumption.
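A hedged sketch of this export step is shown below; `export_model.py` is the source of truth. The torchvision backbone, the tensor names `input`/`output`, and the versioned output path follow common Triton conventions and are assumptions, and loading of trained weights is omitted.

```python
# Hedged sketch of the export step; export_model.py is the source of truth.
# Assumptions: torchvision ResNet50 backbone, tensor names "input"/"output",
# and the standard Triton versioned path. Loading trained weights is omitted.
import torch
import torchvision

NUM_CLASSES = 4  # Glioma, Meningioma, Pituitary, No Tumor

model = torchvision.models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, NUM_CLASSES)  # 4-class head
model.eval()

dummy = torch.randn(1, 3, 224, 224, dtype=torch.float32)  # [batch, 3, 224, 224]
torch.onnx.export(
    model,
    dummy,
    "model_repository/resnet_medical/1/model.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # dynamic batch axis
)
```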
Triton model configuration (`model_repository/resnet_medical/config.pbtxt`):

- `name: "resnet_medical"`
- `platform: "onnxruntime_onnx"`
- `max_batch_size: 16`
- Input: `TYPE_FP32`, shape `[3, 224, 224]`
- Output: `TYPE_FP32`, shape `[4]`
- `instance_group`: `KIND_GPU`
- Dynamic batching: `preferred_batch_size: [4, 8, 16]`, `max_queue_delay_microseconds: 1000`
- Dynamic batching combines incoming requests into larger GPU-friendly batches when timing allows.
- Queue delay cap (1000µs) avoids unbounded waiting and protects latency.
- GPU instance group ensures inference executes on CUDA hardware.
- Batch size cap (16) sets a practical upper limit aligned with memory and latency trade-offs.
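As an optional sanity check (not part of the repository), the loaded configuration can be read back from Triton's HTTP API on port 8000; the dictionary keys follow Triton's model-config JSON schema.

```python
# Optional sanity check: confirm Triton loaded resnet_medical with the expected
# batching settings, via the tritonclient HTTP API on port 8000.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print(client.is_model_ready("resnet_medical"))   # True once the model is loaded
cfg = client.get_model_config("resnet_medical")
print(cfg["max_batch_size"])                     # expect 16
print(cfg["dynamic_batching"])                   # preferred sizes and queue delay
```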
| Method | Path | Description |
|---|---|---|
| POST | `/predict` | Accepts an uploaded image and returns prediction/confidence/latency |
| GET | `/health` | Health probe endpoint |
| GET | `/metrics` | Prometheus metrics exposed by instrumentation |
Example request:

```bash
curl -X POST "http://localhost:8002/predict" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "image=@tests/test.jpg"
```

Example response:

```json
{
  "prediction": "Glioma",
  "confidence": 0.93,
  "latency_ms": 11.7
}
```

Notes:
- Exact class and confidence depend on input image and model state.
- `latency_ms` is computed at the API level and includes preprocessing + Triton request + postprocessing.
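For convenience, a Python equivalent of the curl example (using the `requests` library, which may not be listed in `requirements.txt`) looks like this:

```python
# Python equivalent of the curl example above.
import requests

with open("tests/test.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8002/predict",
        files={"image": ("test.jpg", f, "image/jpeg")},
        timeout=30,
    )
resp.raise_for_status()
print(resp.json())  # {"prediction": ..., "confidence": ..., "latency_ms": ...}
```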
Load testing was executed with k6 for 30 seconds across concurrency levels.
| Virtual Users | Avg Latency | P95 | Throughput (RPS) | Failures |
|---|---|---|---|---|
| 10 | 62ms | 80ms | 158 rps | 0% |
| 20 | 163ms | 134ms | 122 rps | 0% |
| 50 | 307ms | 334ms | 162 rps | 0% |
| 100 | 615ms | 641ms | 161 rps | 0% |
```bash
k6 run --vus 50 --duration 30s tests/load_test.js
```

Observations:
- Stable operation through 100 concurrent users.
- No failed requests in the provided test runs.
- Throughput plateaus around ~160 RPS.
- Latency grows roughly in proportion to concurrency once throughput saturates (by Little's law, ~50/160 s ≈ 310 ms and ~100/160 s ≈ 625 ms, matching the measured averages).
Uvicorn worker counts were compared for API throughput impact.
| Workers | Observed Outcome |
|---|---|
| 4 | Best result (~160 rps) |
| 6 | No improvement |
| 8 | Slight degradation |
Interpretation:
- Increasing workers beyond 4 added scheduling/context-switch overhead.
- Optimal worker count depended on CPU characteristics and API workload, not just raw core count.
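For reference, one way to pin the worker count from Python is sketched below; the repository may instead pass `--workers` on the uvicorn command line or in the compose/Dockerfile command, and `app.main:app` is an assumed import path.

```python
# Hedged sketch: launching the API with the empirically best worker count.
# "app.main:app" is an assumed import path; the actual launch command may differ.
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app.main:app", host="0.0.0.0", port=8002, workers=4)
```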
Observed throughput ceiling: ~160 requests/sec.
At higher load:
- Throughput remains mostly flat.
- Latency continues to increase.
- Failures stay at 0% in tested scenarios.
Likely bottleneck in current setup:
- API layer and request handling overhead appear to saturate before GPU inference becomes the dominant limit.
This is consistent with a system where model execution is efficient, but ingress, serialization, and per-request CPU work cap end-to-end throughput.
First inference request latency was significantly higher (~2.6s), while warm requests stabilized around 7–14ms.
Expected causes:
- CUDA context initialization
- ONNX Runtime kernel loading
- Initial GPU memory allocation
Production mitigation options:
- Send warm-up requests after deployment.
- Keep service warm with periodic probes.
- Hide startup readiness behind orchestration health gates.
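A minimal sketch of the first option (warm-up requests), assuming the API is reachable on `localhost:8002` and `tests/test.jpg` exists, could look like this:

```python
# Minimal warm-up sketch: send a few requests right after deployment so the
# first real user does not pay the ~2.6s cold-start cost.
import requests

def warm_up(n: int = 3, url: str = "http://localhost:8002/predict") -> None:
    with open("tests/test.jpg", "rb") as f:
        payload = f.read()
    for i in range(n):
        resp = requests.post(
            url, files={"image": ("warmup.jpg", payload, "image/jpeg")}, timeout=60
        )
        print(f"warm-up {i + 1}: {resp.status_code}, {resp.json().get('latency_ms')} ms")

if __name__ == "__main__":
    warm_up()
```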
The architecture supports incremental scaling at multiple layers:
- API horizontal scaling
  - Replicate FastAPI instances behind a load balancer.
  - Keep stateless request handlers for easy scale-out.
- Inference scaling
  - Deploy Triton per GPU and route traffic by model/service.
  - Tune dynamic batching parameters for target latency SLOs.
- Orchestration migration path
  - Move from Docker Compose to Kubernetes for autoscaling, rolling updates, and richer observability.
- Operational scaling
  - Separate API and Triton resource limits to avoid noisy-neighbor contention.
- Docker Engine + Docker Compose
- NVIDIA GPU + compatible NVIDIA driver
- NVIDIA Container Toolkit (for GPU access inside containers)
Clone the repository:

```bash
git clone <your-repo-url>
cd Medical-AI-Inference
```

Set up a local Python environment and install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate   # Linux/macOS
pip install -r requirements.txt
```

Build and start the services:

```bash
docker compose up --build
```

This launches:

- `triton` on `8000` (HTTP infer) and `8001` (gRPC)
- `api` on `8002`

Verify health, run a prediction, and load test:

```bash
curl http://localhost:8002/health
curl -X POST "http://localhost:8002/predict" -F "image=@tests/test.jpg"
k6 run --vus 50 --duration 30s tests/load_test.js
```

Compose notes:

- `triton` mounts `./model_repository:/models` and starts with `--model-repository=/models`.
- `api` depends on `triton` and forwards inference requests to `triton:8000`.
- GPU device reservation is declared in compose for the Triton service.
Benchmark environment:
- CPU: 16 Cores / 24 Logical Processors
- GPU: NVIDIA RTX 4060 Laptop GPU (8GB VRAM)
- RAM: 16GB DDR5
- OS: Windows 11 + Docker (WSL2 backend)
- Separated API and inference server
  - FastAPI handles HTTP concerns; Triton specializes in model execution.
- ONNX + Triton runtime
  - Standardized model artifact (ONNX) decoupled from the production inference runtime.
- Dynamic batching enabled
  - Better GPU utilization under concurrent traffic without rewriting API logic.
- Multi-worker API process model
  - Tuned worker count empirically for best observed throughput.
- Container-first deployment
  - Reproducible setup and a clear path to orchestrated environments.
- Load-test-driven analysis
  - Performance claims are backed by measured runs instead of assumptions.
- Add versioned model management and canary rollout strategy.
- Add request authentication and rate limiting.
- Add structured tracing (OpenTelemetry) across API ↔ Triton boundaries.
- Add automated CI benchmarks and regression thresholds.
- Expand monitoring dashboards (latency percentiles, GPU utilization, queue depth).
- Introduce Kubernetes deployment manifests/Helm charts for production rollout.
- This project focuses on scalable model serving and inference infrastructure.
- Model training is intentionally out of scope.
- Performance metrics were gathered on a single-node GPU-enabled environment.
- Throughput and latency characteristics are documented based on empirical load testing.



