GPS Reliability API: Production Transformation Roadmap

Created: 2025-11-30
Status: In Progress
Target: Production-ready SaaS (8.5+/10 readiness score)


Executive Summary

Current State: Production-ready MVP (Score: ~7.5/10)
Session Progress: All P0, P1, P2, and P3 issues RESOLVED
Remaining: Production deployment


Critical Issues Identified

1. Blocking I/O in API Endpoints (app/api/endpoints.py)

Problem: Lines 59, 193, 297, 429, 534 - All endpoint functions are def (synchronous), not async def.

# Current (BLOCKING):
@router.get("/gps-reliability")
def get_gps_reliability(...):  # Sync endpoint: runs in FastAPI's threadpool
    weather = realtime_service.get_latest_weather()  # Sync I/O holds a worker thread

Impact: Under load, slow NOAA fetches tie up the limited threadpool, stalling concurrent requests.


2. Database Engine Creation Per Task (app/worker/tasks.py)

Problem: Lines 38-40 - Every Celery task creates a new database engine:

# Current (CONNECTION EXHAUSTION):
@celery_app.task
def log_api_usage(...):
    engine = create_engine(sync_url)  # New engine per task!

Impact: 100 concurrent tasks = 100 separate engines, rapidly exhausting the Postgres connection limit.


3. Synchronous DB Write on Every Read (app/api/deps.py)

Problem: Lines 73-77 - Usage tracking writes on every API request:

api_key.calls_this_month += 1  # Write on every read
await session.commit()  # Blocks response

Impact: Every API request triggers a database write, limiting throughput.


4. Global Singletons (app/api/deps.py)

Problem: Lines 23-29 - Untestable global state:

_prediction_service = PredictionService()  # Global singleton

Impact: Mocks cannot be injected for unit tests, and shared mutable state risks cross-request interference.


Priority Matrix

| Priority | Issue                     | Severity | Effort | Status |
|----------|---------------------------|----------|--------|--------|
| P0       | Blocking I/O in endpoints | Critical | Medium | DONE   |
| P0       | DB engine per Celery task | Critical | Low    | DONE   |
| P0       | Sync usage tracking       | High     | Medium | DONE   |
| P1       | Global singletons         | High     | Medium | DONE   |
| P1       | No test suite             | High     | High   | DONE   |
| P1       | CORS ["*"]                | High     | Low    | DONE   |
| P2       | Tier enforcement          | Medium   | Medium | DONE   |
| P2       | Job service persistence   | Medium   | Medium | DONE   |
| P3       | Observability stack       | Medium   | High   | DONE   |

Phase 1: Critical Fixes (Week 1)

1.1 Convert Endpoints to Async

Convert all endpoint functions from def to async def and ensure all I/O-bound service methods are also async.

Files to modify:

  • app/api/endpoints.py - All 5 endpoint functions
  • app/services/realtime_service.py - HTTP calls to NOAA
  • app/services/ionosphere_service.py - HTTP calls to NOAA
  • app/services/radiation_service.py - HTTP calls to NOAA
  • app/services/prediction_service.py - Model loading (can stay sync, CPU-bound)
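Once converted, endpoints can fetch independent NOAA data sources concurrently. A minimal sketch of the pattern (the service and field names here are illustrative stand-ins, with asyncio.sleep simulating network latency; the real services would use httpx.AsyncClient):

```python
import asyncio

# Hypothetical stand-ins for the NOAA-backed service calls.
async def fetch_weather():
    await asyncio.sleep(0.01)  # simulated network latency
    return {"kp_index": 3.0}

async def fetch_tec():
    await asyncio.sleep(0.01)
    return {"tec": 25.4}

async def get_gps_reliability():
    # async def + gather: both fetches run concurrently instead of
    # serially, and neither blocks the event loop while waiting.
    weather, tec = await asyncio.gather(fetch_weather(), fetch_tec())
    return {"weather": weather, "ionosphere": tec}

result = asyncio.run(get_gps_reliability())
```

With real 200ms NOAA round-trips, gather makes the two fetches cost one round-trip instead of two.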

1.2 Fix Celery Database Connection Pooling

Create shared connection pool initialized at worker startup:

# app/worker/db.py (NEW FILE)
from contextlib import contextmanager

from celery.signals import worker_process_init
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from app.core.config import settings  # adjust to the project's settings module

_engine = None
_SessionLocal = None

@worker_process_init.connect
def init_worker_db(**kwargs):
    global _engine, _SessionLocal
    sync_url = settings.DATABASE_URL.replace("+asyncpg", "")
    _engine = create_engine(sync_url, pool_size=5, max_overflow=10, pool_pre_ping=True)
    _SessionLocal = sessionmaker(bind=_engine)

@contextmanager
def get_session():
    if _SessionLocal is None:
        raise RuntimeError("Worker DB not initialized")
    session = _SessionLocal()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

1.3 Decouple Usage Tracking

Push usage events to Redis list, batch process every 10 seconds:

  1. Modify get_valid_api_key() to push to Redis instead of DB write
  2. Create new Celery task flush_usage_events() for batch processing
  3. Add to Celery Beat schedule (every 10 seconds)
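The steps above can be sketched with an in-memory stand-in for the Redis list (the real code would use redis-py's lpush on the request path and a drain in the Celery task; all names below are illustrative):

```python
import json
import time
from collections import deque

# Stand-in for the Redis list; appendleft/pop mirrors LPUSH/RPOP semantics.
usage_queue = deque()

def track_usage(api_key_id: str, endpoint: str) -> None:
    """Called from get_valid_api_key(): O(1) append, no DB write in the request path."""
    usage_queue.appendleft(json.dumps({
        "api_key_id": api_key_id,
        "endpoint": endpoint,
        "ts": time.time(),
    }))

def flush_usage_events() -> dict:
    """Celery Beat task (every 10s): drain the queue and aggregate
    per-key counts so many requests become one batched DB update."""
    counts: dict = {}
    while usage_queue:
        event = json.loads(usage_queue.pop())
        counts[event["api_key_id"]] = counts.get(event["api_key_id"], 0) + 1
    # In the real task: UPDATE api_keys SET calls_this_month += n per key.
    return counts

track_usage("key-1", "/gps-reliability")
track_usage("key-1", "/gps-reliability")
track_usage("key-2", "/forecast")
counts = flush_usage_events()
```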

1.4 Refactor Singletons

Convert global singletons to proper dependency injection with @lru_cache for stateless services.
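A minimal sketch of the @lru_cache DI pattern, using hypothetical stand-in service classes:

```python
from functools import lru_cache

# Hypothetical stand-ins for the real service classes.
class IonosphereService:
    pass

class RealtimeService:
    def __init__(self, ionosphere: IonosphereService):
        self.ionosphere = ionosphere

@lru_cache
def get_ionosphere_service() -> IonosphereService:
    # Cached factory: one shared instance per process, but no module-level
    # global, so tests can replace it cleanly.
    return IonosphereService()

@lru_cache
def get_realtime_service() -> RealtimeService:
    return RealtimeService(ionosphere=get_ionosphere_service())

a = get_realtime_service()
b = get_realtime_service()
```

When the factories are wired through Depends(...), tests can swap implementations via FastAPI's app.dependency_overrides instead of monkeypatching module globals.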


Phase 2: Testing & Security (Week 2)

2.1 Test Infrastructure

tests/
├── conftest.py           # Shared fixtures
├── unit/
│   ├── test_prediction_service.py
│   ├── test_geomagnetic_service.py
│   └── test_risk_calculation.py
├── integration/
│   ├── test_auth_flow.py
│   ├── test_api_endpoints.py
│   └── test_rate_limiting.py
└── fixtures/
    ├── noaa_responses.json
    └── sample_api_keys.py

2.2 Security Hardening

  • Fix CORS to whitelist specific origins
  • Add security headers middleware
  • Implement input validation on all endpoints
  • Remove legacy dev-secret-key references
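A framework-agnostic sketch of the security headers middleware (the header set is an illustrative baseline, not the project's final list; with FastAPI it would be registered via app.add_middleware):

```python
import asyncio

SECURITY_HEADERS = [
    (b"x-content-type-options", b"nosniff"),
    (b"x-frame-options", b"DENY"),
    (b"strict-transport-security", b"max-age=63072000; includeSubDomains"),
]

class SecurityHeadersMiddleware:
    """Minimal ASGI middleware that appends security headers to every response."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                message["headers"] = list(message.get("headers", [])) + SECURITY_HEADERS
            await send(message)
        await self.app(scope, receive, send_with_headers)

# Dummy downstream app so the middleware can be exercised without a server.
async def demo_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})

async def _drive():
    sent = []

    async def capture(message):
        sent.append(message)

    await SecurityHeadersMiddleware(demo_app)({"type": "http"}, None, capture)
    return sent

messages = asyncio.run(_drive())
```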

Phase 3: Scalability (Weeks 3-4)

3.1 Response Caching

Implement cache-aside pattern with Redis for GPS reliability responses.
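A minimal sketch of the cache-aside flow, with an in-memory dict standing in for Redis and hypothetical key format and TTL (the real code would use redis-py's get/setex):

```python
import time

# In-memory stand-in for Redis: key -> (value, expiry time).
_cache: dict = {}
TTL_SECONDS = 60  # illustrative: space-weather conditions change slowly

def cache_get(key):
    entry = _cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]
    return None

def cache_set(key, value, ttl=TTL_SECONDS):
    _cache[key] = (value, time.monotonic() + ttl)

calls = {"n": 0}

def compute_gps_reliability(lat: float, lon: float) -> dict:
    calls["n"] += 1  # stands in for the expensive NOAA fetch + model run
    return {"lat": lat, "lon": lon, "reliability": 0.97}

def get_gps_reliability_cached(lat: float, lon: float) -> dict:
    # Cache-aside: try the cache first; on a miss, compute and backfill.
    key = f"gps:{lat:.2f}:{lon:.2f}"
    cached = cache_get(key)
    if cached is not None:
        return cached
    result = compute_gps_reliability(lat, lon)
    cache_set(key, result)
    return result

first = get_gps_reliability_cached(40.71, -74.01)
second = get_gps_reliability_cached(40.71, -74.01)
```

Rounding coordinates in the cache key (here to two decimals) is one way to make nearby requests share an entry; the precision is a tuning choice.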

3.2 Token Bucket Rate Limiting

Replace slowapi with Redis-backed token bucket for burst protection.
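The token bucket itself is simple; sketched here in-process (in production the refill-and-take step would run atomically in Redis, for example via a Lua script, so all API replicas share one bucket per key):

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    capacity: float     # burst size
    refill_rate: float  # tokens per second (sustained rate)
    tokens: float = 0.0
    last_refill: float = 0.0

    def __post_init__(self):
        self.tokens = self.capacity  # start full: allow an initial burst
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_rate,
        )
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Illustrative limits: a 5-request burst, then 1 request/second sustained.
bucket = TokenBucket(capacity=5, refill_rate=1.0)
results = [bucket.allow() for _ in range(6)]
```

Unlike a fixed window, the bucket permits short bursts up to capacity while enforcing the sustained rate, which matches per-tier burst protection.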

3.3 Database Optimization

  • Add indices on frequently queried columns
  • Configure connection pooling for production load
  • Profile and optimize slow queries

Phase 4: Infrastructure (Weeks 5-6)

4.1 PaaS Deployment

Recommended stack (avoid K8s initially):

| Service  | Provider         | Purpose                   |
|----------|------------------|---------------------------|
| API      | Render / Railway | FastAPI (3 replicas)      |
| Frontend | Vercel           | Next.js edge              |
| Database | Supabase / Neon  | Managed Postgres          |
| Cache    | Upstash          | Managed Redis             |
| Worker   | Render           | Background Celery workers |

4.2 Observability

  • Structured logging (JSON format)
  • Prometheus metrics
  • Sentry error tracking
  • Health check endpoints
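The health check can aggregate dependency probes into a single status. A minimal sketch with illustrative check names (in the app, check_redis would ping Redis with a short timeout and check_models_loaded would inspect the prediction service):

```python
def check_redis() -> bool:
    # Real code: redis ping with a short timeout, False on error.
    return True

def check_models_loaded() -> bool:
    # Real code: verify the ML models have been loaded into memory.
    return True

def health() -> dict:
    """Backs the /health endpoint: 'healthy' only if every dependency is up."""
    checks = {"redis": check_redis(), "models": check_models_loaded()}
    status = "healthy" if all(checks.values()) else "degraded"
    return {"status": status, "checks": checks}

report = health()
```

Returning "degraded" rather than failing outright lets load balancers keep routing while alerting on the specific failed dependency.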

Success Criteria

| Metric                   | Target          |
|--------------------------|-----------------|
| API uptime               | 99.9%           |
| p50 latency              | < 100ms         |
| p99 latency              | < 500ms         |
| Test coverage            | >= 80%          |
| Security vulnerabilities | 0 critical/high |

Progress Log

2025-11-30 - Phase 1 Critical Fixes Complete

  • Created production roadmap
  • Identified critical blocking I/O issues
  • COMPLETED: Convert all API endpoints to async def
    • app/api/endpoints.py - All 5 endpoints now async
    • Concurrent I/O with asyncio.gather() for weather + TEC fetches
  • COMPLETED: Make services async
    • app/services/realtime_service.py - Using httpx.AsyncClient
    • app/services/ionosphere_service.py - Using httpx.AsyncClient
    • app/services/radiation_service.py - Using httpx.AsyncClient
  • COMPLETED: Fix Celery DB connection pooling
    • Created app/worker/db.py with shared connection pool
    • Uses @worker_process_init signal for pool initialization
    • Updated all tasks to use get_session() context manager
  • COMPLETED: Fix CORS configuration
    • Environment-aware origin whitelist (dev vs prod)
    • Restricted methods and headers
  • COMPLETED: Fix deprecated datetime.utcnow() calls
    • Replaced all with datetime.now(timezone.utc)
    • Created helper utc_now() functions where needed
  • COMPLETED: Decouple usage tracking to Redis
    • Usage events pushed to Redis list (non-blocking)
    • New flush_usage_events Celery task processes every 10s
    • Graceful fallback to sync DB update if Redis unavailable
  • COMPLETED: Refactor singletons to proper DI
    • @lru_cache decorators for stateless services
    • get_realtime_service() accepts ionosphere dependency
    • Enables proper mocking via dependency_overrides
  • COMPLETED: Create test infrastructure
    • tests/conftest.py with shared fixtures
    • tests/unit/ with 37 passing tests
    • tests/integration/ marked for DB-dependent tests
    • pytest configuration in pyproject.toml
  • COMPLETED: Implement tier-based rate limiting
    • Created app/core/rate_limit.py with Redis-backed sliding window
    • Rate limits based on user tier (free: 60/min, business: 3000/min)
    • Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
    • Graceful fallback to local rate limiting if Redis unavailable
    • Replaced slowapi decorators with dependency injection

Session Summary (2025-11-30)

Commits Made

  1. f174f3f - Refactor: Convert to async I/O and fix critical production issues
  2. 74612c6 - Decouple usage tracking to Redis + refactor singletons to DI
  3. b1fd551 - Add pytest test infrastructure with 37 passing unit tests
  4. 5cfc7df - Implement tier-based rate limiting with Redis backend

Files Created

  • app/worker/db.py - Shared Celery worker DB connection pool
  • app/core/rate_limit.py - Tier-based rate limiting with Redis backend
  • tests/conftest.py - Shared pytest fixtures
  • tests/unit/test_geomagnetic_service.py - 23 geomagnetic calculation tests
  • tests/unit/test_risk_calculation.py - 14 risk scoring tests
  • tests/integration/test_api_endpoints.py - API endpoint tests
  • tests/integration/test_auth_flow.py - Authentication flow tests

Files Modified

  • app/api/endpoints.py - All endpoints now async, tier-based rate limiting
  • app/api/deps.py - DI with @lru_cache, Redis usage tracking
  • app/services/realtime_service.py - Async with httpx
  • app/services/ionosphere_service.py - Async with httpx
  • app/services/radiation_service.py - Async with httpx
  • app/worker/tasks.py - Uses shared connection pool
  • app/worker/celery_app.py - Added usage flush task schedule
  • app/core/redis.py - Added lpush/lrange_and_trim for queues
  • app/main.py - Removed slowapi, custom rate limit handler
  • pyproject.toml - Added pytest config and dev dependencies

Test Results

  • 54 unit tests passing (geomagnetic + risk calculation + job service)
  • Integration tests require database (marked for CI/CD)

Next Steps

  1. Job service persistence (use Redis or DB for job state) DONE
  2. Observability stack (structured logging, Prometheus, Sentry) DONE
  3. Deploy to production environment

2025-11-30 - Observability Stack Complete

  • Structured JSON Logging: Created app/core/logging.py with correlation IDs
    • JSON format in production, human-readable in development
    • Request logging middleware with timing
    • X-Request-ID header propagation
  • Prometheus Metrics: Created app/core/metrics.py
    • Request count, latency histograms
    • Business metrics (GPS reliability requests)
    • Service health gauges (Redis, models)
    • /metrics endpoint for Prometheus scraping
  • Sentry Integration: Configured in app/main.py
    • Error tracking with environment context
    • Traces and profiles sampling
    • Enabled via SENTRY_DSN environment variable
  • Enhanced Health Check: /health now checks dependencies
    • Redis connection status
    • ML models loaded status
    • Returns "healthy" or "degraded"
  • Dependencies Added: prometheus-client, sentry-sdk[fastapi]