GPS Reliability API: Production Transformation Roadmap

Created: 2025-11-30
Status: In Progress
Target: Production-ready SaaS (8.5+/10 readiness score)


Executive Summary

Current State: Production-ready MVP (Score: ~7.5/10)
Session Progress: All P0, P1, P2, and P3 issues RESOLVED
Remaining: Production deployment


Critical Issues Identified

1. Blocking I/O in API Endpoints (app/api/endpoints.py)

Problem: Lines 59, 193, 297, 429, 534 - All endpoint functions are def (synchronous), not async def.

# Current (BLOCKING):
@router.get("/gps-reliability")
def get_gps_reliability(...):  # Sync endpoint: runs in FastAPI's threadpool
    weather = realtime_service.get_latest_weather()  # Sync I/O holds a worker thread

Impact: Under load, slow NOAA fetches tie up the limited threadpool, stalling concurrent requests.


2. Database Engine Creation Per Task (app/worker/tasks.py)

Problem: Lines 38-40 - Every Celery task creates a new database engine:

# Current (CONNECTION EXHAUSTION):
@celery_app.task
def log_api_usage(...):
    engine = create_engine(sync_url)  # New engine per task!

Impact: 100 concurrent tasks = 100 separate engines, rapidly exhausting the Postgres connection limit.


3. Synchronous DB Write on Every Read (app/api/deps.py)

Problem: Lines 73-77 - Usage tracking writes on every API request:

api_key.calls_this_month += 1  # Write on every read
await session.commit()  # Blocks response

Impact: Every API request triggers a database write, limiting throughput.


4. Global Singletons (app/api/deps.py)

Problem: Lines 23-29 - Untestable global state:

_prediction_service = PredictionService()  # Global singleton

Impact: Mocks cannot be injected for unit tests, and shared mutable state risks cross-request interference.


Priority Matrix

| Priority | Issue                     | Severity | Effort | Status |
|----------|---------------------------|----------|--------|--------|
| P0       | Blocking I/O in endpoints | Critical | Medium | DONE   |
| P0       | DB engine per Celery task | Critical | Low    | DONE   |
| P0       | Sync usage tracking       | High     | Medium | DONE   |
| P1       | Global singletons         | High     | Medium | DONE   |
| P1       | No test suite             | High     | High   | DONE   |
| P1       | CORS ["*"]                | High     | Low    | DONE   |
| P2       | Tier enforcement          | Medium   | Medium | DONE   |
| P2       | Job service persistence   | Medium   | Medium | DONE   |
| P3       | Observability stack       | Medium   | High   | DONE   |

Phase 1: Critical Fixes (Week 1)

1.1 Convert Endpoints to Async

Convert all endpoint functions from def to async def and ensure all I/O-bound service methods are also async.

Files to modify:

  • app/api/endpoints.py - All 5 endpoint functions
  • app/services/realtime_service.py - HTTP calls to NOAA
  • app/services/ionosphere_service.py - HTTP calls to NOAA
  • app/services/radiation_service.py - HTTP calls to NOAA
  • app/services/prediction_service.py - Model loading (can stay sync, CPU-bound)
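Once converted, endpoints can fetch independent NOAA data sources concurrently. A minimal sketch of the pattern (the service and field names here are illustrative stand-ins, with asyncio.sleep simulating network latency; the real services would use httpx.AsyncClient):

```python
import asyncio

# Hypothetical stand-ins for the NOAA-backed service calls.
async def fetch_weather():
    await asyncio.sleep(0.01)  # simulated network latency
    return {"kp_index": 3.0}

async def fetch_tec():
    await asyncio.sleep(0.01)
    return {"tec": 25.4}

async def get_gps_reliability():
    # async def + gather: both fetches run concurrently instead of
    # serially, and neither blocks the event loop while waiting.
    weather, tec = await asyncio.gather(fetch_weather(), fetch_tec())
    return {"weather": weather, "ionosphere": tec}

result = asyncio.run(get_gps_reliability())
```

With real 200ms NOAA round-trips, gather makes the two fetches cost one round-trip instead of two.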

1.2 Fix Celery Database Connection Pooling

Create shared connection pool initialized at worker startup:

# app/worker/db.py (NEW FILE)
from contextlib import contextmanager

from celery.signals import worker_process_init
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from app.core.config import settings  # adjust to the project's settings module

_engine = None
_SessionLocal = None

@worker_process_init.connect
def init_worker_db(**kwargs):
    global _engine, _SessionLocal
    sync_url = settings.DATABASE_URL.replace("+asyncpg", "")
    _engine = create_engine(sync_url, pool_size=5, max_overflow=10, pool_pre_ping=True)
    _SessionLocal = sessionmaker(bind=_engine)

@contextmanager
def get_session():
    if _SessionLocal is None:
        raise RuntimeError("Worker DB not initialized")
    session = _SessionLocal()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

1.3 Decouple Usage Tracking

Push usage events to Redis list, batch process every 10 seconds:

  1. Modify get_valid_api_key() to push to Redis instead of DB write
  2. Create new Celery task flush_usage_events() for batch processing
  3. Add to Celery Beat schedule (every 10 seconds)
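The steps above can be sketched with an in-memory stand-in for the Redis list (the real code would use redis-py's lpush on the request path and a drain in the Celery task; all names below are illustrative):

```python
import json
import time
from collections import deque

# Stand-in for the Redis list; appendleft/pop mirrors LPUSH/RPOP semantics.
usage_queue = deque()

def track_usage(api_key_id: str, endpoint: str) -> None:
    """Called from get_valid_api_key(): O(1) append, no DB write in the request path."""
    usage_queue.appendleft(json.dumps({
        "api_key_id": api_key_id,
        "endpoint": endpoint,
        "ts": time.time(),
    }))

def flush_usage_events() -> dict:
    """Celery Beat task (every 10s): drain the queue and aggregate
    per-key counts so many requests become one batched DB update."""
    counts: dict = {}
    while usage_queue:
        event = json.loads(usage_queue.pop())
        counts[event["api_key_id"]] = counts.get(event["api_key_id"], 0) + 1
    # In the real task: UPDATE api_keys SET calls_this_month += n per key.
    return counts

track_usage("key-1", "/gps-reliability")
track_usage("key-1", "/gps-reliability")
track_usage("key-2", "/forecast")
counts = flush_usage_events()
```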

1.4 Refactor Singletons

Convert global singletons to proper dependency injection with @lru_cache for stateless services.
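A minimal sketch of the @lru_cache DI pattern, using hypothetical stand-in service classes:

```python
from functools import lru_cache

# Hypothetical stand-ins for the real service classes.
class IonosphereService:
    pass

class RealtimeService:
    def __init__(self, ionosphere: IonosphereService):
        self.ionosphere = ionosphere

@lru_cache
def get_ionosphere_service() -> IonosphereService:
    # Cached factory: one shared instance per process, but no module-level
    # global, so tests can replace it cleanly.
    return IonosphereService()

@lru_cache
def get_realtime_service() -> RealtimeService:
    return RealtimeService(ionosphere=get_ionosphere_service())

a = get_realtime_service()
b = get_realtime_service()
```

When the factories are wired through Depends(...), tests can swap implementations via FastAPI's app.dependency_overrides instead of monkeypatching module globals.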


Phase 2: Testing & Security (Week 2)

2.1 Test Infrastructure

tests/
├── conftest.py           # Shared fixtures
├── unit/
│   ├── test_prediction_service.py
│   ├── test_geomagnetic_service.py
│   └── test_risk_calculation.py
├── integration/
│   ├── test_auth_flow.py
│   ├── test_api_endpoints.py
│   └── test_rate_limiting.py
└── fixtures/
    ├── noaa_responses.json
    └── sample_api_keys.py

2.2 Security Hardening

  • Fix CORS to whitelist specific origins
  • Add security headers middleware
  • Implement input validation on all endpoints
  • Remove legacy dev-secret-key references
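A framework-agnostic sketch of the security headers middleware (the header set is an illustrative baseline, not the project's final list; with FastAPI it would be registered via app.add_middleware):

```python
import asyncio

SECURITY_HEADERS = [
    (b"x-content-type-options", b"nosniff"),
    (b"x-frame-options", b"DENY"),
    (b"strict-transport-security", b"max-age=63072000; includeSubDomains"),
]

class SecurityHeadersMiddleware:
    """Minimal ASGI middleware that appends security headers to every response."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                message["headers"] = list(message.get("headers", [])) + SECURITY_HEADERS
            await send(message)
        await self.app(scope, receive, send_with_headers)

# Dummy downstream app so the middleware can be exercised without a server.
async def demo_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})

async def _drive():
    sent = []

    async def capture(message):
        sent.append(message)

    await SecurityHeadersMiddleware(demo_app)({"type": "http"}, None, capture)
    return sent

messages = asyncio.run(_drive())
```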

Phase 3: Scalability (Weeks 3-4)

3.1 Response Caching

Implement cache-aside pattern with Redis for GPS reliability responses.
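A minimal sketch of the cache-aside flow, with an in-memory dict standing in for Redis and hypothetical key format and TTL (the real code would use redis-py's get/setex):

```python
import time

# In-memory stand-in for Redis: key -> (value, expiry time).
_cache: dict = {}
TTL_SECONDS = 60  # illustrative: space-weather conditions change slowly

def cache_get(key):
    entry = _cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]
    return None

def cache_set(key, value, ttl=TTL_SECONDS):
    _cache[key] = (value, time.monotonic() + ttl)

calls = {"n": 0}

def compute_gps_reliability(lat: float, lon: float) -> dict:
    calls["n"] += 1  # stands in for the expensive NOAA fetch + model run
    return {"lat": lat, "lon": lon, "reliability": 0.97}

def get_gps_reliability_cached(lat: float, lon: float) -> dict:
    # Cache-aside: try the cache first; on a miss, compute and backfill.
    key = f"gps:{lat:.2f}:{lon:.2f}"
    cached = cache_get(key)
    if cached is not None:
        return cached
    result = compute_gps_reliability(lat, lon)
    cache_set(key, result)
    return result

first = get_gps_reliability_cached(40.71, -74.01)
second = get_gps_reliability_cached(40.71, -74.01)
```

Rounding coordinates in the cache key (here to two decimals) is one way to make nearby requests share an entry; the precision is a tuning choice.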

3.2 Token Bucket Rate Limiting

Replace slowapi with Redis-backed token bucket for burst protection.
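The token bucket itself is simple; sketched here in-process (in production the refill-and-take step would run atomically in Redis, for example via a Lua script, so all API replicas share one bucket per key):

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    capacity: float     # burst size
    refill_rate: float  # tokens per second (sustained rate)
    tokens: float = 0.0
    last_refill: float = 0.0

    def __post_init__(self):
        self.tokens = self.capacity  # start full: allow an initial burst
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_rate,
        )
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Illustrative limits: a 5-request burst, then 1 request/second sustained.
bucket = TokenBucket(capacity=5, refill_rate=1.0)
results = [bucket.allow() for _ in range(6)]
```

Unlike a fixed window, the bucket permits short bursts up to capacity while enforcing the sustained rate, which matches per-tier burst protection.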

3.3 Database Optimization

  • Add indices on frequently queried columns
  • Configure connection pooling for production load
  • Profile and optimize slow queries

Phase 4: Infrastructure (Weeks 5-6)

4.1 PaaS Deployment

Recommended stack (avoid K8s initially):

| Service  | Provider         | Purpose                   |
|----------|------------------|---------------------------|
| API      | Render / Railway | FastAPI (3 replicas)      |
| Frontend | Vercel           | Next.js edge              |
| Database | Supabase / Neon  | Managed Postgres          |
| Cache    | Upstash          | Managed Redis             |
| Worker   | Render           | Background Celery workers |

4.2 Observability

  • Structured logging (JSON format)
  • Prometheus metrics
  • Sentry error tracking
  • Health check endpoints
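The health check can aggregate dependency probes into a single status. A minimal sketch with illustrative check names (in the app, check_redis would ping Redis with a short timeout and check_models_loaded would inspect the prediction service):

```python
def check_redis() -> bool:
    # Real code: redis ping with a short timeout, False on error.
    return True

def check_models_loaded() -> bool:
    # Real code: verify the ML models have been loaded into memory.
    return True

def health() -> dict:
    """Backs the /health endpoint: 'healthy' only if every dependency is up."""
    checks = {"redis": check_redis(), "models": check_models_loaded()}
    status = "healthy" if all(checks.values()) else "degraded"
    return {"status": status, "checks": checks}

report = health()
```

Returning "degraded" rather than failing outright lets load balancers keep routing while alerting on the specific failed dependency.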

Success Criteria

| Metric                   | Target          |
|--------------------------|-----------------|
| API uptime               | 99.9%           |
| p50 latency              | < 100ms         |
| p99 latency              | < 500ms         |
| Test coverage            | >= 80%          |
| Security vulnerabilities | 0 critical/high |

Progress Log

2025-11-30 - Phase 1 Critical Fixes Complete

  • Created production roadmap
  • Identified critical blocking I/O issues
  • COMPLETED: Convert all API endpoints to async def
    • app/api/endpoints.py - All 5 endpoints now async
    • Concurrent I/O with asyncio.gather() for weather + TEC fetches
  • COMPLETED: Make services async
    • app/services/realtime_service.py - Using httpx.AsyncClient
    • app/services/ionosphere_service.py - Using httpx.AsyncClient
    • app/services/radiation_service.py - Using httpx.AsyncClient
  • COMPLETED: Fix Celery DB connection pooling
    • Created app/worker/db.py with shared connection pool
    • Uses @worker_process_init signal for pool initialization
    • Updated all tasks to use get_session() context manager
  • COMPLETED: Fix CORS configuration
    • Environment-aware origin whitelist (dev vs prod)
    • Restricted methods and headers
  • COMPLETED: Fix deprecated datetime.utcnow() calls
    • Replaced all with datetime.now(timezone.utc)
    • Created helper utc_now() functions where needed
  • COMPLETED: Decouple usage tracking to Redis
    • Usage events pushed to Redis list (non-blocking)
    • New flush_usage_events Celery task processes every 10s
    • Graceful fallback to sync DB update if Redis unavailable
  • COMPLETED: Refactor singletons to proper DI
    • @lru_cache decorators for stateless services
    • get_realtime_service() accepts ionosphere dependency
    • Enables proper mocking via dependency_overrides
  • COMPLETED: Create test infrastructure
    • tests/conftest.py with shared fixtures
    • tests/unit/ with 37 passing tests
    • tests/integration/ marked for DB-dependent tests
    • pytest configuration in pyproject.toml
  • COMPLETED: Implement tier-based rate limiting
    • Created app/core/rate_limit.py with Redis-backed sliding window
    • Rate limits based on user tier (free: 60/min, business: 3000/min)
    • Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
    • Graceful fallback to local rate limiting if Redis unavailable
    • Replaced slowapi decorators with dependency injection

Session Summary (2025-11-30)

Commits Made

  1. f174f3f - Refactor: Convert to async I/O and fix critical production issues
  2. 74612c6 - Decouple usage tracking to Redis + refactor singletons to DI
  3. b1fd551 - Add pytest test infrastructure with 37 passing unit tests
  4. 5cfc7df - Implement tier-based rate limiting with Redis backend

Files Created

  • app/worker/db.py - Shared Celery worker DB connection pool
  • app/core/rate_limit.py - Tier-based rate limiting with Redis backend
  • tests/conftest.py - Shared pytest fixtures
  • tests/unit/test_geomagnetic_service.py - 23 geomagnetic calculation tests
  • tests/unit/test_risk_calculation.py - 14 risk scoring tests
  • tests/integration/test_api_endpoints.py - API endpoint tests
  • tests/integration/test_auth_flow.py - Authentication flow tests

Files Modified

  • app/api/endpoints.py - All endpoints now async, tier-based rate limiting
  • app/api/deps.py - DI with @lru_cache, Redis usage tracking
  • app/services/realtime_service.py - Async with httpx
  • app/services/ionosphere_service.py - Async with httpx
  • app/services/radiation_service.py - Async with httpx
  • app/worker/tasks.py - Uses shared connection pool
  • app/worker/celery_app.py - Added usage flush task schedule
  • app/core/redis.py - Added lpush/lrange_and_trim for queues
  • app/main.py - Removed slowapi, custom rate limit handler
  • pyproject.toml - Added pytest config and dev dependencies

Test Results

  • 54 unit tests passing (geomagnetic + risk calculation + job service)
  • Integration tests require database (marked for CI/CD)

Next Steps

  1. Job service persistence (use Redis or DB for job state) DONE
  2. Observability stack (structured logging, Prometheus, Sentry) DONE
  3. Deploy to production environment

2025-11-30 - Observability Stack Complete

  • Structured JSON Logging: Created app/core/logging.py with correlation IDs
    • JSON format in production, human-readable in development
    • Request logging middleware with timing
    • X-Request-ID header propagation
  • Prometheus Metrics: Created app/core/metrics.py
    • Request count, latency histograms
    • Business metrics (GPS reliability requests)
    • Service health gauges (Redis, models)
    • /metrics endpoint for Prometheus scraping
  • Sentry Integration: Configured in app/main.py
    • Error tracking with environment context
    • Traces and profiles sampling
    • Enabled via SENTRY_DSN environment variable
  • Enhanced Health Check: /health now checks dependencies
    • Redis connection status
    • ML models loaded status
    • Returns "healthy" or "degraded"
  • Dependencies Added: prometheus-client, sentry-sdk[fastapi]