Sentinel-AIOps

title	Sentinel
emoji	🔥
colorFrom	pink
colorTo	red
sdk	docker
pinned	false
license	mit
short_description	CI/CD error detection

A production-grade MLOps system that classifies GitHub Actions CI/CD failures in real time using LightGBM, with PSI drift monitoring and a human-in-the-loop feedback loop. Live on Hugging Face Spaces.

Live Demo

Live deployment URL: https://huggingface.co/spaces/Anbu-00001/Sentinel

Visitors to the live Hugging Face Space see a real-time observability dashboard displaying the deployed model version, overall system health, current inference counts, a historical table of recent CI/CD failure classifications, and a live feature drift heatmap.

What It Does

The system provides end-to-end classification of CI/CD pipeline failures by ingesting workflow run events directly from GitHub Webhooks. Incoming payloads are verified via HMAC, transformed into feature vectors, and passed to an inference engine that persists predictions to a SQLite database. These results are immediately visible on a live observability dashboard for operators to review.
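
The HMAC check on incoming payloads can be sketched as follows. This is a minimal illustration, not the project's actual handler: it assumes the signature arrives in GitHub's `X-Hub-Signature-256` header in the form `sha256=<hex digest>`, and the function name and secret value are hypothetical.

```python
import hashlib
import hmac


def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True when the header equals 'sha256=' + HMAC-SHA256(secret, body)."""
    prefix = "sha256="
    if not signature_header.startswith(prefix):
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = signature_header[len(prefix):]
    # compare_digest avoids leaking, through timing, where the digests differ.
    return hmac.compare_digest(expected.encode(), received.encode())


secret = b"example-secret"  # hypothetical value, for illustration only
body = b'{"action": "completed"}'
good = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

print(verify_github_signature(secret, body, good))         # untouched payload
print(verify_github_signature(secret, body + b" ", good))  # modified payload
```

The digest has to be computed over the raw request bytes, before any JSON parsing, because re-serialising the payload can change whitespace or key order and invalidate the signature.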

The machine learning layer uses a LightGBM multiclass classifier trained on an 884-dimension feature matrix to assign each failure to one of 10 classes. An ablation study that removed the TF-IDF text features changed macro F1 by only 0.004, which indicates that the model relies mainly on numerical operational telemetry (such as CPU usage, duration, and retries) rather than on keywords memorized from error message text.

The observability layer continuously monitors model degradation by computing the Population Stability Index (PSI) against a baseline to render a live feature drift heatmap. It provides operators with an inference history table, a dynamic health badge indicating whether retraining is required, and exposes Prometheus-compatible metrics for system health monitoring.
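
For readers unfamiliar with PSI, the sketch below computes it for a single numerical feature. It is a dependency-free illustration under stated assumptions, not the code in models/drift_monitor.py: the equal-width binning, the bin count of 10, and the `eps` floor for empty bins are all choices made here for the example.

```python
import math
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` measured against `baseline`."""
    lo, hi = min(baseline), max(baseline)
    # Equal-width bins over the baseline range; guard against a constant feature.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range are counted in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(baseline), proportions(current)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]  # made-up drifted sample

print(psi(baseline, baseline))  # identical distributions
print(psi(baseline, shifted))   # shifted distribution scores higher
```

Each term `(a - e) * ln(a / e)` is non-negative, so PSI is zero only when the binned proportions match. A commonly quoted rule of thumb treats values above roughly 0.2 to 0.25 as a significant shift; the thresholds this project actually uses are set in config.py and are not reproduced here.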

Architecture

flowchart TB
  subgraph GitHub
    GHA["GitHub Actions\nCI/CD Workflow"]
    WH["Webhook\nPOST /webhook/github"]
    GHA -->|workflow_run event| WH
  end
  subgraph HF_Spaces ["Hugging Face Spaces (Docker)"]
    NGINX["nginx:7860\nReverse Proxy"]
    DASH["FastAPI Dashboard\n:8200 (internal)"]
    DB[("SQLite\nsentinel.db")]
    LGBM["LightGBM\nlgbm_v2.11"]
    NGINX --> DASH
    DASH --> LGBM
    DASH --> DB
  end
  WH -->|HMAC-verified POST| NGINX
  BROWSER["Browser"] -->|HTTPS| NGINX

flowchart LR
  RAW["GitHub Webhook\nPayload"] --> MAP["Feature Mapping\n_map_github_to_features()"]
  MAP --> SCALE["StandardScaler\n6 numerical features"]
  MAP --> HASH["FeatureHasher\n256 dims\nrepo + author"]
  MAP --> TFIDF["TF-IDF\n600 dims\nerror_message"]
  MAP --> OHE["OneHotEncoder\nstage + severity"]
  SCALE & HASH & TFIDF & OHE --> CONCAT["Feature Matrix\n884 dimensions"]
  CONCAT --> LGBM["LightGBM\n120 estimators\nMulticlass (10)"]
  LGBM --> OUT["Prediction + Confidence\n+ Top Features"]
  OUT --> DB[("SQLite")]
  OUT --> DASH["Dashboard UI"]
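
The dimension budget in the second diagram, and the hashing step for repo and author, can be illustrated with a short sketch. The block names and the `hash_tokens` helper are hypothetical. The component names in the diagram match scikit-learn classes, whose FeatureHasher uses MurmurHash3; SHA-256 is substituted here only to keep the example free of dependencies, so the indices it produces will not match the real pipeline.

```python
import hashlib
from typing import Iterable, List

# Block sizes stated in the diagram; the one-hot width is derived, not stated.
BLOCKS = [("scaled_numeric", 6), ("hashed_repo_author", 256), ("tfidf_error_message", 600)]
TOTAL_DIMS = 884
one_hot_dims = TOTAL_DIMS - sum(size for _, size in BLOCKS)


def hash_tokens(tokens: Iterable[str], n_dims: int = 256) -> List[float]:
    """Signed hashing trick: map arbitrary strings into a fixed-width vector."""
    vec = [0.0] * n_dims
    for tok in tokens:
        h = int.from_bytes(hashlib.sha256(tok.encode("utf-8")).digest()[:8], "big")
        idx = h % n_dims                       # which slot the token lands in
        sign = 1.0 if (h >> 63) == 0 else -1.0  # the sign spreads collisions around zero
        vec[idx] += sign
    return vec


vec = hash_tokens(["repo=octo/example", "author=alice"])  # hypothetical token format
print(one_hot_dims, len(vec))
```

The arithmetic implies that the stage and severity one-hot columns account for the remaining 884 - 6 - 256 - 600 = 22 dimensions. Hashing keeps the vector width fixed no matter how many distinct repos or authors appear, at the cost of occasional collisions.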

ML Model

Property	Value
Algorithm	LightGBM (multiclass)
Classes	10 failure types
Feature dimensions	884
Training samples	10,000 (synthetic, per-class signal injection)
Macro F1	0.9007
Macro PR-AUC	0.9541
Ablation ΔF1 (no TF-IDF)	-0.004

The ablation result indicates that the model classifies using operational telemetry (CPU, duration, retries) rather than by memorising error message keywords: removing the TF-IDF features lowers macro F1 by only 0.004.
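
As a reminder of what the macro F1 figure measures, here is a small pure-Python sketch. The labels below are made up for illustration and are unrelated to the reported scores; the function mirrors the usual definition (an unweighted mean of per-class F1) and is not taken from the project's training code.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, for illustration only.
y_true = ["Timeout", "Timeout", "Test Failure", "Test Failure"]
y_pred = ["Timeout", "Test Failure", "Test Failure", "Test Failure"]
print(round(macro_f1(y_true, y_pred), 4))
```

An ablation ΔF1 is then the difference between this score for a model trained without a feature block and the score for the full model, so a value near zero means the removed block contributed little.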

Classes: Build Failure, Configuration Error, Dependency Error, Deployment Failure, Network Error, Permission Error, Resource Exhaustion, Security Scan Failure, Test Failure, Timeout.

Confusion Matrix

Security

Security is enforced across the application lifecycle through HMAC-SHA256 webhook verification to guarantee the authenticity of incoming GitHub payloads. All .joblib model artifacts are cryptographically signed with HMAC to prevent arbitrary code execution vulnerabilities during joblib.load(). API key authentication protects all non-public administrative and data endpoints. Traffic is controlled using rate limiting via slowapi, while a strict 2MB payload size cap prevents resource exhaustion attacks. All cryptographic verifications use timing-safe comparisons via hmac.compare_digest.
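
The artifact-signing idea can be sketched as below. This is an illustration under assumptions, not the contents of models/crypto_sig.py: the helper names are hypothetical, and how the project stores the signature (for example in a sidecar file) is not stated in this README.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sign_file(path: Path, key: bytes) -> str:
    """Hex HMAC-SHA256 over the raw bytes of the artifact."""
    return hmac.new(key, path.read_bytes(), hashlib.sha256).hexdigest()


def verify_file(path: Path, key: bytes, expected_hex: str) -> bool:
    """Timing-safe check that the artifact still matches its recorded signature."""
    actual = sign_file(path, key)
    return hmac.compare_digest(actual.encode(), expected_hex.encode())


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.joblib"        # stand-in file, not a real model
    artifact.write_bytes(b"pretend model bytes")
    key = b"example-signing-key"                 # hypothetical key
    signature = sign_file(artifact, key)

    ok_before = verify_file(artifact, key, signature)
    artifact.write_bytes(b"tampered bytes")      # simulate a modified artifact
    ok_after = verify_file(artifact, key, signature)

print(ok_before, ok_after)
```

The point of the check is ordering: because joblib files are pickle-based and can execute code when loaded, joblib.load() should only be called after verification succeeds.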

Project Structure

Sentinel-AIOps/
├── dashboard/app.py          # FastAPI: webhook receiver, dashboard UI, drift API
├── mcp_server/
│   ├── server.py             # FastMCP inference server + Prometheus metrics
│   └── logic.py              # run_prediction(): feature transform + LightGBM
├── models/
│   ├── preprocess.py         # Feature engineering pipeline
│   ├── train_v2.py           # LightGBM training + F1 assertion gate
│   ├── drift_monitor.py      # PSI + Chi-Square drift computation
│   └── crypto_sig.py         # HMAC-SHA256 artifact signing
├── database/session.py       # SQLAlchemy session setup: Turso (libsql) in prod,
│                             # plain SQLite (WAL mode) in dev/CI
├── sentinel_logging.py       # Centralized JSON logging (all modules)
├── config.py                 # All thresholds and env var loading
├── Dockerfile.hf             # Single-container HF Spaces build
├── supervisord.conf          # Process manager: nginx + uvicorn
├── nginx.hf.conf             # Reverse proxy: 7860 → 8200
├── docker-compose.yml        # Local dev: 3-service compose
└── docker-compose.prod.yml   # Production overlay: named volumes, no ports

API Endpoints

Endpoint	Method	Auth	Description
`/`	GET	None	Dashboard UI
`/health`	GET	None	Health check
`/webhook/github`	POST	HMAC	Receive GitHub workflow_run events
`/api/dashboard`	GET	API Key	Full dashboard JSON payload
`/api/drift`	GET	API Key	Raw PSI drift report
`/api/registry`	GET	API Key	Model registry
`/api/history`	GET	API Key	Last 20 inferences
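
A client call to one of the key-protected endpoints might look like the sketch below. The header name `X-API-Key` is an assumption, since this README does not say how the key is transmitted; the base URL uses the local development port, and the key value is a placeholder.

```python
import urllib.request

BASE_URL = "http://localhost:8200"  # local dev port from the uvicorn command


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Prepare an authenticated GET; the header name is assumed, not confirmed."""
    return urllib.request.Request(
        BASE_URL + path,
        headers={"X-API-Key": api_key},
        method="GET",
    )


req = build_request("/api/history", "example-key")
print(req.get_method(), req.full_url)

# To actually send it against a running instance:
# with urllib.request.urlopen(req, timeout=5) as resp:
#     print(resp.status, resp.read()[:200])
```

Check config.py and .env.example for the real header or query parameter before relying on this.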

Local Development

git clone https://github.com/Anbu-00001/Sentinel-AIOps.git
cd Sentinel-AIOps
pip install -r requirements.txt
cp .env.example .env
# populate .env with generated secrets (see .env.example)
set -a; . ./.env; set +a   # export every variable defined in .env (handles comments and quoted values)
make train
make resign
make test
python -m uvicorn dashboard.app:app --host 0.0.0.0 --port 8200

Deployment

Hugging Face Spaces (current production): Single Docker container running nginx + uvicorn via supervisord. See Dockerfile.hf and full steps in RUNBOOK.md → Hugging Face Spaces Deployment.

TURSO_DATABASE_URL and TURSO_AUTH_TOKEN enable persistent inference history via Turso (libsql). Without these, the Space falls back to ephemeral SQLite and history resets on restart.
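
The fallback behaviour described above can be pictured with a small sketch. The function name and return shape are hypothetical, and treating both variables as required is an assumption; the real selection logic lives in database/session.py.

```python
from typing import Mapping, Optional, Tuple


def select_backend(env: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Pick Turso when both variables are set, otherwise fall back to SQLite."""
    url = env.get("TURSO_DATABASE_URL", "").strip()
    token = env.get("TURSO_AUTH_TOKEN", "").strip()
    if url and token:
        return ("turso", url)
    # No credentials: history lives in a local file and is lost on restart.
    return ("sqlite", None)


print(select_backend({"TURSO_DATABASE_URL": "libsql://example", "TURSO_AUTH_TOKEN": "t"}))
print(select_backend({}))
```

In an application this would typically be called with `os.environ`; passing a plain mapping keeps the function easy to test.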

Local Docker:

make docker-build && make docker-up

Test Suite

Suite	Type	What it covers
`test_logic.py`	Unit	Inference pipeline, feature transforms
`test_api.py`	Integration	All 7 endpoints, auth, HMAC
`test_database.py`	Integration	SQLite WAL, concurrent writes
`test_drift.py`	Unit	PSI thresholds, retrain signal
`test_ablation.py`	ML	F1 with/without TF-IDF features
Total	91 tests	All passing

Known Limitations

Synthetic training data: real-world accuracy on unfamiliar pipelines is unknown
SQLite: not safe under more than 10 concurrent webhook POSTs
Single-tenant: no per-repo or per-user isolation
In-memory rate limiting: does not scale across multiple instances
Heuristic CPU/memory features: GitHub webhooks do not expose resource metrics
No automated retraining: acting on the PSI signal requires a manual make retrain
Prometheus counters reset on restart
Ephemeral inference history on HF Spaces: resolved when Turso (libsql) is configured, so history persists across Space restarts. Local dev and CI still use ephemeral SQLite.
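
For context on the SQLite concurrency limitation listed above, the sketch below enables WAL mode with the standard-library sqlite3 module. The table name and columns are hypothetical and the database is a throwaway temp file; it only illustrates the pragma, not the project's SQLAlchemy setup.

```python
import sqlite3
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "sentinel.db"
    # timeout makes a writer wait for a lock instead of failing immediately.
    conn = sqlite3.connect(db_path, timeout=5.0)
    # WAL lets readers proceed while one writer appends to the log.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

    conn.execute("CREATE TABLE inference (id INTEGER PRIMARY KEY, label TEXT)")
    conn.execute("INSERT INTO inference (label) VALUES (?)", ("Timeout",))
    conn.commit()
    rows = conn.execute("SELECT COUNT(*) FROM inference").fetchone()[0]
    conn.close()

print(mode, rows)
```

WAL improves read concurrency, but SQLite still serialises writers, which is why bursts of simultaneous webhook POSTs remain a limitation.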

Roadmap

GitHub App for automatic multi-repo webhook registration and per-org data isolation
GitHub Actions marketplace action: post failure classification as a PR comment
PostgreSQL migration for concurrent write safety and persistent HF deployment
