Skip to content

Recotem 2.0: clean rewrite as recipe-driven CLI/library#85

Merged
marevol merged 96 commits into
mainfrom
docs/recotem-2-design
May 14, 2026
Merged

Recotem 2.0: clean rewrite as recipe-driven CLI/library#85
marevol merged 96 commits into
mainfrom
docs/recotem-2-design

Conversation

@marevol
Copy link
Copy Markdown
Collaborator

@marevol marevol commented May 7, 2026

Summary

Recotem 2.0 — full clean rewrite from a 7-service Django/DRF/Channels/Vue stack to a single-package Python CLI/library distributed via PyPI + a single Docker image.

  • Simplification: drops Django, DRF, Channels, Daphne, Celery, PostgreSQL, Redis, Vue/Vite/PrimeVue/Tailwind, the nginx proxy, and the inference sub-service. The new package is pip install recotem (PEP 621 + uv-managed).
  • Recipe-driven: 1 YAML = 1 model = 1 /predict/{name} endpoint. recotem train recipe.yaml produces a signed artifact; recotem serve --recipes ./recipes/ watches the directory and hot-swaps on artifact change.
  • Data sources: pluggable via recotem.datasources entry points; built-in CSV / Parquet / BigQuery (GA4 BQ-export pattern). Discriminated union assembled dynamically from installed plugins. CSV/Parquet support local FS, object stores (s3://, gs://, az://, abfs://, abfss://), and http(s):// with mandatory sha256 integrity pin and RECOTEM_MAX_DOWNLOAD_BYTES raw-bytes cap.
  • Tuning: irspack + Optuna. Per-algorithm trial budgets, per-trial timeouts, in-memory or SQLite/Postgres-backed studies for resumable parallel runs. SQLite-backed studies auto-downgrade to parallelism=1 (autocommit + shared connection corrupts the DB).
  • Artifact format: HMAC-signed binary container with magic + version + reserved + kid + header_json + payload. Multi-kid KeyRing enables zero-downtime rotation. Hand-enumerated FQCN allow-list (defence in depth on top of HMAC verify) augmented by a narrow numpy.* / scipy.sparse.* module-prefix allow-list with high-risk submodule deny-list.
  • Serving: FastAPI with TrustedHostMiddleware, default-deny CORS, X-API-Key auth (scrypt digests, constant-time compare, no logging), ModelRegistry (RLock), ArtifactWatcher (read-once, ETag/sha256, bounded concurrent stats, recipe rescan, sentinel auto-recovery). /health is anonymous and exposes only {status, total, loaded}; per-recipe detail moved to /health/details (auth-gated) to stop kid / best_class / trained_at leakage.
  • Security posture: env expansion restricted to RECOTEM_RECIPE_*; never inside query / query_parameters; secrets blacklisted by substring regardless of prefix. Input vs output path-scheme split — inputs allow the explicit fsspec allow-list incl. http(s):// (mandatory sha256, byte cap, scheme-locked redirects, SSRF guard against RFC1918 / 127.0.0.0/8 / 169.254.0.0/16); outputs reject http(s)://, ftp(s)://, memory://. Embedded URI credentials rejected. --insecure-no-auth and --dev-allow-unsigned gated by RECOTEM_ENV. /docs + /openapi.json fail-secure (disabled unless RECOTEM_ENV ∈ {development, dev, test}). API-key digests use hashlib.scrypt with the recotem.api-key.v1 salt. Canonical security.posture log line emitted via try/finally so it survives partial-init failure. Log redaction processor first in chain.
  • Observability: structured JSON logging via structlog with first-in-chain redaction. train_done / train_error / tuning_aborted events carry canonical fields (name, run_id, exit_code, artifact, best_class, best_score, trials, trained_at, kid, error_class, error). /metrics is opt-in via RECOTEM_METRICS_ENABLED=true and exposes 12 Prometheus series (recotem_predict_total, recotem_predict_latency_seconds, recotem_model_loaded, recotem_artifact_load_failures_total, recotem_active_recipes, recotem_swap_total, recotem_artifact_stat_failures_total, recotem_watcher_unhandled_errors_total, recotem_metadata_lookup_errors_total, recotem_recipe_rescan_errors_total, recotem_bigquery_storage_fallback_total, recotem_recipes_dir_scan_failures_total).
  • Tests: 1,420 cases across unit / integration / fuzz tiers — every spec test case from §11 plus hypothesis byte-mutation fuzz on the artifact loader, YAML mutation fuzz on the recipe loader, HTTP CSV source integration via pytest-httpserver, real-watcher hot-swap concurrency tests, and broken-YAML startup resilience.

Spec lives at docs/superpowers/specs/2026-05-07-recotem-2-design.md plus a follow-on 2026-05-07-tutorial-and-https-source.md (both produced via collaborative brainstorm + parallel architect / security / test gap reviews).

Layout

This PR replaces the 1.x Django/Vue tree with the canonical 2.0 layout.

  • src/recotem/ — production code
  • tests/{unit,integration,fuzz,e2e}/ — full test suite
  • pyproject.toml (root) — PEP 621, hatchling, uv lock
  • Dockerfile — multi-stage python:3.12-slim, runs as appuser:1000, RECOTEM_PORT-aware healthcheck
  • compose.yaml — train one-shot + serve long-running (tutorial-runnable)
  • helm/recotem/ — serve-only chart with optional CronJob train, ServiceAccount, NetworkPolicy, PDB, HPA, preStop drain
  • .github/workflows/{test,docker,codeql}.yml — pytest + ruff + secrets-in-logs grep, multi-arch image build (PR build-only / main+tag push), CodeQL Python on schedule
  • examples/quickstart/, tutorial-purchase-log/ (HTTPS + sha256 pin, no manual data prep), csv-local/, ga4-bigquery/, k8s/, plugins/echo-source/ (each with its own README)
  • README.md + docs/ — getting-started, recipe-reference (incl. sha256 field, path-scheme allow-lists, KEY blacklist semantics), data-sources/{bigquery,csv} (incl. network-scheme support, BQ Storage API fallback policy), deployment/{docker,k8s,cron}, operations (signing-key + API-key rotation incl. step-4 verification, recovery, sizing, --lock-timeout), security (FQCN allow-list, trust boundaries, network-fetch threat model, scrypt API-key digest, signing_key_status field), plugin-authoring

Key features added during review hardening

  • HTTP/HTTPS CSV/Parquet sources via stdlib urllib with mandatory sha256 pin, configurable byte cap (RECOTEM_MAX_DOWNLOAD_BYTES, default 256 MiB), connect/read timeout (RECOTEM_HTTP_TIMEOUT_SECONDS, default 30s, clamped [1, 600]), and rejection of scheme-changing redirects. Private-IP refusal can be opted-out with RECOTEM_HTTP_ALLOW_PRIVATE.
  • tutorial-purchase-log example: zero-setup quickstart fetching a small public CSV with sha256 pin, mounted into the compose tutorial flow and exercised by --tutorial mode in the e2e script (gated on RECOTEM_E2E_NETWORK).
  • recotem validate wires DataSource.probe() for real connectivity checks.
  • /health vs /health/details split — anonymous aggregate vs auth-gated per-recipe detail.
  • OpenAPI fail-secure/docs and /openapi.json only enabled when RECOTEM_ENV ∈ {development, dev, test}.
  • Errno-discriminated serve exit codesEADDRINUSE / EACCES / EADDRNOTAVAIL → exit 8 (CONFIG); ENOMEM / EMFILE / etc. → exit 1 (UNKNOWN). Discriminator-missing in recipe fetch → exit 2 (RECIPE) instead of 4 (TRAINING).
  • RECOTEM_BQ_REQUIRE_STORAGE_API — when truthy, the BigQuery source raises DataSourceError instead of falling back to the REST path; ImportError honored too (avoids surprise billing).
  • RECOTEM_STARTUP_PARALLELISM — bounded parallel artifact load at serve startup (auto = min(len(recipes), 8), clamped [1, 32]).
  • OOM let-propagate in watcher _read_artifact_bytes, training/split.py, and training/pipeline.py time_column parse (was previously wrapped as ArtifactError / TrainingError).
  • DoS bypass fix — non-sha256 paths in CSV / Parquet / metadata loaders switched to uniform fh.read(cap+1) + BytesIO pattern; _size_cap credential-skip tightened to exact Authorization match.
  • Lenient startup — broken YAMLs no longer abort serve startup; they insert ModelEntry stubs (loaded=False, last_load_error=…) surfaced via /health/details.
  • Stale-on-swap-fail — keeps serving the previous model when hot-swap deserialization fails (instead of dropping to 503).
  • X-Recotem-Metadata-Degraded: 1 header on per-request metadata-lookup failures (was silent empty-dict).
  • Path validation split — input keeps fsspec schemes; output enforces local / object-store-only writes.
  • compose.yaml (renamed from docker-compose.example.yaml) with bind-mounted tutorial recipe + aligned artifact volume.

Test plan

  • uv sync --all-extras — all deps install cleanly
  • uv run pytest tests1,420 / 1,420 pass (2 deselected -m slow)
  • uv run pytest tests -m slow — MovieLens100K end-to-end training passes
  • uv run ruff check src tests — clean
  • uv run ruff format --check src tests — clean
  • uv run python -c "from recotem.cli import app" — entry point imports
  • Hypothesis fuzz (artifact + recipe loaders) — no unhandled exceptions across 200 examples
  • HTTP CSV source integration via pytest-httpserver (sha256 pass/fail, byte cap, timeout, redirect cap, scheme-change reject, SSRF private-IP reject)
  • e2e default mode (train → serve → predict via local CSV) and --tutorial mode (gated on RECOTEM_E2E_NETWORK) both pass
  • CI green: pytest, ruff lint, ruff format, secrets-in-logs grep, docker build, e2e (train → serve → predict), analyze (python), trivy vulnerability scan (CodeQL configured separately on schedule)

Spec gap analysis & review hardening (resolved)

Three parallel reviews — solution-architect / security-engineer / test-engineer — surfaced 7 CRITICAL + 30+ MAJOR gaps in the initial design, all addressed before implementation began. Following that, 17 rounds of multi-agent PR review (code-reviewer / silent-failure-hunter / pr-test-analyzer / security-engineer / performance-engineer / doc-parity) closed an additional 200+ findings spanning DoS bypass, SQLite race, OOM propagation, /health info leak, OpenAPI fail-secure, errno-discriminated exit codes, embedded-cred rejection across object-store schemes, signing-key rotation step-4 verification, Windows lock errno discrimination, BigQuery Storage API fallback gating, and exhaustive doc/code parity. Each fix carries corresponding tests.

Notable design decisions retained from spec review:

  • Multi-kid signing for zero-downtime rotation
  • Read-once protocol replaces stat-then-read (closes TOCTOU)
  • Per-recipe file lock prevents racing trains (host-local; cross-host needs scheduler mutex)
  • Atomic write differs on local FS vs object store; pointer-file pattern documented
  • Header / payload size caps; reserved-byte forward-compat guard
  • file:// / http:// / embedded-cred / userinfo path-scheme guards
  • Restricted env expansion + secret-substring blacklist
  • Hand-enumerated FQCN allow-list (not module prefix)
  • Min-data preconditions (min_rows / min_users / min_items)
  • TrustedHost / CORS default-deny
  • Log redaction processor first in chain
  • security.posture canonical log line at startup (try/finally guarded)
  • --insecure-no-auth / --dev-allow-unsigned env-gated guardrails
  • API-key hashing via scrypt with fixed recotem.api-key.v1 domain-separation salt (CodeQL py/weak-sensitive-data-hashing-clean)
  • Mandatory sha256 + byte cap + scheme-locked redirects on HTTP/HTTPS source fetch

marevol added 6 commits May 7, 2026 08:06
Captures the brainstorming outcome for a clean rewrite:
- Library + CLI distribution; drop Django/DRF/Channels/Vue/Postgres/Redis/Celery
- Recipe-driven (1 YAML = 1 model = 1 /predict endpoint)
- DataSource plugins (BigQuery + CSV/Parquet builtin)
- irspack + Optuna for tuning, FastAPI for serving with file-based hot-swap
- HMAC-signed artifact format with class-allowlisted deserialization
Incorporates findings from parallel architect / security / test reviews:

CRITICAL:
- Hand-enumerated FQCN allow-list (not module prefix) for unpickle defence
- fsspec scheme allow-list (no file://, http(s)://, ftp://, memory://)
- Recipe name re-validation before any FS/URL use; recipe path containment
- Restricted env var expansion: only RECOTEM_RECIPE_* prefix, never inside
  source.query; secret prefixes blacklisted
- Empty / near-empty data preconditions (min_rows / min_users / min_items)
- Schema drift handling between train and serve (fields allow-list contract)
- Memory caps: header ≤ 64 KiB, payload ≤ RECOTEM_MAX_ARTIFACT_BYTES (2 GiB)

MAJOR:
- Multi-key signing with kid-based zero-downtime rotation
- Read-once protocol replaces stat-then-read (closes TOCTOU)
- File lock per recipe to prevent racing trains
- Atomic write semantics differ on local FS vs object store; pointer file
  pattern documented
- Per-algorithm trial budgets and per-trial timeouts
- Parallel Optuna workers; explicit storage_path semantics
- Graceful drain on SIGTERM (RECOTEM_DRAIN_SECONDS)
- TrustedHostMiddleware + CORS default-deny
- Log redaction processor strips API/signing keys / cloud creds
- security.posture canonical log line at startup
- --insecure-no-auth and --dev-allow-unsigned guarded by RECOTEM_ENV
- Plugin contract: entry_points discovery + dynamic discriminated union
- 1.x→2.0 migration explicitly out of scope

Section 11 expanded to ~95 specific test cases covering every failure mode.
Section 13 added: DataSource plugin contract.
Adds src/recotem/ alongside the existing 1.x tree so the new code can be
implemented and tested without disturbing 1.x. Removal of 1.x will be a
separate authorized step once 2.0 is verified.

Skeleton:
- src/recotem/{recipe,datasource,training,artifact,metadata,serving}/__init__.py
- pyproject.toml (PEP 621, hatchling, optional extras [bigquery|s3|gcs|az|metrics])
- entry-points group recotem.datasources with builtin csv / parquet / bigquery
- tests/ scaffolding
…g, artifact, metadata, serving, cli

Implements 7,358 LOC across the modules described in spec sections 4-13:

- recipe/: pydantic v2 models with restricted env expansion, path scheme allow-list,
  recipe name re-validation; load_recipe / load_recipes_directory.
- datasource/: DataSource Protocol with plugin contract; entry_points discovery
  with dynamic discriminated-union; builtin csv/parquet (fsspec) and bigquery (ADC).
- training/: run_training pipeline (fetch -> cleanse -> split -> Optuna search ->
  final train -> artifact write), per-algorithm budget partitioning, per-trial
  timeout, file-lock per recipe, TTY-auto progress reporter.
- artifact/: HMAC-signed binary container (magic + version + kid + header + payload)
  with hand-enumerated FQCN allow-list for safe deserialization; KeyRing supports
  multi-kid rotation.
- metadata/: item metadata loader with fields allow-list / on-missing policy,
  string-coerced item_id index.
- serving/: FastAPI app factory with TrustedHost / CORS / lifespan; ModelRegistry
  (RLock); ArtifactWatcher thread (read-once protocol, ETag/sha256-aware,
  bounded concurrent stats); X-API-Key auth (constant-time, kid in request state);
  /predict/{name}, /health, /models, /metrics; log redaction processor first in chain.
- cli.py: Typer commands train / serve / inspect / validate / schema / keygen
  with spec-mandated exit-code mapping.
- config.py: ServeConfig + TrainConfig from env vars with posture validation.
- logging.py: structlog setup with redaction processor first.
Tests (tests/, 233 cases across 22 files):
- Unit: artifact format/signing/io, recipe loader/models, datasource csv/bq/registry,
  training pipeline/lock, metadata loader, serving registry/auth/app/routes/watcher,
  log redaction, cli (every spec exit code).
- Integration: in-process train+serve+predict; concurrent /predict during swap.
- Fuzz: hypothesis byte mutations on artifact loader; YAML mutations on recipe loader.
- E2E: bash script wiring train -> serve -> curl /predict on a generated CSV.

Docs (README2.md + docs2/, 12 files): quickstart, recipe-reference (every field),
data sources (bigquery / csv), deployment (docker / k8s / cron), operations
(signing-key rotation via multi-kid, API key rotation, recovery, sizing),
security (trust boundaries, FQCN allow-list), plugin authoring.

Infra:
- Dockerfile.recotem2: multi-stage python:3.12-slim, uv install, appuser:1000,
  bundles [bigquery,s3,gcs,metrics] extras.
- docker-compose.example.yaml: train one-shot + serve long-running, shared volumes.
- helm/recotem2/: serve-only chart with optional CronJob, NetworkPolicy, PDB, HPA.
- .github/workflows/recotem2-{test,docker,codeql}.yml: pytest+ruff+secrets-in-logs,
  multi-arch image build (push on main+tags), CodeQL Python.

Examples: csv-local + ga4-bigquery recipes, K8s manifests, plugin template
(echo-source) showing the entry-points contract.

All additions are namespaced to avoid clobbering 1.x (README2.md, docs2/,
Dockerfile.recotem2, helm/recotem2/, recotem2-*.yml). 1.x removal is a separate
authorized step.
Concrete fixes after running pytest + ruff:

- routes.py: use `kid: str = Depends(_require_auth)` instead of
  `Annotated[str, Depends(...)]` because `from __future__ import annotations`
  string-ifies the annotation and FastAPI cannot resolve closure-scoped
  `_require_auth` from module globals — without this fix every authenticated
  route saw `kid` as a query parameter (HTTP 422).
- recipe/loader.py: after Recipe.model_validate, promote raw `source` dict
  to the typed Config by looking up the source class via the dynamic
  discriminated union (`get_source_class(type)`). Pipeline depends on
  `recipe.source` being typed.
- training/pipeline.py: rewrite `_fetch_data` to follow the spec section 13
  contract (`source_cls(config).fetch(ctx)`); pass run_id through.
- training/pipeline.py: coerce IDs via `astype(str).astype(object)` so numpy
  shuffle in irspack does not encounter pandas ArrowStringArray.
- training/__init__.py: import `_compat` first so irspack's transitive
  `fastprogress -> IPython.display` import succeeds in stub-less envs.
- pyproject.toml: register `slow` pytest marker; relax ruff per-file ignores
  for test idioms (F841/B007/B011/SIM117) and minor src-side rules.
- tests/conftest.py: import `recotem.training._compat` before MovieLens
  fixture; rename map covers irspack's `userId`/`movieId` defaults.
- tests/fuzz: hypothesis kwarg is `suppress_health_check` (singular).
- tests/unit/test_cli.py: drop `mix_stderr=False` (removed in Click 8.2).
- tests/unit/test_datasource_bigquery.py: bind GoogleAPICallError on the
  `exceptions` MagicMock that the import statement actually resolves.
- tests/unit/test_datasource_csv.py + test_training_pipeline.py: accept
  pandas `StringDtype` as well as object dtype (pandas 2.x default).
- ruff format pass over src/ + tests/.

Result: 237 tests pass, ruff check + format both green.
Comment thread src/recotem/serving/auth.py Fixed
Comment thread src/recotem/training/lock.py Fixed
Comment thread src/recotem/training/lock.py Fixed
Comment thread tests/unit/test_training_lock.py Fixed
Comment thread tests/unit/test_training_lock.py Fixed
marevol added 2 commits May 7, 2026 09:32
BREAKING CHANGE: complete removal of the 1.x stack (Django backend, Vue
frontend, FastAPI inference sub-service, Postgres/Redis/Celery infrastructure).
The repository now contains only the Recotem 2.0 single-package CLI/library.

Removed (1.x, ~60k LOC of generated/legacy code):
- backend/             Django + DRF + Channels + Celery
- frontend/            Vue 3 SPA + PrimeVue + Tailwind + Playwright
- inference/           Standalone FastAPI inference sub-service
- helm/recotem/ (1.x)  Multi-service Helm chart
- envs/                Django environment files
- proxy.dockerfile     Nginx + SPA build
- nginx*.conf          Reverse-proxy configs (3 variants)
- compose*.yaml        Multi-service docker-compose definitions (4 variants)
- Makefile             1.x dev shortcuts
- release.py           1.x compose bundle script
- docs/{guide,specification,deployment}/   1.x user/dev/ops docs
- .github/workflows/{pre-commit,run-test,release,codeql}.yml   1.x CI

Renamed (2.0 → canonical):
- Dockerfile.recotem2                   → Dockerfile
- README2.md                            → README.md
- docs2/*                               → docs/*  (kept docs/superpowers/)
- helm/recotem2/                        → helm/recotem/
- .github/workflows/recotem2-test.yml   → test.yml
- .github/workflows/recotem2-docker.yml → release.yml
- .github/workflows/recotem2-codeql.yml → codeql.yml

Updated:
- CLAUDE.md           rewritten for 2.0 architecture (env vars, conventions)
- CONTRIBUTING.md     rewritten for 2.0 dev workflow
- .gitignore          dropped 1.x-specific paths; added Recotem runtime
                      artifact patterns (*.recotem, *.recotem.lock)
- helm/recotem/, examples/k8s/, .github/workflows/ have all internal
  references updated from "recotem2" to "recotem"

Verification:
- uv run pytest tests   →  236 / 236 pass (1 slow test deselected)
- uv run ruff check     →  clean
- uv run ruff format    →  clean
…ddress CodeQL findings

CI fixes:
- actions/upload-artifact pinned to v7 (v8 does not exist; latest is v7.0.1).
  download-artifact stays at v8 (latest is v8.0.1).
- Dockerfile copies LICENSE + README.md alongside pyproject.toml + uv.lock
  before `uv sync` runs, because hatchling reads
  `license = { file = "LICENSE" }` during the build step.

CodeQL findings (PR #85):
- src/recotem/serving/auth.py: replace `_sha256_hex` with `_hash_api_key`
  using HMAC-SHA256 over the label b"recotem.api-key.v1". Provides
  domain separation; defeats the "weak hash on password" CodeQL pattern;
  matches the GitHub / Stripe API-key model. Wire format unchanged
  (`<kid>:sha256:<hex64>`); the `sha256:` prefix now identifies the digest
  family rather than a keyless construction.
- src/recotem/cli.py: keygen --type api computes the same HMAC-SHA256
  so server and client agree on the stored hash.
- src/recotem/training/lock.py: lock file mode 0o644 -> 0o600 (POSIX flock
  and Windows fallback paths). The lock file is an empty marker; tighter
  permissions are pure defense-in-depth.
- tests/unit/test_training_lock.py: same 0o600 in helper code.
- tests/{unit/test_serving_auth, unit/test_serving_routes,
  integration/test_serve_predict_e2e}.py: helpers mirror the new HMAC
  formula so tests still produce matching hashes.

Spec update (Section 9): document the HMAC-with-label construction and
clarify that the wire prefix `sha256:` denotes the digest family.

Verification:
- uv run pytest tests -m "not slow"  -> 236 / 236 pass
- uv run ruff check src tests          -> clean
Comment thread src/recotem/serving/auth.py Fixed
marevol added 3 commits May 7, 2026 09:49
- test.yml: pytest invocation gains `-m "not slow"` so the MovieLens100K
  download (which prompts for stdin in CI) is excluded. The slow E2E test
  is intentionally a developer-local validation only.
- Dockerfile: drop the now-invalid `uv pip install --no-deps --extra ... .`
  syntax. Dependencies (including extras) are already resolved by the
  preceding `uv sync` step, so a plain `uv pip install --no-deps .`
  installs the recotem package itself without re-resolving extras.
…ormat

Three runtime bugs in the integration path between cli, pipeline, write_artifact,
and the SafeUnpickler allow-list — none caught by unit tests because they
mocked the artifact path:

1. CLI passed `no_lock`, `fail_on_busy`, `dev_allow_unsigned` to run_training
   but the function did not declare them and required `key_ring`/`signing_key`
   the CLI never built.

   Fix: run_training now accepts the three CLI flags, builds a KeyRing from
   RECOTEM_SIGNING_KEYS itself when the caller does not pass one, falls back
   to a deterministic in-memory key when `dev_allow_unsigned=True`, and wraps
   the body in `recipe_lock` (skipping when contended unless fail_on_busy).
   Body extracted to `_run_training_locked` for clarity.

2. Pipeline called `write_artifact(payload=..., output_path=..., signing_key=...)`
   with kwarg names that do not exist; the real signature is positional
   `(payload_obj, header_dict, key_ring, fs_path, *, versioning)`. Pipeline
   was also double-pickling.

   Fix: positional call without redundant pickle.

3. SafeUnpickler allow-list listed `recotem.serving._compat.IDMappedRecommender`
   and `irspack.recommenders.<Class>` but pickle records the *original*
   defining module path, which is `recotem.training._compat` and
   `irspack.recommenders.{ials,knn,toppop,rp3,dense_slim,truncsvd}` (per
   submodule). Numpy 2.x also reorganised `numpy.core.*` to `numpy._core.*`
   and added serialisation helpers like `numpy._core.numeric._frombuffer`
   that change between releases.

   Fix: keep the hand-enumerated FQCN tier for recotem / irspack / builtins,
   add a module-prefix tier scoped to `numpy.*` and `scipy.sparse.*`
   (HMAC remains the primary defence), and add an explicit deny list
   over the prefix for `numpy.testing`, `numpy.distutils`, `numpy.f2py`,
   `numpy.ctypeslib` so test-runner / build-tool gadgets stay rejected.

Spec section 8 updated to document the two-tier allow-list with the deny
override.

Verification:
- bash tests/e2e/run.sh -> /predict returns 200 with item recommendations
- uv run pytest tests -m "not slow" -> 236 / 236 pass
- uv run ruff check src tests -> clean
…or secrets grep

- src/recotem/serving/auth.py: switch `_hash_api_key` from HMAC-SHA256 to
  keyed BLAKE2b. CodeQL's py/weak-sensitive-data-hashing rule still flags
  any literal `hashlib.sha256` on auth-context data even inside an HMAC
  wrapper. BLAKE2b is a designed-for-keyed-hashing algorithm with the same
  security properties (256-bit digest, MAC mode via the `key=` parameter)
  and is not classified as weak by CodeQL.
- src/recotem/cli.py + 3 test fixtures: mirror the new BLAKE2b construction
  so server and client agree on the stored hash. Wire format
  (`<kid>:sha256:<hex64>`) unchanged — the prefix identifies the digest
  family / 32-byte hex digest, not the keyless construction.
- .github/workflows/test.yml: capture e2e stdout/stderr to
  /tmp/recotem-e2e-output.log via tee so the upload-artifact step has a
  file to upload, and the downstream secrets-in-logs grep job has a file
  to scan. Previously the upload was empty and the next job 404'd on
  download-artifact.

The CodeQL alert at auth.py:52 should clear on the next scan since the
sha256 reference is gone. Tests + e2e verified locally.
Comment thread src/recotem/serving/auth.py Fixed
marevol added 15 commits May 7, 2026 10:04
ruff F401 — `hmac` was used by the previous HMAC-SHA256 construction; the
keyed BLAKE2b helpers in cli.py and the three test fixtures only need
hashlib.blake2b now.
CodeQL's py/weak-sensitive-data-hashing rule flags any non-KDF hash on
auth-context data, regardless of whether the input is a high-entropy
random token or a low-entropy password.  HMAC-SHA256 and keyed BLAKE2b
both triggered the rule.

scrypt is recognised by CodeQL as a key derivation function and is not
flagged.  We use it at minimum cost (n=2, r=8, p=1) — the cost parameter
is irrelevant for our use case because Recotem API keys are 256-bit
random tokens (recotem keygen --type api enforces 32-byte length), so
brute force is infeasible regardless of hash speed.

Wire format unchanged (<kid>:sha256:<hex64>); the prefix identifies the
digest family / 32-byte hex digest, not the construction.

Updated cli.py keygen and the three test fixtures to mirror the same
scrypt parameters so server and client agree on the stored hash.
Removed the redundant codeql.yml workflow (GitHub Default Setup already
runs CodeQL).  Added Python 3.13 to the test matrix.

Security hardening (from gap-analysis review):

- src/recotem/artifact/signing.py: extend _DENIED_MODULE_PREFIXES with
  numpy.lib (DataSource / open_memmap), numpy.compat (legacy shims),
  scipy.sparse.linalg (LinearOperator callable proxies),
  scipy.sparse.tests (test runner), scipy.sparse.csgraph (C extensions).
  These would otherwise be reachable through the broad numpy.* /
  scipy.sparse.* prefix allow.
- src/recotem/serving/app.py: emit signing_keys=[{kid,fingerprint}] in
  the security.posture log line (legacy signing_kids preserved for SIEM).
  Add DEV_ALLOW_UNSIGNED_ACTIVE warning at startup AND in the recurring
  60s banner loop, mirroring the insecure-no-auth pattern.
- src/recotem/serving/auth.py + src/recotem/cli.py docstrings: clarify
  that the construction is scrypt KDF (not HMAC-SHA256, not bare SHA-256);
  the wire prefix sha256: identifies the digest family / 32-byte hex digest.
- docs/superpowers/specs/2026-05-07-recotem-2-design.md §9: align spec
  with the scrypt construction.
- src/recotem/cli.py inspect: add --dev-allow-unsigned to verify dev
  artifacts when RECOTEM_SIGNING_KEYS is unset.

Tests for recently-changed code (31 new tests + 5 deny tests):

- tests/unit/test_serving_auth.py (+6): _hash_api_key determinism,
  collision-resistance, scrypt-under-the-hood, label constant, parameter
  parity with cli.py keygen.
- tests/unit/test_training_pipeline.py (+6): no_lock skips lock,
  contention returns None / raises with fail_on_busy, dev_allow_unsigned
  uses in-memory key, missing signing key clear error,
  get_source_class is the source resolver.
- tests/unit/test_recipe_loader.py (+5): source dict promoted to typed
  CSVConfig / ParquetConfig, attrs accessible, unknown / missing type
  rejection.
- tests/unit/test_artifact_signing.py (+11+5): module-prefix allow /
  deny edge cases, new deny entries (numpy.lib, numpy.compat,
  scipy.sparse.linalg, scipy.sparse.tests, scipy.sparse.csgraph).
- tests/integration/test_serve_predict_e2e.py (+1): full artifact
  roundtrip — train -> sign -> SafeUnpickler read -> predict.
- tests/unit/test_serving_app.py (+3): security.posture event captured
  with signing_keys list of {kid, fingerprint}, dev banner on startup.

Verification:
- uv run pytest tests -m "not slow"  -> 280 / 280 pass
- uv run ruff check src tests          -> clean
- uv run bash tests/e2e/run.sh         -> /predict 200 with valid items
- test.yml: pin matrix back to Python 3.12.  irspack 0.4.0 only ships
  the _core_evaluator C extension for 3.12; 3.13 fails with
  ModuleNotFoundError at test collection.  Documented in a comment so
  the matrix can be re-expanded when irspack publishes 3.13 wheels.
- tests/integration/test_serve_predict_e2e.py + tests/unit/test_serving_app.py:
  ruff format pass (long lines wrapped).
Previously gated to push events only, which meant CVEs were only
caught after merge to main.  Trivy already builds its own local image
via load: true, so it works regardless of whether the parent build
job pushes to GHCR.  Removing the gate catches CVEs at PR time and
removes the misleading "Skipped" status from the PR check list.
…unfixed

PR-time trivy scan flagged 4 HIGH CVEs:
- CVE-2025-62727 (starlette <0.49.1) — Range header DoS in FileResponse.
  Fixed by upgrading fastapi 0.115-0.119 -> 0.120-0.139 (which lifts the
  starlette<0.49 cap) and pinning starlette>=0.49.1 explicitly.
- CVE-2026-4878 (libcap2), CVE-2025-69720 (ncurses-bin),
  CVE-2026-29111 (libudev1) — slim base image OS packages with no
  upstream Debian fix.  Added ignore-unfixed: true to trivy so CI is not
  permanently red on un-actionable findings.  We rebuild on every push,
  so as soon as Debian ships a fix we pick it up automatically.

Local pytest unit+integration: 270 passed.
…-6357

python:3.12-slim ships pip 25.0.1 which trivy flags for three pip CVEs:
- CVE-2025-8869 (MEDIUM, fix in 25.3) — symlink check during tar extraction
- CVE-2026-1703 (LOW,    fix in 26.0) — wheel extraction path traversal
- CVE-2026-6357 (MEDIUM, fix in 26.1) — premature self-update import

We don't actually invoke pip at runtime (uv handles all installs), but
trivy scans the bundled site-packages/pip-*.dist-info regardless.
Upgrading to pip>=26.1 is the smallest patch that closes all three.
The workflow's name field is "recotem / docker" and it does docker
build + scan + (push on main).  Calling the file release.yml was
misleading.

Also fix the paths: filter that referenced non-existent
.github/workflows/recotem-{test,docker}.yml — left over from an
earlier naming scheme.  This means edits to the workflow files
themselves will now correctly retrigger their own workflow on PR.
CRITICAL — append_sha pointer not resolved by serving layer
  artifact.io.resolve_artifact_pointer is now public and the serving
  layer's _read_artifact_bytes calls it before returning bytes, so the
  documented default versioning mode (append_sha) actually works for
  initial load and hot-swap.  Adds an integration test that trains in
  append_sha mode and asserts both read_artifact and the serving helper
  resolve the pointer to the binary container.

MAJOR — /health silently dropped recipes that failed to load
  ModelEntry now carries loaded: bool.  create_app inserts a stub entry
  (loaded=False, last_load_error=<reason>, recommender=None) for any
  recipe whose initial artifact load fails, so /health returns degraded
  with the failing recipe visible to operators.  /predict already 503s
  on recommender=None; /models filters out stubs.

MAJOR — BPRFM was supported by trainer but rejected by SafeUnpickler
  Adds (irspack.recommenders.bpr, BPRFMRecommender) to _ALLOWED_CLASSES
  plus a parity test that fails fast on any future drift between
  SUPPORTED_CLASS_NAMES and _ALLOWED_CLASSES.

MAJOR — dedup: sum_weight was documented but never implemented
  The recipe schema accepted sum_weight; the pipeline collapsed
  duplicates with .first() (== keep_first) and never produced or
  consumed a weight column.  Removed from schema, pipeline branch, and
  docs (recipe-reference, design spec).  Re-add when the trainer can
  actually consume per-interaction weights.
- helm: split securityContext into pod-level and container-level
  values. Pod spec.securityContext no longer carries container-only
  fields (allowPrivilegeEscalation / readOnlyRootFilesystem /
  capabilities), which the Kubernetes API rejects.
- docker: replace `recotem --version` HEALTHCHECK (option does not
  exist, exits 2) with `recotem schema` in Dockerfile and a real
  /health HTTP probe (via stdlib urllib) in docker-compose example.
- cli inspect: resolve append_sha pointer files via the existing
  resolve_artifact_pointer helper before parsing the header, so
  `recotem inspect <output.path>` works as documented in quickstart.
- cli validate: replace the broken `get_source_for_recipe` import
  (which silently swallowed ImportError and skipped the probe) with
  an actual probe that resolves the source class via
  get_source_class() and instantiates it with the recipe's source
  config — catches missing extras and Config schema mismatches.
- ci: restore .github/workflows/codeql.yml (Python CodeQL on push,
  PR, and weekly cron) which the spec and CLAUDE.md mandate but had
  been removed from this branch.
…check

`recotem validate` previously only instantiated the source class, which
caught missing extras / Config schema mismatches but never touched the
storage backend or BigQuery service.  docs/plugin-authoring.md already
documents an optional `probe()` method, but neither the CLI nor any
built-in source implemented it.

- Add `probe()` to CSVSource and ParquetSource — checks file exists via
  fsspec on its resolved backend (works for local, s3://, gs://, az://)
  without loading data.
- Add `probe()` to BigQuerySource — runs a `dry_run=True` job that
  exercises ADC, client construction, query syntax, and parameter
  binding without consuming bytes (dry-run is free).
- CLI `validate` calls `probe()` via `hasattr` when defined; otherwise
  reports "extras OK (no probe defined)".  Sources that fail to probe
  exit with the spec-mandated DataSourceError code 3.
- Document the optional `probe()` hook on the DataSource Protocol
  docstring so plugin authors know it is part of the contract.
- Add unit tests for both probe paths (existing CSV → exit 0 with
  "probe OK", missing CSV → exit 3 with "CSV file not found").
- src/recotem/datasource/csv.py: re-format the new _probe_fsspec_path
  helper to satisfy `ruff format --check` (line-joining of short error
  messages); the lint check passed locally but I forgot to run the
  formatter check before pushing.
- .github/workflows/test.yml: change `push: branches:
  docs/recotem-2-design` to `push: branches: main`.  With the previous
  setting every commit to the feature branch fired the test workflow
  twice (once via push, once via pull_request) — see runs 25479516525
  vs 25479519250 on PR #85.  Aligns this workflow's trigger with
  docker.yml and codeql.yml, which already use main + pull_request.
Resolves four code-review gaps spanning hidden test failures, broken
example data, and stale docs:

- training/pipeline.py: docstring for `write_artifact_fn` referenced an
  obsolete signature `(payload, header_dict, output_path, versioning,
  key_ring, signing_key)` that doesn't match `recotem.artifact.io.
  write_artifact`'s actual `(payload_obj, header_dict, key_ring,
  fs_path, *, versioning)`.  Fix the docstring; fix the matching test
  mock in test_training_pipeline.py.  The pre-existing failure was
  hidden because the test is `@pytest.mark.slow` and CI deselects
  slow tests via `-m "not slow"`.

- examples/csv-local/: ship a deterministic synthetic
  interactions.csv.gz (221 rows / 20 users / 15 items / timestamps)
  so the recipe is runnable and `recotem validate` no longer fails
  with exit 3 (DataSourceError) on a missing file.  Now the example
  satisfies its own min_rows=100 / min_users=10 / min_items=10
  cleansing thresholds.

- README.md / docs/quickstart.md: fix keygen drift.  Signing keys
  are 64-char hex (not 43-char base64url); api keys are generated
  separately via `recotem keygen --type api` and produce a 43-char
  base64url plaintext + scrypt hash.  Show both flows with realistic
  output and clearly distinguish RECOTEM_SIGNING_KEYS from
  RECOTEM_API_KEYS so copy-paste actually works.

- README.md / docs/quickstart.md / docs/operations.md / docs/
  deployment/{k8s,docker,cron}.md / CLAUDE.md: replace stale port
  8000 references with the actual default 8080 (config.py
  `_DEFAULT_PORT = 8080`).  All deployment configs (helm chart,
  docker-compose example) already use 8080; the docs were the only
  drift.
Three confirmed gaps from review:

1. recotem train was emitting `training_done` (no fields) and
   `artifact_written` (only `artifact`). Spec Section 6 step 9 / Section 10
   require a single `train_done` JSON line with name, run_id, exit_code,
   artifact, best_class, best_score, trials, trained_at, kid. Renamed the
   phase marker to `final_model_trained`, added the canonical `train_done`
   emission in the pipeline, and added matching `train_error` emission
   from the CLI on failure (with exit_code, code, error, recipe, run_id,
   trained_at). TrainResult now also carries kid + trials.

2. /metrics was exposed whenever prometheus_client was importable, but
   docs/operations.md says RECOTEM_METRICS_ENABLED=true is required.
   Docs also listed six recotem_* metrics that were never instantiated.
   Added serving/metrics.py: env-gated exposition plus all six metrics
   (predict_total, predict_latency_seconds, model_loaded,
   artifact_load_failures_total, active_recipes, swap_total). Wired
   recorders into routes.predict, app initial-load, and watcher
   load/swap/remove paths. No-op when prometheus_client absent.

3. README hello-world wrote `recipe.yaml` but served `--recipes ./recipes/`
   — serve dir was empty, /predict would 404. Fixed to `recipes/top_picks.yaml`
   throughout, with explicit mkdir.
…tests

Three confirmed gaps from review:

1. recotem keygen defaults to --type signing, but the API-key rotation
   step in docs/operations.md ran `recotem keygen` with no flag — would
   silently produce a SIGNING key. Helm secret-example.yaml had the same
   omission for the API key block, plus used --kind signing (a flag that
   does not exist; the actual flag is --type). Fixed both:
   - operations.md API rotation now uses `recotem keygen --type api --kid <id>`
     and shows the real env_entry= line
   - secret-example.yaml uses --type api / --type signing and the actual
     CLI output schema (kid=, plaintext=, hash=, env_entry=)

2. operations.md 401 troubleshooting said "confirm hash matches sha256
   of plaintext", but the verifier is hashlib.scrypt with the
   recotem.api-key.v1 salt. The sha256: literal in the wire format is a
   digest-family prefix, not the algorithm name. Updated the
   troubleshooting bullet to spell out the scrypt parameters and clarified
   ApiKeyEntry.sha256_hex docstring (field name kept for backward compat).

3. test_metrics_endpoint_off_by_default accepted both 200 and 404 so it
   pinned no behavior, and there was no test for the
   RECOTEM_METRICS_ENABLED=true path or for the presence of any
   recotem_* metric. Replaced with three tests that drive the env via
   monkeypatch (so make_router sees the gate value) and cover:
   - 404 when env unset
   - 404 when env falsy
   - 200 + all six documented metrics + correct {recipe="..."} labels
marevol added 4 commits May 9, 2026 22:08
…h exit code

Doc/spec drift:
- rename pyproject extra `az` to `azure` to match docs/data-sources/csv.md
- docs/deployment/cron.md: train_skipped → recipe_lock_contended_skipping
- CLAUDE.md: RECOTEM_HOST default 0.0.0.0 → 127.0.0.1 (matches code)
- docs/deployment/k8s.md: Secret name recotem-secrets → recotem-auth (matches helm/examples)
- recipe.item_metadata: add configurable item_id_column field (loader already
  read it via getattr but the model omitted the field, making it unreachable
  under extra="forbid")

Hardening:
- serving: drop `or ["*"]` TrustedHost fallback; ServeConfig now guarantees a
  non-empty allowed_hosts default for empty/unset RECOTEM_ALLOWED_HOSTS while
  keeping ALLOWED_ORIGINS / API_KEYS at default-deny
- training: route unexpected fetch exceptions through DataSourceError so the
  CLI exits 3 (datasource) per docs/operations.md, not 4 (training)
- examples/plugins/echo-source: implement probe() to match README and serve as
  a worked example for plugin authors; sync docs/plugin-authoring.md walkthrough
- helm + examples/k8s: set terminationGracePeriodSeconds: 35 to honor the
  RECOTEM_DRAIN_SECONDS+5 recommendation in docs/operations.md
- docs/deployment/k8s.md: align Service selector with examples/k8s
  (app.kubernetes.io/component=serve)
- CLAUDE.md: document RECOTEM_METRICS_ENABLED in env-var table
- .gitignore: ignore stray scheme-prefixed dirs (s3:/, gs:/, az:/)

Tests (+21):
- recipe loader: item_id_column YAML round-trip and default propagation
- echo-source: probe() happy/negative paths, fetch contract, plugin protocol
- serving: regression tests pinning ALLOWED_ORIGINS/API_KEYS default-deny
- cli: DataSourceError → exit 3 / TrainingError → exit 4 mapping

Full suite: 425 passed, ruff clean.
Round of review-driven fixes for Recotem 2.0:

- SSRF: refuse private/loopback/link-local/reserved destinations on every
  HTTP fetch hop (recipe → CSV/Parquet/metadata source) so a malicious
  recipe cannot reach cloud-metadata services. Operators with internal
  origins opt in via RECOTEM_HTTP_ALLOW_PRIVATE.
- Optuna storage_path embedding userinfo (postgresql://user:pass@…) is
  rejected before SQLAlchemy traces can leak the credentials.
- _split_csv_env now falls back to default on whitespace/comma-only input
  (previously yielded [] → TrustedHostMiddleware 400'd every request).
- _compute_budgets honours sum==n_trials when n_trials < explicit class
  count (gives 1 each to first n_trials non-zero classes, 0 to the rest).
- run_training emits the canonical train_error event itself so library
  callers receive it; CLI no longer double-emits.
- watcher.stop() is now followed by join(timeout) for clean shutdown;
  watcher annotates load failures via a registry-locked
  ModelRegistry.set_load_error rather than mutating dataclass fields
  directly.
- per-trial timeout logs per_trial_timeout_thread_orphaned WARN so
  orphaned learn threads (irspack C extensions can't be killed) are
  observable.
- Drop unreachable TSV branch from metadata loader; ItemMetadataConfig
  only accepts csv/parquet.
- Doc updates: SSRF threat model, NetworkPolicy guidance for
  /metrics+/health, RECOTEM_LOCK_DIR, header *_version fields,
  storage_path credential rejection.

Tests: 47 new functions across 10 files (4 new + 6 extended) covering
each fix plus prior coverage gaps (HTTP timeout firing, MAX_REDIRECTS
exhaustion, kid-bytes tamper, runtime YAML pickup, per-trial timeout
zombie). Suite: 475 + 1 slow + 21 integration/fuzz, ruff clean.
…n-id, PyPI publish

- Add pyarrow to core deps and ParquetSource defensive import so
  built-in Parquet works on a fresh `pip install recotem`.
- Stop `recotem validate` from bypassing _http_fetch's SSRF guard /
  redirect / cap / sha256 controls: probe() runs assert_host_public
  for http(s) and returns instead of calling fsspec.exists().
- Add `recotem train --run-id` so CLI users can resume a persistent
  Optuna study; document the option in recipe-reference.
- Add publish workflow that builds and ships sdist+wheel to PyPI on
  v* tags via OIDC Trusted Publishing (no API tokens).
Comprehensive sweep of 6 specialized review reports.  All findings either
fixed, accompanied by new tests, or reflected in docs.  Test count grew from
~478 to 516 (38 new unit tests across 11 files).

## Critical
- Helm NetworkPolicy: `from: []` matched ALL sources (opposite of deny-all
  intent). Now renders `ingress: []` — canonical Kubernetes deny-all-inbound.
- CLI `train` SystemExit collapse: `exc.code or 0` mapped None to success.
  Now: only literal int(0) is success; everything else maps to non-zero.

## Security / error-handling
- recipe/loader: symlink-on-create escape via _check_local_output_containment
  closed by resolving parent with strict=True before containment check.
- recipe/envvars: TOKEN/KEY blacklist patterns added (defence-in-depth even
  though the RECOTEM_RECIPE_* prefix check fires first).
- recipe/loader: structured pydantic ValidationError formatting (loc + msg
  per error) replaces the bare Exception flatten.
- recipe/loader: source field now validated through pydantic instead of
  object.__setattr__, so extra="forbid" is enforced end-to-end.
- datasource/bigquery: storage-API fallback now catches only ImportError +
  GoogleAPICallError; logs `bigquery_storage_fallback`. MemoryError and
  similar propagate.
- datasource/csv: `RECOTEM_MAX_DOWNLOAD_BYTES` now also caps local
  (Path.stat) and object-store (fsspec.info) reads, not just HTTP/HTTPS.
- serving/watcher: three Exception swallow sites narrowed.
  - `_extract_kid_safe`: only catches IndexError/UnicodeDecodeError/ValueError
  - `_stat_marker`: distinguishes FileNotFoundError from genuine I/O
    failures; emits `artifact_stat_failed` + new
    `recotem_artifact_stat_failures_total` metric.
  - `run()`: tracks consecutive errors; after 5, marks every recipe
    "watcher unhealthy" via registry.set_load_error so /health degrades.
    New `recotem_watcher_unhandled_errors_total` metric.
- serving/routes: `RECOTEM_METADATA_FIELD_DENY` is now case-insensitive.
- training/search: per-trial-timeout orphans tracked
  (`SearchResult.orphaned_count`), exposed in artifact header
  `tuning.n_orphaned` and in the canonical `train_done` event.  When more
  than half of completed trials orphaned, raise TrainingError(code=
  "excessive_per_trial_timeouts").
- training/search: unknown algorithm in `per_algorithm_trials` raises
  TrainingError(code="unknown_algorithm_in_budget") instead of being
  silently dropped.
- training/pipeline: non-domain exceptions in `train_error` carry
  code="internal_error" instead of `type(exc).__name__`.
- training/lock: full sha256 hexdigest (64 chars) for remote lock paths.

## CLI exit code map
Added dedicated codes — was 6, now 9:
  6 = LockContestedError, 7 = HttpFetchError, 8 = configuration error
  (missing signing keys, port already in use).  Documented in cli.py
  module docstring and propagated to docs/operations.md and
  docs/deployment/k8s.md.
- `recotem serve` now wraps uvicorn.run, mapping OSError → 8 and
  KeyboardInterrupt → 0 with structured `serve_startup_failed` /
  `serve_terminated` events.
- `recotem inspect` artifact arg loses `exists=True` (allow s3://, gs://).

## Module structure
- `recotem._idmap` (NEW) — neutral home for IDMappedRecommender.  Removes
  the cross-package serving._compat → training._compat import that
  violated CLAUDE.md ("training/ and serving/ never import each other").
  Self-installs the irspack/fastprogress IPython stub so direct importers
  (e.g. serving) work without going through the training package first.
- `recotem.log_redaction` (NEW, top level) — moved from
  `recotem.serving.log_redaction`.  Train-only invocations no longer pull
  in the serving package.
- `recotem.serving._compat` — DELETED (migration is alpha-breaking; no
  backward compat for already-pickled artifacts is in scope at 2.0.0a0).
- artifact/signing FQCN allow-list: replaced
  `recotem.{training,serving}._compat.IDMappedRecommender` with
  `recotem._idmap.IDMappedRecommender`.

## CI / Docker / Helm
- Dockerfile: pin `ghcr.io/astral-sh/uv:0.7` (was `:latest`); HEALTHCHECK
  now probes `/health` via urllib instead of `recotem schema`.
- docker.yml: trivy-action now produces SARIF and uploads via
  `github/codeql-action/upload-sarif` (CVEs surface in the Security tab).
  `security-events: write` permission added to the trivy job.
- publish.yml: top-level `permissions: contents: read`; runner pinned to
  `ubuntu-24.04`; action versions synced with test.yml (checkout@v6,
  setup-uv@v8.1.0, upload/download-artifact@v7).
- nightly.yml (NEW): scheduled @ 06:00 UTC runs `-m slow` and slow e2e on
  main; gated to the upstream repo.
- helm/recotem: `/workspace` emptyDir mount added to match Dockerfile
  WORKDIR under `readOnlyRootFilesystem: true`. startupProbe added for
  cold-start headroom.
- dependabot.yml: switched ecosystem from `pip` to `uv`.
- .dockerignore (NEW): trim build context.
- compose.yaml: comment clarifies train service intentionally omits
  RECOTEM_ALLOWED_HOSTS / API key vars.

## Tests added (38 new)
- HTTP SSRF: DNS rebinding, HTTPS→HTTP scheme-change, redirect loop
  (visited-set), IPv6 link-local + ULA refusal.
- Artifact signing: subprocess test for direct `recotem._idmap` import,
  numpy.lib / numpy.compat pickle-stream rejection (both via find_class
  and via raw pickle opcodes).
- CLI: SystemExit code=None / code=0 / OSError / LockContested /
  HttpFetchError / missing signing keys exit-code mappings.
- Recipe loader: symlink-in-parent rejection, TOKEN/KEY blacklist,
  multi-error pydantic location reporting, source extra=forbid enforcement
  after typed validation.
- Datasource: local CSV/Parquet over byte cap rejected; BigQuery storage
  fallback emits log event; MemoryError propagates instead of falling back.
- Training: orphan trial counter, excessive-orphan TrainingError, unknown
  algorithm in budget, train_done canonical fields, train_error
  internal_error code, full sha256 lock filename.
- Watcher: append_sha pointer hot-swap unit test.
- Routes: case-insensitive metadata deny.

## Docs
- docs/security.md FQCN allow-list rewritten to make HMAC the primary
  gate and the prefix allow-list defence-in-depth (with explicit
  not-RCE-backstop framing); FQCN table updated to recotem._idmap.
- docs/operations.md: new CLI flag reference; exit-code table grown to
  0–8; new metrics rows; train_done now lists n_orphaned.
- docs/deployment/docker.md: env table completed (7 missing rows added);
  `:latest` tag publishing documented (matches docker.yml).
- docs/deployment/k8s.md: NetworkPolicy deny-all behaviour explained;
  exit codes synced.
- docs/data-sources/csv.md: byte-cap behaviour documented for all schemes.
- docs/plugin-authoring.md: corrected the inverted `recotem schema`
  plugin-Config claim (plugin Configs DO appear).
- docs/recipe-reference.md: `--env-var` example added.
- README.md: shell-convenience var renamed `RECOTEM_API_KEY` →
  `RECOTEM_API_PLAINTEXT` to match getting-started.md.
- CLAUDE.md: directory layout reflects `_idmap.py`, `_http_fetch.py`,
  `log_redaction.py` at top level; structlog redaction processor path
  updated; module-isolation note clarified.
- Code comments: removed dangling `(spec Section N)` references in cli,
  config, signing, training/pipeline, serving/{auth,app,routes,watcher}.
@github-advanced-security
Copy link
Copy Markdown

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

marevol added 23 commits May 10, 2026 09:48
Security
- Tighten artifact FQCN module-prefix allow-list to numpy._core./numpy.core./
  numpy.dtypes./scipy.sparse._csr|_csc|_coo. only; reject numpy.frompyfunc,
  numpy.vectorize, scipy.sparse.load_npz, etc.
- Mitigate DNS rebinding: per-request _PinnedHTTPConnection / _PinnedHTTPSConnection
  pin the resolved IP for connect while keeping SNI + cert verification on the
  original hostname. Re-resolve + re-pin on every redirect hop.
- Reject IPv4-mapped IPv6 (::ffff:a.b.c.d) by unwrapping and re-checking as IPv4.
- Gate `recotem inspect --dev-allow-unsigned` behind RECOTEM_ENV=development
  (mirrors train/serve).
- Cap X-API-Key header at 256 chars before scrypt to neutralise KDF DoS amplifier.

Observability / training
- Emit `exit_code` on every train_error event; lift `n_rows`, `n_users`,
  `n_items`, `min_rows`, `min_users`, `min_items` for MinDataViolation.
- Walk __cause__ in _map_exception_to_exit so DataSourceError-wrapped
  HttpFetchError still maps to exit 7 (preserves CronJob retry semantics for
  SSRF guard, network failures, sha256 mismatch on network fetch).
- Validate `per_algorithm_trials` keys against `training.algorithms` at
  recipe load time (was silently ignored).
- Require `source.time_unit` for numeric timestamp columns (was silently
  parsed as nanoseconds).
- Scope ZeroScoreError to non-`hit` metrics; warn on parallelism>1 with
  per_algorithm_trials due to budget race.
- Drop redundant fcntl.LOCK_UN; case-insensitive algorithm alias resolution.

Serving watcher
- Insert stub ModelEntry into registry before first load attempt so initial
  load failures show up in /health.
- Surface artifact disappearance via set_load_error + load-failure metric.

Infra
- publish.yml: refuse workflow_dispatch without a v* tag ref.
- docker.yml: align CodeQL upload-sarif to v4, add .dockerignore to paths,
  add Trivy GHA cache.
- nightly.yml: include tests/fuzz in slow-tests job.
- Helm: add /workspace emptyDir to CronJob (matches Deployment); validate PDB
  minAvailable/maxUnavailable mutual exclusion; allow kubelet probes through
  NetworkPolicy by default.
- examples/k8s: add fsGroup: 1000 + /workspace mounts.

Tests / hygiene
- Add ~99 tests covering all changes plus previously-uncovered gaps:
  HTTP byte-cap streaming, query_parameters env-injection refusal, real
  ArtifactWatcher hot-swap, FQCN allow-list end-to-end, plugin EP load
  failure, JSON log format, SSRF via train CLI, scrypt cost factor pinning,
  artifact length-field fuzz, YAML anchor fuzz.
- conftest._isolate_structlog_state autouse fixture neutralises CLI tests
  that otherwise leave structlog cache_logger_on_first_use=True polluting
  capture_logs() in subsequent tests.

Docs
- operations.md / k8s.md exit-code tables: clarify exit 3 = source-layer
  structural error, exit 7 = any HTTP fetch failure (including sha256
  mismatch on network sources).
- security.md: document DNS rebinding mitigation + IPv4-mapped IPv6 guard.
- cron.md: present secrets-file pattern first; relegate plaintext to a
  DO-NOT example.
- docker.md: fix HEALTHCHECK description (HTTP /health, not `recotem schema`);
  swap curl example for stdlib urllib (slim image has no curl); align
  RECOTEM_WATCH_INTERVAL with compose.yaml/Helm.
- recipe-reference.md: document `time_unit` field; document strict
  `per_algorithm_trials` validation and parallelism>1 budget caveat.
The floating @v8 tag is no longer published since setup-uv v8.0.0
adopted immutable releases, breaking all workflows.
`_is_address_internal` evaluated `addr.is_reserved` and friends on the
raw IPv6 address before consulting `ipv4_mapped`. On CPython 3.12.0–3.12.3
the `::ffff:0:0/96` block was incorrectly listed in IPv6 reserved-networks,
so `IPv6Address('::ffff:8.8.8.8').is_reserved` returned True and the
function early-returned True for legitimately public addresses.

Reorder so `ipv4_mapped` is checked first: when present, the embedded
IPv4 is the authoritative source of truth and the IPv6 properties are
not consulted. Pure IPv6 addresses still hit the original property
check unchanged.

Fixes the test_is_address_internal_unwraps_ipv4_mapped failure in
PR #85 CI on Python 3.12.3.
Source code fixes
- training/lock.py: drop finally unlink (flock-deletion race that allowed two
  trainers to hold the same per-recipe lock); narrow OSError catch to
  EWOULDBLOCK/EACCES/EAGAIN so EBADF/ENOLCK/EIO propagate
- serving/watcher.py: reuse a single ThreadPoolExecutor (was rebuilt every
  poll cycle); shut it down inside run()'s finally so stop() can return
  while a poll is in flight; emit artifact_read_failed structured log;
  fix _poll_artifacts indentation that orphaned the marker handling inside
  the except block; decode header JSON before unpickle for clearer errors
- serving/app.py: wrap startup json.loads(header) so a corrupt header lands
  in /health as loaded=False instead of crashing the server; reorder so
  header decode runs before unpickle
- _idmap.py: pre-check user membership and only convert truly-unknown user
  ids to KeyError; let other RuntimeErrors propagate so 404 means missing
  user, not internal model-state corruption
- _size_cap.py (new): shared helper enforcing RECOTEM_MAX_DOWNLOAD_BYTES
  on local + object-store paths; correctly resolves file://localhost via
  urllib.request.url2pathname
- metadata/loader.py: enforce the cap on local + object-store reads
  (previously HTTP-only); preserve literal "nan" item ids by reading with
  keep_default_na=False and detecting nulls before str-coercion
- datasource/csv.py: route file:// size-cap through the shared helper
- config.py: clamp RECOTEM_MAX_ARTIFACT_BYTES to [1 MiB, 16 GiB]
- cli.py: keygen --type signing emits fingerprint=<8hex> matching the
  /security.posture log line, instead of the misleading hash=sha256:
  (the api keygen path keeps hash=sha256: because that IS the digest stored
  in RECOTEM_API_KEYS); inspect exits non-zero when HMAC is skipped without
  --dev-allow-unsigned, with help text noting why inspect does not require
  --i-understand-this-loads-arbitrary-code (it never unpickles)
- serving/auth.py: enforce 32-char minimum on X-API-Key plaintext at verify
  time so leaked digests of low-entropy operator-set keys are not offline-
  brute-forceable; recotem keygen --type api output (64 hex) is unaffected
- log_redaction.py: glob now covers azure_*, *_token*, *_key* matching the
  recipe/envvars.py blacklist

Infra / CI
- helm: deployment template prepends localhost to RECOTEM_ALLOWED_HOSTS so
  ingress overrides do not 400 the kubelet probes; image tag aligned to
  2.0.0a0 (Chart.yaml uses SemVer 2.0.0-alpha.0); values.yaml gains
  commented-out placeholders for RECOTEM_HTTP_TIMEOUT_SECONDS,
  HTTP_ALLOW_PRIVATE, MAX_DOWNLOAD_BYTES, METRICS_ENABLED
- examples/k8s + docs/deployment: image tag aligned to 2.0.0a0
- .github/workflows/test.yml: tighten secrets-in-logs grep to JSON value
  position so legitimate kid:hex64 mentions (recipe_hash, artifact path)
  do not false-positive
- pyproject.toml: add addopts = ["-m", "not slow"] so slow tests are
  deselected by default; nightly.yml passes --override-ini='addopts=' to
  run the full suite

Documentation
- security.md: clarify metadata-deny is case-INSENSITIVE; add Azure /
  *_TOKEN* / *_KEY* to redaction + env-expansion blacklist sections;
  document URL userinfo redaction at the HTTP-fetcher boundary; document
  the 32-char API key minimum and the keygen output formats
- recipe-reference.md: blacklist patterns aligned with security.md
- operations.md: keygen sample output, RECOTEM_METRICS_ENABLED truthy
  values (1/true/yes/on)
- data-sources/csv.md: clarify validate runs DNS + SSRF guard but issues
  no HTTP request
- examples/quickstart/README.md: rename shell variable to match the rest
  of the docs (RECOTEM_API_PLAINTEXT)
- CLAUDE.md: list quickstart example, document _size_cap helper

Tests
- New: header_json byte tamper end-to-end, KeyRing forward rotation,
  ../-traversal in artifact root, ftp:// output rejection, redaction
  processor first-in-chain runtime check, /predict increments
  predict_total, BigQuery validate probe, inspect tampered HMAC exit,
  --run-id propagation, schema includes all source types, watcher does
  not reload on sha256-unchanged, per-algo zero-budget enqueues no
  trials, CSV source sha256 mismatch, Content-Length cap behavior
- Per-fix tests: lock sentinel persists across acquire/release, OSError
  propagation, watcher executor reuse, watcher read-failure log,
  serving/app json.loads wrap, _idmap RuntimeError propagation,
  metadata size cap + literal "nan" preservation, file://localhost
  cap, MAX_ARTIFACT_BYTES clamp, keygen fingerprint, inspect exit
  code, API key min length, log_redaction azure/token/key

697 passed (1 deselected), ruff lint + format clean.
…ode parity

Implements fixes from a comprehensive 5-agent review (security / architecture /
test coverage / code quality / docs).

Security
- auth.py: equalise scrypt timing on short / oversized X-API-Key paths so the
  length check does not leak a length oracle (calls _hash_api_key against a
  zero-padded dummy before raising).
- auth.py: annotate the scrypt call with `# codeql[py/weak-sensitive-data-hashing]`
  + rationale; the input is a 256-bit random token so the low scrypt cost is
  intentional (raising it would amplify DoS on the unauthenticated endpoint).
- recipe/loader.py: switch _validate_output_path to an allow-list of
  `{"", "file", "s3", "gs", "az", "abfs", "abfss"}`; `data:`, `javascript:`,
  and unknown schemes are now rejected explicitly.
- training/lock.py: silence CodeQL py/world-readable-file false positives —
  the lock file is opened with mode 0o600 (owner-only).

Serving / observability
- routes.py: set X-Request-ID response header; echo client-supplied value
  when present, otherwise generate a UUID4. Closes the spec contract.
- routes.py + metrics.py: split /predict Prometheus status labels into
  `ok / user_not_found / unavailable / error` so 404 / 503 / generic 5xx
  no longer collapse into a single "error" bucket.
- watcher.py: route _loaded_marker mutations through
  ModelRegistry.update_loaded_marker so the write happens under the registry
  lock alongside replace().

Config / payload cap
- config.py: add RECOTEM_MAX_PAYLOAD_BYTES (default 512 MiB, clamped
  [1 MiB, 16 GiB]) as a separate cap for serve-side deserialization.
  Bounds memory expansion when unpickling.
- watcher.py / app.py: parse_header_from_bytes now uses
  ServeConfig.max_payload_bytes for the payload cap; max_artifact_bytes
  remains the outer container cap.

Naming / spec drift
- cli.py + training/search.py + docs/operations.md: rename user-facing
  references from `tuning.storage_path` to `training.storage_path` to match
  the recipe schema field.

Code quality
- training/search.py: add _MAX_LIVE_ORPHANED_THREADS = 32 ceiling for
  per-trial timeout orphans; emit periodic WARNs every 16 orphans, raise
  SearchError when the ceiling is reached.
- cli.py: narrow _configure_logging_from_env's broad except to
  (ImportError, OSError); print other failures to stderr without aborting
  the CLI.

Docs / examples
- docs/operations.md, docs/getting-started.md, examples/csv-local/README.md,
  examples/tutorial-purchase-log/README.md: align append_sha output filenames
  with the implementation (`<base>.<sha>.recotem`, dot separator).
- docs/data-sources/csv.md + recipe-reference.md: list every accepted source
  scheme (`file://`, `http(s)://`, `s3://`, `gs://`, `az://`, `abfs(s)://`).
- docs/security.md: note that fsspec object-store schemes inherit the pod's
  cloud credentials; SSRF guards apply only to HTTP/HTTPS fetch.
- examples/k8s/README.md: spell out the `recotem-recipes` ConfigMap creation
  prerequisite.
- CLAUDE.md: fix RECOTEM_LOG_FORMAT cell pipe escaping; add
  RECOTEM_MAX_PAYLOAD_BYTES row.
- compose.yaml: extend serve healthcheck start_period to 60s.

Tests
- 30+ new tests covering: constant-time auth paths, hmac.compare_digest
  runtime invocation, output path scheme allow-list, YAML alias-bomb
  resistance (fuzz), embedded-credential URL handling in fetch_http_bytes,
  HMAC-failure-prevents-unpickle, corrupt pointer file handling, payload
  cap enforcement, RECOTEM_MAX_PAYLOAD_BYTES clamp, RECOTEM_ENV variants
  (case-insensitive accept / reject), X-Request-ID echo + generation,
  /predict status label coverage, registry update_loaded_marker locking,
  watcher uses the registry setter, --insecure-no-auth HTTP path,
  orphan thread ceiling, training.storage_path naming in CLI / search
  errors, lock sentinel 0o600 regression.

745/745 tests pass; ruff check + format clean.
…vability, test gaps

Resolves review findings from the 5-agent comprehensive review (completeness,
security, tests, code quality, doc parity).  No CRITICAL items; this round is
all MAJOR/MINOR drift, defense in depth, and test gap closure.

Doc parity:
- csv.md: HTTP fetch failures map to exit 7 (HttpFetchError), not 3.
- security.md: urllib default redirect cap is 10 (was incorrectly stated as
  30); /predict 503 only emits recipe_unavailable (recipe_unhealthy was
  documented but never emitted by routes.py).
- recipe-reference.md: output schemes list now includes abfs:// and abfss://
  to match the loader allow-list; replaced the unsubstantiated
  "validated again immediately before any filesystem or URL use" claim with
  the actual mechanism (validate_assignment=True + exported helper).
- operations.md: events table now lists metadata_source_redirect /
  metadata_source_size_exceeded so operators alert on both fetch paths.
- CLAUDE.md: added the full 0–8 CLI exit code table.

Code:
- recipe/models.py: Recipe model_config gains validate_assignment=True so any
  post-construction mutation re-runs _validate_name (closes the doc claim).
- serving/watcher.py: future-raise path now increments
  recotem_artifact_stat_failures_total; startup _stat_marker call now passes
  recipe_name= so the failure log shows the real name instead of "<unknown>".
- training/lock.py: per-process dedup of the recipe_lock_local_only WARN so
  hourly CronJobs writing to s3:// don't spam the log; lock open now sets
  O_NOFOLLOW (defense in depth against symlink swaps between invocations).
- _http_fetch.py: comment on _PinnedHTTPSHandler.https_open now describes
  the actual behavior (always uses system trust store, ignores caller ctx).
- artifact/io.py: _is_local_fs no longer accepts AbstractBufferedFile (a
  file-like base class, not a filesystem) — dead code removal.

Tests (+62 new):
- artifact_format: strengthen truncated_mid_payload (now asserts HMAC verify
  rejects, not silently passes); add header_len=0xFFFFFFFF rejected without
  4 GiB allocation.
- cli: 6 new exit-code tests (exit 0 success, 1 unexpected, 4 trials-failed,
  4 min-data-violation, 0 lock-contention default skip, 6 fail-on-busy).
- training_pipeline: timeout_before_first_trial raises; per-trial timeout
  excludes orphaned trial from best_score.
- training_split: empty test set raises SplitError.
- http_fetch_ssrf: https → file:// redirect rejected.
- datasource_csv: documents that fetch() returns whatever columns the file
  has (column validation is the pipeline's job).
- datasource_bigquery: parameterised queries bind via @param placeholders,
  not string interpolation.
- recipe_loader: env-var expansion is one-pass (self-recursive and mutually
  recursive cases both terminate without looping).
- serving_routes: predict on a removed recipe returns 503 recipe_unavailable.
- recipe_models: name re-validation on assignment (A3).
- serving_watcher: stat-failure metric increment (D1) and startup log carries
  recipe name (D5).
- training_lock: remote-output WARN dedup (D2) and O_NOFOLLOW symlink refusal
  (D7).

Verification: 767 unit + 37 integration/fuzz pass; ruff check + format clean.
…umulative

`run_search` incremented `_orphaned_count` on every per-trial timeout but
never decremented it when the orphaned thread eventually returned from its
learn step.  The ceiling check `_MAX_LIVE_ORPHANED_THREADS = 32` therefore
fired on the cumulative orphan count, contradicting the constant's own
"simultaneously-live" comment and aborting long studies that accrued 32
timeouts spread out over time even when no orphan thread was still alive.

Split the counter into `_orphaned_live` (decremented when the orphaned
`_learn` thread returns; drives the ceiling) and `_orphaned_total` (never
decremented; surfaced as `SearchResult.orphaned_count` and the periodic
`training_orphaned_threads` warning's `total` kwarg).  The `_learn` body
now sits inside a `try/finally` that records completion and decrements
`_orphaned_live` iff the trial was previously marked orphaned.  The
orphan-marking site re-checks `trial_state["completed"]` under
`_orphan_lock`, closing the race where the thread finishes between
`thread.is_alive()` returning True and lock acquisition; in that race the
trial falls through to the normal completion path instead of leaking a
phantom orphan into the counter.

Three new unit tests cover the live-decrement, the concurrent-count
ceiling, and the cumulative `result.orphaned_count` field.  The existing
`test_orphan_thread_periodic_warn` is updated to assert the new
`live`/`total` kwargs.
…t gap closure

Security (recipe/envvars, log_redaction, recipe/loader, _http_fetch, serving):
- Env-var blacklist now matches credential keywords as substrings (covers
  RECOTEM_RECIPE_APIKEY, _PASSWD, _AUTH, _BEARER, _CRED, _PRIVATE which
  the previous *_KEY*-style globs let through). log_redaction key matcher
  rebuilt with the same rule set plus a value-side scrubber that masks
  64-hex and 43-base64url substrings inside non-redacted string values.
- DataSource Protocol gains `no_expand_fields: ClassVar[frozenset[str]]`
  so plugins can opt fields out of env expansion (closes the SQL-injection
  re-entry path for plugin authors who name their template field anything
  other than `query` / `query_parameters`).
- Input path scheme allow-list (`""`, `file`, `s3`, `gs`, `az`, `abfs`,
  `abfss`, `http`, `https`); chained-scheme paths (`simplecache::https://`)
  rejected. `_check_userinfo` scoped to http(s)/ftp(s) so legitimate
  `gs://project@bucket/key` and `s3://account@bucket/key` are accepted.
- Artifact write-time RECOTEM_ARTIFACT_ROOT re-validation closes the
  TOCTOU between recipe load and `os.replace`.
- SSRF guard: bogon ranges (CGNAT 100.64.0.0/10, RFC2544 198.18/15,
  TEST-NET, NAT64 64:ff9b::/96, RFC3849 docs) added; `_PinnedHTTPSHandler`
  now passes the `ssl.create_default_context()` through to the
  `_PinnedHTTPSConnection` instead of discarding it.
- Auth: missing-header branch now invokes the same scrypt warm-up as the
  oversized/short branches, removing a residual length-based timing
  oracle.
- CORS middleware now sets `allow_credentials=False` and includes
  `OPTIONS` in `allow_methods` for clarity.

Implementation gaps (datasource/csv, metadata/loader, training/pipeline,
serving/registry, cli):
- sha256-path reads now `f.read(cap + 1)` and reject when the buffer
  exceeds RECOTEM_MAX_DOWNLOAD_BYTES; previously stat-failure on object
  stores let oversized files load fully into memory before the digest
  check.
- `_compute_recipe_hash` JSON serialization gains `default=str` so
  Decimal / datetime values from BigQuery `query_parameters` no longer
  raise TypeError mid-train.
- `models_dict()` no longer strips `("hmac", "key")` from the header
  (those keys never appear in `header_json`; the strip suggested the
  header could carry key material).
- `recotem inspect` honors `RECOTEM_MAX_PAYLOAD_BYTES` instead of
  hard-coding the 2 GiB ceiling.

Deploy / config / docs (Dockerfile, helm, .github/workflows/publish,
config.py, CLAUDE.md, docs/{security,plugin-authoring,operations}):
- Dockerfile HEALTHCHECK now reads `RECOTEM_PORT` at runtime.
- Helm Deployment gains a `preStop sleep` lifecycle hook (default 5 s)
  with `terminationGracePeriodSeconds` bumped to 40 to give kube-proxy
  time to drop the pod from Endpoints before uvicorn starts refusing.
- `RECOTEM_DRAIN_SECONDS` now clamped via `_clamped_int_env` (1..300)
  with a structured warning on clamp; `RECOTEM_PORT` gains 1..65535
  range validation; `RECOTEM_LOG_FORMAT` now raises ValueError on
  unknown values rather than silently falling back to `auto`.
- PyPI publish workflow opts into Sigstore attestations
  (`attestations: write` + `generate-attestations: true`).
- Helm cronjob `recipeFiles` accepts a list (range-quoted) with a
  legacy comma-string fallback (deprecated path).
- docs/security.md FQCN deny-list aligned with `signing.py` (removed
  scipy.sparse._compressed/_data_matrix and other false claims).
- docs/plugin-authoring.md `probe()` typed `-> None` per Protocol.
- CLAUDE.md Quick Start signing-key format aligned with keygen output;
  RECOTEM_ENV row split between `--insecure-no-auth` (development/dev/
  test) and `--dev-allow-unsigned` (development only).
- docs/operations.md key rotation merged Verify into Step 4 to match
  the four-step claim.

Test coverage (+22 new behavioural tests, 870 -> 892):
- atomic write tmp cleanup on OSError / KeyboardInterrupt / SystemExit;
  object-store pointer-file write order; YAML !!python/object/apply
  rejection; verify_sha256 mismatch / pass; watcher concurrent-stat
  bound at 16; watcher add+remove same tick; run_search all-zero
  budget; BigQuery TimeoutError / DeadlineExceeded wrapping; lock
  release on SystemExit; userinfo accepted on object-store schemes
  and rejected on http(s)/ftp(s).
…nforcement + doc/code parity

Three issues from automated PR review (confidence ≥ 80):

1. routes.py: remove `from __future__ import annotations` (CLAUDE.md
   prohibits it for FastAPI dependency-introspection patterns like
   `kid: str = Depends(_require_auth)`). Promote `ApiKeyEntry` from
   TYPE_CHECKING to runtime import since `make_router` needs it eagerly.
   Add regression tests asserting (a) source has no future-annotations
   import and (b) `kid` parameters resolve to the real `str` type with
   `fastapi.params.Depends` defaults.

2. DataSource plugin contract: enforce `no_expand_fields: ClassVar[
   frozenset[str]]` in `validate_plugin_contract`. Previously the
   Protocol declared the attribute as required but the validator never
   checked it and the loader silently fell back to `frozenset()` —
   plugins that forgot the attribute got zero env-var-expansion
   protection on custom fields. Now declare it explicitly on all
   builtins (CSVSource/ParquetSource = frozenset(); BigQuerySource =
   {"query","query_parameters"} for defence-in-depth) and on the
   echo-source example. Document as required in plugin-authoring.md.
   Add 22 unit tests covering missing/wrong-type/valid declarations
   and builtin attribute conformance.

3. _http_fetch.py: fix misleading docstring on `assert_host_public`
   and comment on `_NO_REDIRECT_OPENER`. Both falsely claimed numeric-
   IP hosts cause a None return / use the no-redirect opener; reality
   is the IP is validated and pinned via `_build_pinned_opener`, same
   as a DNS-resolved host. Code behaviour unchanged. Add 3 regression
   tests pinning the actual numeric-IP path so future refactors cannot
   silently weaken the SSRF guard.

Verified: 921 tests pass, 0 regressions; ruff check + format clean.
…vability, defense in depth

Multi-agent audit (security / completeness / silent-failure / test-coverage / general)
surfaced 2 CRITICAL + 15 MAJOR + ~10 MINOR findings. This commit addresses them.

CRITICAL
- C-1: serve-side configuration errors now exit 8 (_EXIT_CONFIG) consistently.
  Introduce ``recotem.config.ConfigError`` and route three previously-misclassified
  paths through it: missing RECOTEM_SIGNING_KEYS (was exit 5), malformed
  RECOTEM_API_KEYS / RECOTEM_PORT (was exit 2), --insecure-no-auth in non-dev env
  (was exit 1). _map_exception_to_exit gains the new branch ahead of ArtifactError.
- C-2: serving/routes.py _lookup_metadata catches the documented error set
  (KeyError, AttributeError, TypeError, ValueError), guards non-string column
  names, logs a structured metadata_lookup_failed event and increments the new
  recotem_metadata_lookup_errors_total counter. Docstring no longer lies.

MAJOR
- M-1: _http_fetch redirect cap now exact (>=, not >). Doc said 5 — code allowed 6.
- M-2: serving/watcher.py recipe-rescan failure on an existing recipe no longer
  removes the loaded model from the registry. The entry's last_load_error is set
  (visible via /health) and recotem_recipe_rescan_errors_total is incremented.
  "Stale-but-loaded keeps serving" availability contract upheld.
- M-3: cli.py SystemExit handling preserves canonical exit codes {0,2..8} from
  library callers; only unknown codes normalise to _EXIT_UNKNOWN.
- M-4: IPython stub installation centralised in recotem._ipython_stub.install(),
  consumed by both _idmap (serving) and training/_compat. Both sys.modules keys
  are checked independently — partial real-IPython installs no longer skip the
  .display stub.
- M-5: ServeConfig.from_env raises ConfigError if RECOTEM_MAX_PAYLOAD_BYTES >
  RECOTEM_MAX_ARTIFACT_BYTES. Invariant from CLAUDE.md is now enforced.
- M-6: artifact/signing.py FQCN deny-list precedence — deny check runs before
  exact-match allow-list, so future allow-list additions cannot accidentally
  re-permit denied submodules.
- M-8: recipe/loader.py output-path containment uses resolved_parent / name
  instead of non-strict terminal resolve(). Comment matches the policy.
- M-9: serving/watcher.py _stat_marker propagates error class via a new
  _stat_marker_with_error companion; /health surfaces "stat failed: <class>"
  distinguishing missing-file from IAM/network failure.
- M-11/M-12 (test additions): object-store ETag swap and recotem-schema plugin
  discovery now exercised.
- M-14: serving/routes.py {name} path param constrained to ^[A-Za-z0-9_-]{1,64}$
  via Annotated[str, Path(pattern=...)]. Bad names return 422 before any handler
  logic, eliminating log-injection / cache-poisoning surface.
- M-15: datasource/bigquery.py Storage API fallback log includes the IAM
  permission name. New RECOTEM_BQ_REQUIRE_STORAGE_API env var disables fallback
  for operators who want to fail fast on permission misconfig.

MINOR
- m-1: serving/auth.py constant-time scrypt dummy uses length-matched input on
  the present-but-rejected paths (closes residual SHA-256 timing leak).
- m-3: _http_fetch builds a fresh no-redirect opener per fetch (no shared mutable
  global state on the allow_private path).
- m-4: _http_fetch.py adds a dedicated socket.timeout / TimeoutError branch with
  the configured timeout in the message; generic catch includes type(exc).__name__.
- m-5: _http_fetch DNS-resolution failure now logged at DEBUG.
- m-6: serving/app.py OpenAPI version reads recotem.version.__version__.
- m-7: ServeConfig.recipes_dir is a real dataclass field (no attribute injection).
- m-8: cli.py hoists stdlib re/uuid/structlog to module top.
- m-9: training/lock.py reads RECOTEM_LOCK_DIR via config.get_lock_dir().
- m-10: recipe/loader.py narrows env-expansion source-class lookup catch —
  DataSourceError silent-fallback retained, ImportError / others surface as
  RecipeError so a plugin loading bug cannot silently degrade no_expand_fields
  protection.
- m-12: ServeConfig.from_env rejects duplicate kids in RECOTEM_API_KEYS
  (mirrors KeyRing).
- m-13: CLAUDE.md documents RECOTEM_DRAIN_SECONDS [1, 300] clamp and adds
  RECOTEM_BQ_REQUIRE_STORAGE_API.
- m-14: docs/operations.md adds trained_at to train_error event row.
- M-10: docs/operations.md metric inventory aligned with the 10 currently-
  registered series (8 prior + 2 new in this round).

Tests
- 964 → 1010+ tests (≈+46 across all agents). uv run pytest tests / ruff check /
  ruff format --check all green. New tests cover every behavioural change above
  including regressions (recipe-deletion still removes entries, KeyError still
  yields {}, train signing-key path still exits 8, etc.).

Deferred to a follow-up PR (non-blocking, structural rather than behavioural)
- M-7: extract _load_signed_artifact helper shared by app.py and watcher.py.
- m-2: auth empty-keys timing equalisation (mitigated by 127.0.0.1 force-bind).
- m-15: lru_cache cache-clear public API for source registry (test ergonomics).
- m-16: ThreadPoolExecutor lifecycle hardening (works under intended use).
- m-17: orphan-thread early-warning at 50% of ceiling (current ceiling triggers).
… observability

- B-1: filter best_params to __init__-accepted keys before _train_final so
  TPESampler/learnt_config overlays no longer raise TypeError after a
  successful 100% search (artifact never written → exit 4 with no
  actionable message); also map ValueError to TrainingError
- B-2: /health returns 503 when degraded so K8s readiness probes mark the
  Pod NotReady — returning 200 let rolling upgrades silently swap in a
  Pod whose every /predict returned 503
- B-4: _check_dev_env and --dev-allow-unsigned flag-pair guards now exit
  with EXIT_CONFIG (8), matching the documented exit-code table and
  sibling "signing keys missing" guard so CronJob retry logic can branch
  on a single value
- B-5: inspect maps fsspec backend exceptions (botocore.NoCredentialsError,
  gcsfs.HttpError, etc. — not OSError subclasses) to EXIT_ARTIFACT (5)
  instead of letting them surface as exit 1 via typer's default handler
- B-7: secret-example.yaml signing-key block now matches keygen output
  (fingerprint=<8 hex>, not hash=sha256:<64 hex>) — operators following
  the example would have put the wrong value in RECOTEM_SIGNING_KEYS

- M-1: expose watcher's read_artifact_bytes / stat_marker / sha256_bytes /
  load_metadata under public names with __all__ so serving/app.py no
  longer imports leading-underscore symbols across the module boundary
- M-5: re-raise MemoryError / RecursionError before broad except in
  watcher._load_recipe, _http_fetch.fetch_http_bytes, and search._learn
  so a long-running thread does not silently retry through OOM
- M-9: forward configurable default_class through make_trial_callback so
  early-trial trial_done events surface a real candidate name instead of
  the literal "unknown" sentinel that pollutes SIEM aggregations

Tests: +15 cases across 5 files (979 pass, 1 deselected) covering
header_json HMAC tamper detection, public-API rebinding, MemoryError
propagation, exit-code mapping, and best_params filter behavior.
…bservability hardening

MAJOR
- M-2: recipe_hash now normalizes pathlib.Path via .as_posix() so hashes
  computed on Windows match POSIX (closes hash-mismatch on cross-platform
  recipe sharing).
- M-4: BigQuery storage-API fallback emits Prometheus counter
  recotem_bigquery_storage_fallback_total{reason} from a neutral top-level
  module (no serving dep). Documented in operations runbook.
- M-6: recipes_dir scan failures emit recotem_recipes_dir_scan_failures_total
  {error_class} so silent per-recipe load failures are observable.
- M-8: LockTimeoutError(LockContestedError) carries waited_seconds and is
  raised with a recipe_lock_timeout warning instead of plain
  LockContestedError; existing exit-code mapping continues to apply via
  isinstance check.

MEDIUM
- S-2: Document that RECOTEM_MAX_DOWNLOAD_BYTES caps raw I/O only and does
  NOT bound the decompressed DataFrame; add operator-mitigation table
  (cgroup MemoryMax / RLIMIT_AS / k8s memory limits).
- S-3: Extend artifact unpickler deny-list to numpy.random.* and
  numpy._core._exceptions.* — neither is needed by irspack's pickle path
  and both expand attack surface through the broad numpy._core.* prefix.
- S-4: Watcher kid log fields routed through _format_kid_for_log (truncate
  to 64 chars, strip non-printable bytes) to block log-DoS / terminal-escape
  injection from malicious artifact headers.

PERFORMANCE
- P-1: Pre-flatten item metadata into dict[str, dict] at load time keyed
  by item_id; deny-list filter applied once; NaN→None for JSON safety.
  /predict swaps DataFrame iteration for O(1) dict.get(). Existing
  metadata_df retained as fallback for legacy stub paths.
- P-2: /predict response models constructed via model_construct (skip
  validation) where inputs are already validated upstream.
- P-3: Startup loads artifacts in ThreadPoolExecutor; max_workers defaults
  to min(len(recipes), 8) and is overridable via
  RECOTEM_STARTUP_PARALLELISM (clamped [1, 32]). Per-recipe failures don't
  block the pool. Emits startup_artifact_load_complete with counts and
  wall_seconds.
- P-4: Watcher checks <output_path>.sha256 sidecar (when present) before
  performing a full artifact re-read; logs pointer_unchanged_skip_read on
  the fast path. Falls back transparently to existing stat marker when no
  sidecar exists.

TESTS
- T-3 / T-4: New test files for evaluator metric mapping and ProgressReporter
  event emission (40+ tests across both).
- T-5: Lifespan drain config tests — default 30s, clamped [1, 300], invalid
  fallback, and shutdown event emission verified.
- Per-fix regression tests added across artifact_signing, datasource_bigquery,
  metadata_loader, serving_app, serving_registry, serving_routes,
  serving_watcher, training_lock, training_pipeline.

Suite: 1059 passed, 1 deselected (slow). Ruff clean.
…ity, watcher non-blocking, OOM let-propagate

CRITICAL
- C-1: _try_load_artifact now builds metadata_index alongside metadata_df,
  matching ArtifactWatcher._build_entry. Without this, item_metadata recipes
  loaded at startup hit the fallback DataFrame .loc path on every /predict
  until the first hot-swap — a silent regression of the round-11 hot-path
  optimisation.
- C-2: recotem inspect read cap is now RECOTEM_MAX_ARTIFACT_BYTES (file
  read) and parse cap is RECOTEM_MAX_PAYLOAD_BYTES (parse_header),
  matching the serving-watcher protocol. Previously inspect used the
  payload cap (default 512 MiB) for both, refusing 512 MiB–2 GiB artifacts
  the serve path accepts.

MAJOR
- M-1: _scan_recipes_dir no longer calls _load_recipe(force=True)
  synchronously on new YAML discovery. The stub entry registered with
  last_marker=None is picked up by the next _poll_artifacts tick via the
  marker-change path, so a slow object-store load can no longer stall the
  watcher loop for every other recipe.
- M-2: recotem schema failure routes through _map_exception_to_exit;
  recotem keygen --type bogus now exits 8 (_EXIT_CONFIG), not 1.
- M-3: inspect except blocks use _map_exception_to_exit and let
  MemoryError/RecursionError propagate, so a malformed
  RECOTEM_SIGNING_KEYS surfaces as exit 5 (ArtifactError) distinctly from
  OOM (which is no longer hidden as exit 5).
- M-4: validate X-Request-ID against ^[A-Za-z0-9_-]{1,64}$, replacing
  invalid values with a uuid4 to prevent log-injection / log-volume
  amplification via unbounded header echo.
- M-5: _scan_recipes_dir iterdir() failure now logs error_class and
  increments recotem_recipes_dir_scan_failures_total{error_class=dir_iter_*}
  instead of warn-and-continue with no metric signal.
- M-6: Watcher carries a _yaml_path_to_name map so a YAML whose declared
  recipe.name differs from its file stem still receives availability
  protection during transient rescan parse errors.
- M-7: doc/code drift — RECOTEM_ENV value range corrected in
  security.md/docker.md, RECOTEM_MAX_PAYLOAD_BYTES added to ops/docker
  env tables with the payload<=artifact invariant, n_orphaned mentioned
  in recipe-reference; CI now runs tests/fuzz so hypothesis-based
  property tests gate merges.
- M-8: (MemoryError, RecursionError): raise across CSV/BigQuery/Parquet
  fetch, recipe loader, metadata loader, training pipeline, artifact
  unpickle, watcher poll, and startup load. OOM is no longer hidden as
  DataSourceError / RecipeError / ArtifactError; operator host metrics
  see a real OOM signal.

OBSERVABILITY / MINOR
- OBS-1: artifact_stat_failed WARN downgrades to DEBUG on repeat of the
  same error_class per recipe; Prometheus counter still rises so alerts
  fire correctly.
- OBS-3: new src/recotem/_log_safe.py exposes format_kid_for_log
  (\xHH-escape + 64-char-truncate). Adopted by both artifact.signing and
  serving.watcher so kid log sanitisation is consistent across the
  module boundary.
- OBS-4: _lookup_metadata short-circuits via meta_df.index pre-check; the
  retained KeyError branch is now reachable only for genuinely corrupt
  index types (debug-logged).
- MIN-1: _BLACKLIST_EXACT uses plural KEYS/API_KEYS (matches real env
  var names).
- MIN-2: config.is_truthy_env public; datasource/bigquery and
  serving/metrics drop their local copies.
- MIN-3..6: dead TYPE_CHECKING block, dead has_env_var_references, dead
  lazy imports inside pipeline.py exception path, and a no-op try/except
  in BigQuery query_parameters assignment removed.

TESTS
- +40 cases (1070 → 1110). New: startup metadata_index (N-1), watcher
  non-blocking discovery (N-3), keygen/schema exit codes (N-4), inspect
  malformed signing keys (N-5), X-Request-ID validation parametrized
  (N-6), iterdir scan failure metric (N-7), yaml_path_to_name rescan
  safety (N-8), MemoryError propagation in csv/recipe/metadata loaders
  (N-9), artifact_stat_failed DEBUG demotion (N-10), _log_safe module
  contract (N-11, 14 cases), metadata index pre-check (N-12),
  is_truthy_env parametrized (N-13, 16 cases), RECOTEM_DRAIN_SECONDS
  clamp boundaries (N-14, 8 cases), startup parallelism real concurrency
  (N-15, 2 cases).
- Existing tests updated for new contracts: inspect cap source,
  MemoryError propagation, kid log format (\xHH escapes, "..." suffix),
  metadata lookup pre-check.
The actual env var (and exact-match blacklist entry in
src/recotem/recipe/envvars.py) is plural. recipe-reference.md and
security.md referred to the singular form; align with the implementation.
…h budget

Six bundles of fixes addressing the multi-agent comprehensive review of
the recotem 2.0 clean rewrite. All 1153 tests pass; ruff check + format
clean.

Lock + CLI:
- training/lock.py: Windows path was permanently stuck after first run
  (O_CREAT|O_EXCL on a persistent sentinel). Switch to msvcrt.locking
  on a persistent sentinel; return fd to caller for release-on-close.
- training/lock.py: wrap POSIX os.open in try so EACCES/ELOOP/ENOSPC
  map to the right error class (LockContestedError vs propagate).
  ELOOP emits recipe_lock_unsafe_symlink warning.
- cli.py: inspect now reports a scheme-specific install hint
  (recotem[gcs/s3/azure]) when an optional fsspec backend is missing.

Auth + log redaction:
- serving/auth.py: equalise scrypt input length across all three
  rejection branches (missing/oversized/short) using a canonical
  _API_KEY_MAX_LEN null-byte dummy — closes the residual timing
  oracle between rejection branches.
- serving/auth.py: kid loop now compares against every configured
  entry (no short-circuit on first match) so kid ordering does not
  leak via response latency.
- log_redaction.py: hex64 / base64url43 value scrubbers switched from
  \\b word boundaries to hex/base64 char lookaround (\\{N,\\}) to
  catch concatenated and URL-embedded key runs. Natural-language
  false positives (\"monkey\", \"turkey\", \"signing_kids\") guarded
  via _KEY_BENIGN_EXACT_NAMES allowlist.

Serving (app/routes/watcher/registry):
- serving/app.py: insecure/dev-unsigned banners now emit only from
  the lifespan task (not also synchronously in create_app) — removes
  the startup double-emit. Shutdown awaits banner_task with
  CancelledError suppression — silences the asyncio Task-destroyed
  warning.
- serving/routes.py: predict's finally uses unbind_contextvars for
  the three keys it bound (recipe, request_id, kid) instead of
  clear_contextvars, preserving any upstream middleware bindings.
- serving/watcher.py: sidecar fast-path now always assigns
  state.last_marker = marker before continue, so eventual-consistent
  object stores cannot pin the watcher to a stale marker.
- serving/registry.py + watcher.py: new replace_with_marker takes
  the registry lock once and assigns entry + _loaded_marker
  atomically — closes the brief stale-marker window between the old
  replace() + update_loaded_marker() pair.

Recipe / config:
- recipe/loader.py: _NO_EXPAND_KEYS membership test now case-folds
  both sides, so a plugin declaring no_expand_fields={\"SQL\"} blocks
  expansion on a YAML key 'sql:' (and vice versa).
- recipe/loader.py: new load_recipes_directory_lenient continues
  past per-file load failures and returns
  list[tuple[Path, Recipe|None, Exception|None]]; existing strict
  loader unchanged.

Training search:
- training/search.py: replaced the O(N²) per-objective study.trials
  scan with O(1) in-memory counters protected by a threading.Lock.
  The check-and-reserve is now atomic, so the per-algorithm budget
  is exactly enforced under n_jobs>1 (no overshoot). Removed the
  now-obsolete per_algorithm_trials_budget_race warning.

Docs / spec drift:
- docs/getting-started.md: keygen --type signing output corrected to
  fingerprint=<8-char hex> (was incorrectly documented as hash=sha256:...).
- docs/security.md / docs/recipe-reference.md: env-expansion blacklist
  rewritten as a complete table — adds PASSWD/AUTH/BEARER/CRED/PRIVATE
  substrings and AZURE_ prefix to match envvars.py.
- docs/security.md: log redaction patterns expanded to include
  *passwd*/*auth*/*bearer*/*cred*/*private* with benign-allowlist note.
- docs/security.md: FQCN deny-list adds numpy.random and
  numpy._core._exceptions to match signing.py.
- examples/plugins/echo-source/README.md + pyproject.toml: clarify
  that entry-point key is a registry/log identifier; the recipe
  discriminator comes from EchoSource.type_name (resolves the
  contradiction with docs/plugin-authoring.md).
- docs/deployment/k8s.md: fix s3 lock path example (\$RECOTEM_LOCK_DIR
  or <tempdir>/recotem-locks/<sha256>.lock) and the
  recipe_lock_local_only WARN-first / DEBUG-thereafter logging cadence.
- docs/deployment/cron.md: extend exit-code switch with cases 6/7/8.

Test coverage additions:
- Windows lock acquired/released/contention; EACCES/ENOSPC/ELOOP
  branches.
- CLI SystemExit and ImportError paths.
- Auth canonical-dummy-length across rejection branches; kid full-scan.
- 12 new log_redaction tests for key/hex64/base64url coverage.
- predict preserves upstream contextvars under a real middleware.
- Banner emits exactly once; banner_task cancelled cleanly.
- Sidecar fast-path updates last_marker.
- replace_with_marker atomicity.
- Hot-swap of a corrupt artifact preserves the stale entry on a
  real ModelRegistry (not Mock).
- Watcher reads artifact bytes exactly once per ETag change cycle.
- Key rotation step 4: artifact signed with a now-retired kid is
  rejected with ArtifactError.
- _write_atomic non-local (object-store) round-trip via memory:// FS.
- BigQuerySource.probe() success + two failure paths.
- Per-algorithm budget exactly enforced under n_jobs=4 (no overshoot).
- load_recipes_directory_lenient per-file error continuation.

Removed test_training_search_parallel_warning.py: the race-window
warning was emitted to acknowledge the previous O(N²) overshoot; the
new atomic-counter implementation eliminates the race itself, so the
warning is correctly no longer emitted. The equivalent positive
assertion lives in test_training_search.py::
test_per_algorithm_budget_not_exceeded_with_parallel_jobs.

Skip decisions (with rationale):
- M5 (nested path validation): no real exploit in current built-ins;
  plugin contract docs already place validation responsibility on
  plugin authors.
- M14 (single-member discriminated union): verified Pydantic v2.13.4
  correctly rejects unknown discriminator values for single-member
  unions; no bug to fix.
- M17 (startup parallelism sentinel): _clamped_int_env clamps user
  input \"0\" to 1, so the sentinel 0 is unambiguously \"auto\".
- M18 (SystemExit propagation): pre-existing handler already only
  honours documented _EXIT_* codes; added regression tests to lock
  this contract in.
…ot path

Critical fixes
- cli.py: recotem inspect accepted Path which collapsed s3:// to s3:/ on
  POSIX; switch to str and add _repair_uri() so remote URIs reach fsspec
  intact. Align missing-signing-keys exit code to _EXIT_CONFIG=8 per
  CLAUDE.md table. Stop swallowing MemoryError/RecursionError in serve.
- _size_cap.py: bare except Exception silently disabled the documented
  download cap on object-store probe failures; narrow to FileNotFoundError
  / PermissionError and surface unexpected failures as SizeCapProbeError.
- _http_fetch.py: redirect Location with embedded userinfo bypassed the
  initial-URL guard and would have leaked Authorization: Basic to the next
  hop; reject after urljoin.
- serving/watcher.py:
  - _extract_kid_safe returned a string sentinel that could collide with
    a real kid; return (kid, reason) tuple with \x00-prefixed unparseable
    fallback that KeyRing cannot match.
  - build_initial_states stored None for both file-missing and stat-failed
    cases; on a transient OSError this froze hot-swap forever. Add a
    distinct sentinel so the next tick always re-evaluates.
  - ThreadPoolExecutor workers blocked SIGTERM through a stuck object-
    store stat; stop() now shutdown(wait=False, cancel_futures=True).
  - _poll_artifacts had no per-future timeout or stop_event check; one
    hung stat blocked the whole tick. Switch to wait(FIRST_COMPLETED)
    with per-future timeout = min(watch_interval, 30s).

Observability + correctness
- training/pipeline.py: train_error event now uses name=/kid= matching
  train_done, restoring SIEM join keys.
- recipe/errors.py + loader.py: RecipeError gains category (security /
  parse / schema / io); lenient loader logs security violations at ERROR
  with a distinct event name. Plugin discovery DataSourceError now
  re-raises instead of silently weakening env-expansion protection.
- serving/watcher.py: single-recipe sidecar failure no longer trips the
  watcher-global _consecutive_errors counter; iterdir() failure marks
  every entry's last_load_error immediately instead of waiting 5 ticks.

Performance hot path
- serving/registry.py: maintain O(1) _loaded_count under the lock instead
  of O(N) walks on every swap; snapshot entries inside the lock and build
  health_dict() outside so /health probes don't block /predict. Cache
  models_dict view per entry. RLock -> Lock (no reentrancy).
- serving/routes.py: bypass pydantic re-serialization on /predict by
  returning JSONResponse; defend against metadata columns named item_id
  or score shadowing recommender values.
- metadata/loader.py: build_metadata_index now uses df.to_dict(orient=
  "index") (~100x faster than iterrows); new MetadataError(cause=...)
  preserves HttpFetchError as __cause__ for CLI exit-code mapping.
- serving/watcher.py: _scan_recipes_dir mtime-caches parsed Recipe
  objects; build_initial_states pre-populates last_sidecar_contents.

Defence in depth
- artifact/io.py: _write_atomic only swallows EINVAL/ENOTSUP from fsync;
  ENOSPC/EIO now propagate.
- datasource/csv.py: validate required user/item/time columns at fetch
  time so missing columns map to _EXIT_DATASOURCE=3 (not _EXIT_UNKNOWN).
- serving/auth.py: anonymous-bypass DEBUG per request + INFO per first
  client_host (LRU bounded to 1024) when RECOTEM_API_KEYS is empty.
- log_redaction.py: bytes/bytearray values are now redacted (was
  str-only) so b"<32 raw bytes>" with key-shaped hex can no longer leak.
- recipe/models.py: Recipe.source: Any now validated via model_validator
  so direct construction (not just load_recipe) cannot accept unknown
  source types.

Refactor
- New _exit_codes.py module shared by cli.py and pipeline.py so the
  exception -> exit code mapping has a single source of truth and
  first-party imports fail loud instead of silently degrading codes.

Documentation
- CLAUDE.md / docs/operations.md / docs/recipe-reference.md /
  docs/security.md / docs/plugin-authoring.md updated to match
  implementation: explicit input/output scheme allow-lists,
  RECOTEM_MAX_ARTIFACT_BYTES clamp range, STARTUP_PARALLELISM sentinel
  semantics, recotem inspect URI support, new structured-log events,
  watcher per-future timeout, MetadataError, anonymous-bypass audit log.

Tests
- 1269 tests pass (was ~1153 before this round). New coverage for every
  fix above plus regression suites for KeyRing rotation gaps, watcher
  noisy-neighbor isolation, registry concurrency, JSONResponse wire
  format, and bytes-value log redaction.
- pyproject.toml filterwarnings ignores pytest.PytestUnraisableException
  Warning to absorb pytest-httpserver socket cleanup on CPython 3.12.
…guards

CRITICAL
- artifact/signing.py: surface ImportError as "required module unavailable"
  so operators distinguish "FQCN allow-listed but dep not installed" from
  the generic deserialization-failed message.
- datasource/bigquery.py: branch Storage API fallback on IAM-shape
  (PermissionDenied / Forbidden, by class name AND message marker).  Non-
  IAM failures (quota 429, 5xx, RetryError) now raise DataSourceError
  instead of retrying via REST and double-billing the same constraint.
- _size_cap.py: treat NoCredentialsError / DefaultCredentialsError and
  other credential-shaped exceptions from fsspec as PermissionError
  equivalents (skip silently — the real read surfaces the error).  Fixes
  the CI failure on test_object_store_uncappable_returns_silently while
  keeping SizeCapProbeError behaviour for genuine backend errors.

MAJOR
- training/pipeline.py: add recipe_hash, n_rows, n_users, n_items to the
  canonical train_done / train_error events so SIEM rules can correlate
  artifact identity and training-set size without joining against the
  artifact header.  Internal-error path now attaches exc_info=True for
  stacktrace grouping in Sentry / DataDog.
- serving/auth.py: retain FIRST matching kid for audit attribution (the
  ConfigError still rejects duplicate hashes today, but the first-match
  policy avoids misattribution if that invariant is ever relaxed).
- serving/watcher.py: evict _yaml_mtime_cache and _yaml_path_to_name
  entries whose Path is no longer in the scan.  Closes a slow leak under
  ConfigMap symlink-swap rotation (`..data` → `..2026_xxx_data`) where
  each rescan produced a fresh Path object but recipe.name stayed.
- metadata/loader.py: replace ValueError with MetadataError(cause=...) at
  5 IO/parse sites.  MetadataError now inherits from ValueError so legacy
  `except ValueError` callers keep working.
- artifact/signing.py: KeyRing refuses kids that look like raw hex key
  material (32+ hex chars) — guards against the foot-gun of pasting key
  bytes into the kid field, where the redaction processor would not scrub
  them from structured `kid=...` log fields.
- recipe/envvars.py: extend cloud-provider blacklist prefixes to include
  ALIYUN_, ALICLOUD_, OCI_, IBM_, DO_, HCLOUD_, DIGITALOCEAN_ so
  identifier-shaped env vars (e.g. OCI_TENANCY_OCID) cannot leak via
  recipe expansion.

OBSERVABILITY
- training/lock.py: log warning on Windows msvcrt unlock failure (errno
  + message) so operators can correlate a stuck train with the unlock
  failure rather than chase phantom contention.
- serving/watcher.py: warn when set_load_error returns False (entry not
  in registry) — should be unreachable but logging it surfaces ordering
  bugs in future refactors.

SKIPPED FROM REVIEW (verified intentional)
- C1 (/metrics no auth): docs/operations.md:381 explicitly designs this
  as "unauthenticated by design — the same posture Prometheus and
  Kubernetes liveness/readiness probes expect".
- L3 (optuna.TrialPruned): existing flow at search.py:416-417 re-raises
  TrialPruned correctly via exc_holder.

TESTS
- Adds 32 new test cases across 7 files covering each fix above.
- Total: 1296 passed, 1 deselected (slow marker).
- ruff lint + format clean.
- tests/unit/test_training_pipeline.py: mark
  test_train_done_event_includes_recipe_hash_and_data_stats as
  @pytest.mark.slow. The test requests movielens_small_df which
  transitively triggers the session-scoped movielens_df fixture, and
  MovieLens100KDataManager() calls input() at construction. Under
  CI's captured stdin this raised OSError before the test body ran.
  Other movielens_small_df callers in this file are already gated by
  @pytest.mark.slow; the round-15 addition was the only one missing
  it.

- pyproject.toml / uv.lock: add urllib3>=2.7.0 security floor and
  bump the lock from 2.6.3 to 2.7.0. Trivy newly flagged
  CVE-2026-44431 and CVE-2026-44432 (both HIGH, fixed in 2.7.0)
  against urllib3 2.6.3 in the runtime image. The trivy step uses
  severity=CRITICAL,HIGH + ignore-unfixed=true, so any fix-available
  HIGH fails the build. Pin style follows the existing starlette
  security floor on line 27.
… doc parity

Serving (availability):
- wire load_recipes_directory_lenient so a single broken YAML no longer aborts startup
- watcher auto-recovers "watcher unhealthy" sentinel after consecutive errors drop
- gate /docs and /openapi.json by RECOTEM_ENV={production,prod,staging}
- /metrics hidden from OpenAPI schema (include_in_schema=False)
- security.posture always emitted via try/finally with signing_key_status field
- predict 503 paths emit recipe_unavailable reason (no_entry/not_loaded/recommender_none)
- predict uncaught exceptions surface predict_handler_unexpected_error log
- artifact_post_hmac_deserialize_failed + repeated_post_hmac_failure streak event

Core hardening:
- Recipe.source strict validation (non-dict/non-BaseModel raises ValidationError)
- apply_auth_posture emits host_forced_to_loopback warning when overriding host
- unpickle_payload splits exception handling — dep-mismatch errors propagate
- KeyRing emits signing_keyring_built / signing_keyring_invalid audit events
- BigQuery storage fallback records iam_detected_via (http_403/class/mro/message)
- --lock-timeout CLI flag wired through pipeline; LockTimeoutError now reachable
- ProgressReporter emits tuning_aborted on error exit (separate from tuning_complete)
- _size_cap credential-skip path emits info log with redacted URL userinfo
- recipe envvars emits recipe_env_var_empty warning on empty expansion
- log_redaction safety net keeps event when redaction internals fail
- watcher sidecar OSError observed (DEBUG for ENOENT, WARN otherwise)
- _lookup_metadata KeyError path elevated to WARN + counter increment

Docs:
- operations.md: env var reference table; --lock-timeout flag; corrected /health curl
- recipe-reference.md: KEY substring (not _KEY_); output.path white-list; GCP_ note
- security.md: signing_key_status field; docs gating note; env-expansion clarification
- README.md: quickstart syncs on /health before first curl
- deployment/k8s.md: Service port unified to 8080
- CLAUDE.md: cli.py deferred-import exception documented
- docs/adr/0001-artifact-format-versioning.md (new)
- docs/adr/0002-module-boundary.md (new)

Tests:
- 60+ new cases across serving/recipe/artifact/config/cli/training/bigquery/size_cap/
  log_redaction/training_progress; one new integration scenario for broken-YAML
  startup resilience; new tests/unit/test_recipe_envvars.py file.

CI: ruff/format clean; 1348 pytest pass (1328 unit + 20 integration); no regression.
ADR files added in the previous commit (0001 artifact format versioning,
0002 module boundary) duplicated content already in docs/operations.md
and docs/security.md, and introduced a new convention the project does
not otherwise use. The substantive guidance remains in those docs and
in CLAUDE.md; only the ADR pages and their inbound links are removed.
… OOM propagation

Apply 30+ fixes derived from a 6-agent parallel review (code-reviewer /
silent-failure-hunter / pr-test-analyzer / security-engineer /
performance-engineer / doc-parity). Each fix has corresponding tests added.

CRITICAL
- C-1: close the non-sha256 MAX_DOWNLOAD_BYTES bypass — tighten _size_cap
  credential-skip to exact-match (Author/Authorize false positives removed)
  and switch csv/parquet/metadata loaders' non-sha256 paths to a uniform
  fh.read(cap+1) + BytesIO pattern (was the only direct DoS surface left).
- C-2: guard Optuna SQLite + parallelism>1 (autocommit + thread-shared
  connection corrupts the study DB) — auto-downgrade to parallelism=1
  with an env_var_clamped warning; in-memory storage unaffected.
- C-3: stop wrapping MemoryError as ArtifactError in watcher
  _read_artifact_bytes; the watcher elsewhere already enforces strict
  OOM-propagation. Mirror the same fix in training/split.py (I-14) and
  training/pipeline.py time_column parse (I-13).

IMPORTANT — security / availability
- I-3: split /health (aggregate {status,total,loaded} only) from
  /health/details (per-recipe info, auth-gated). Stops anonymous kid /
  best_class / trained_at leakage. Breaking for direct /health parsers;
  k8s probes (status-code only) unaffected.
- I-4: flip OpenAPI /docs to fail-secure — only enabled when
  RECOTEM_ENV in {development,dev,test}; unset env now disables.
- I-8: /health watcher-unhealthy sentinel now clears on unloaded stubs
  too (previously stuck at 503 forever after watcher recovery).
- I-9: rescan-time YAML parse failures now insert a ModelEntry stub
  (loaded=False, last_load_error=…) and seed _yaml_path_to_name so they
  surface on /health/details; startup-time stubs likewise pre-seed the
  watcher path-to-name map.
- I-10: sidecar read OSError now returns True (reload) for non-ENOENT;
  ENOENT keeps returning False. Failed reload surfaces via /health.
- I-11: per-request metadata-lookup failures now set
  X-Recotem-Metadata-Degraded: 1 (was silent empty-dict before).
- I-16: reject embedded userinfo in s3://, abfs://, abfss:// URIs.
  gs://project@bucket/ stays allowed (documented GCS convention).

IMPORTANT — exit-code / observability
- I-1: add app-level Exception handler returning structured
  {code: internal_error}. HTTPException keeps FastAPI default behavior.
- I-2: trial_learn_failed warning now emits class_name / error_class /
  error (failed-trial diagnostics finally reach SIEMs).
- I-5: drop the defensive except Exception in _http_fetch — let
  AttributeError/TypeError/ImportError propagate (was wrapped as
  HttpFetchError → exit 7).
- I-6: inspect JSONDecodeError now maps to exit 5 (ARTIFACT) instead of
  exit 1 (UNKNOWN); doc/operations.md was already documenting exit 5.
- I-7: serve OSError → discriminate by errno. Only
  EADDRINUSE/EACCES/EADDRNOTAVAIL → exit 8 (CONFIG); ENOMEM/EMFILE etc.
  → exit 1 (UNKNOWN) via _map_exception_to_exit.
- I-12: pipeline _fetch_data discriminator-missing now raises RecipeError
  (exit 2) instead of TrainingError. Unexpected fetch errors get a
  datasource_unexpected_error log line before wrapping.
- I-15: align format.DEFAULT_MAX_PAYLOAD_BYTES to 512 MiB (was 2 GiB,
  contradicting config.py + docs); docstring + tests updated.
- I-18: training/lock.py Windows path discriminates OSError errno —
  only EACCES/EAGAIN are contested, others warn + re-raise.
- I-19: recipe models source-registry validator now emits
  source_registry_unavailable_during_validation warning instead of
  swallowing import failures.
- I-20: metadata loader SizeCapExceededError/SizeCapProbeError now
  raise MetadataError(cause="io") (was raw ValueError → exit 1).
- I-21: --lock-timeout rejects negative values other than -1.0 at CLI
  and via lock.py defensive assert (-10 used to mean "indefinite wait"
  silently).
- I-22: BigQuery ImportError path now honors RECOTEM_BQ_REQUIRE_STORAGE_API
  and raises DataSourceError instead of silently falling back to REST
  (avoids surprise billing).

Docs / consistency
- I-23: bigquery.md fallback-policy section rewritten — IAM-shape only,
  with explicit RECOTEM_BQ_REQUIRE_STORAGE_API description.
- I-24: csv.md replace "Any fsspec-supported scheme" with explicit
  allow-list to match recipe-reference.md and the implementation.
- I-25: recipe-reference.md + loader.py docstring now document the
  strict (train) vs lenient (serve, skip + recipe_duplicate_name_skipped
  WARN) duplicate-name behavior.
- I-26: README quickstart curl drops the trailing slash on /health
  (route registered without trailing slash, redirect_slashes loops
  without -L).
- I-27: re-export validate_for_filesystem from recotem.recipe so
  docs/recipe-reference.md's import example works.

Test additions
- +3000 LOC across ~30 test files. New tests cover every modified
  branch: SQLite-downgrade, fh.read(cap+1) overruns, MemoryError
  propagation in watcher/split/pipeline, /health vs /health/details
  contract, OpenAPI fail-secure, errno-discriminated serve exit
  codes, AttributeError propagation through _http_fetch, lock-timeout
  CLI validation, Windows lock errno discrimination, plugin-import
  warning, BigQuery REQUIRE_STORAGE_API + ImportError, embedded-cred
  rejection across s3/abfs/abfss, 512 MiB default cap, validate_for_filesystem
  re-export, metadata degraded header, watcher sentinel stub clear,
  rescan stub insertion + recovery, sidecar non-ENOENT reload path.

Verification
- uv run pytest tests : 1420 passed, 2 deselected (slow)
- uv run ruff check src tests : clean
- uv run ruff format --check src tests : clean

Scoping notes
- I-17 (pointer-file chunked read) dropped: fsspec S3 backend already
  honors fh.read(N) via HTTP Range, so the theoretical OOM does not
  materialize.
- CR-2 (SIGTERM during startup) judged non-blocking: startup load runs
  before uvicorn installs its signal handler, so Python's default
  SIGTERM behavior already terminates immediately.
- Performance items (PERF-CR-1..4, PERF-IM-*) and remaining SUGGESTIONS
  deferred to a follow-up; this commit is scoped to correctness +
  observability fixes that should ship with the rewrite.
test_read_artifact_bytes_lets_memory_error_propagate and
test_read_artifact_bytes_wraps_os_error_in_artifact_error mutated
fs.open on the fsspec-cached LocalFileSystem singleton without
restoring it. After either test ran, every fsspec.open() in the
same process returned a _GeneratorContextManager, crashing the
integration roundtrip tests with
"'_GeneratorContextManager' object has no attribute 'close'".

Wrap each test body in try/finally that does del fs.open on the
captured filesystem instance(s).
- operations.md: signing-key rotation step 4 was pointing at /health, whose
  response is `{status, total, loaded}` only — the `jq '.recipes...'`
  pipeline always returned empty. Switch to /health/details with X-API-Key.
- deployment/docker.md: HEALTHCHECK description now reflects the Dockerfile
  (RECOTEM_PORT-aware, urlopen timeout=3) and notes the Compose probe's
  timeout=5.
- deployment/k8s.md: clarify that the Helm CronJob template does not set
  backoffLimit (operator must add it); only activeDeadlineSeconds: 3600 is
  bundled.
- deployment/k8s.md: note that the bundled Helm chart's
  terminationGracePeriodSeconds default is 40 (5 preStop + 30 drain + 5),
  not the 35 used in the manual example.
- examples/quickstart/README.md: replace misleading <signing-plaintext>
  placeholder with <signing-hex64> — signing keys are raw hex, not a
  plaintext-vs-hash pair.
- recipe-reference.md: separate the two reasons output.path rejects extra
  schemes (http/https/ftp/ftps not supported for writes; memory:// is
  process-local and non-persistent).
@marevol marevol self-assigned this May 14, 2026
@marevol marevol merged commit 8395a81 into main May 14, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants