docs(architecture): fix health states, complete IPC protocol, document missing features

michaeldwan · michaeldwan · commit df1fd7d43a94 · 2026-03-20T14:28:14.000-06:00
Fix health state machine to start at UNKNOWN (not STARTING) and add
UNHEALTHY state for user-defined healthchecks. Complete IPC protocol
tables with all message variants (~8 were missing). Document four
previously undocumented features: input spilling, file outputs, custom
metrics, user-defined healthchecks. Expand env var table and Go package
listing. Document hidden CLI commands. Fix crates/README.md duplicate line.

Update skill to prefer important packages over exhaustive listings.
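
Reviewer note: the three metric modes documented in the new Custom
Metrics section (`replace`, `increment`, `append`) can be sketched as
below. This is a minimal illustration, not coglet's implementation:
`apply_metric` and the plain-dict metrics store are hypothetical, and the
behavior for a metric that has not been set yet (start at 0, start with an
empty list) is an assumption.

```python
from typing import Any


def apply_metric(metrics: dict[str, Any], name: str, value: Any, mode: str) -> None:
    """Fold one Metric { name, value, mode } message into a metrics dict.

    Hypothetical helper for illustration only.
    """
    if mode == "replace":
        # Overwrite any existing value.
        metrics[name] = value
    elif mode == "increment":
        # Add to the current value. Assumption: an unset metric starts at 0.
        metrics[name] = metrics.get(name, 0) + value
    elif mode == "append":
        # Append to a list. Assumption: the first append creates the list.
        metrics.setdefault(name, []).append(value)
    else:
        raise ValueError(f"unknown metric mode: {mode!r}")


# Example: messages arriving on one slot during a single prediction.
metrics: dict[str, Any] = {}
apply_metric(metrics, "tokens", 10, "increment")
apply_metric(metrics, "tokens", 5, "increment")
apply_metric(metrics, "stage", "decode", "replace")
apply_metric(metrics, "chunks", "a", "append")
apply_metric(metrics, "chunks", "b", "append")
print(metrics)
```

Under these assumptions the final dict holds `tokens` as 15, `stage` as
`"decode"`, and `chunks` as `["a", "b"]`.
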
diff --git a/architecture/03-prediction-api.md b/architecture/03-prediction-api.md
@@ -137,11 +137,13 @@ The `/health-check` endpoint always returns HTTP 200 with the status in the JSON
 
 | State | JSON `status` | Condition |
 |-------|---------------|-----------|
-| `STARTING` | `"STARTING"` | Worker subprocess initializing |
+| `UNKNOWN` | `"UNKNOWN"` | Process just started, not yet serving |
+| `STARTING` | `"STARTING"` | Worker subprocess initializing, running setup() |
 | `READY` | `"READY"` | Worker ready, slots available |
 | `BUSY` | `"BUSY"` | All slots occupied (backpressure) |
 | `SETUP_FAILED` | `"SETUP_FAILED"` | `setup()` threw exception |
 | `DEFUNCT` | `"DEFUNCT"` | Fatal error, worker crashed |
+| `UNHEALTHY` | `"UNHEALTHY"` | User-defined healthcheck failed (transient) |
 
 When all concurrency slots are occupied, new predictions receive `409 Conflict` instead of queuing. Clients should implement retry with backoff.
 
diff --git a/architecture/04-container-runtime.md b/architecture/04-container-runtime.md
@@ -112,33 +112,66 @@ The `Prediction` struct is itself a state machine -- its mutation methods (`set_
 
 ## Worker Subprocess Protocol
 
-Communication between the Rust server and Python worker uses two channels:
+Communication between the Rust server and Python worker uses two channels. All messages are JSON, one per line.
 
-### Control Channel (stdin/stdout -- JSON framed)
+### Control Channel (stdin/stdout)
 
-| Parent → Child | Child → Parent |
-|----------------|----------------|
-| `Init { predictor_ref, num_slots, ... }` | `Ready { slots, schema }` |
-| `Cancel { slot }` | `Log { source, data }` |
-| `Shutdown` | `Idle { slot }` |
-| | `Failed { slot, error }` |
-| | `ShuttingDown` |
+Lifecycle messages for the worker as a whole.
 
-### Slot Channel (Unix socket per slot -- JSON framed)
+**Parent → Worker:**
 
-| Parent → Child | Child → Parent |
-|----------------|----------------|
-| `Predict { id, input }` | `Log { data }` |
-| | `Output { value }` (streaming) |
-| | `Done { output }` |
-| | `Failed { error }` |
-| | `Cancelled` |
+| Message | Purpose |
+|---------|---------|
+| `Init { predictor_ref, num_slots, is_train, is_async, ... }` | Bootstrap worker -- load predictor, create slots |
+| `Cancel { slot }` | Cancel a running prediction on a slot |
+| `Healthcheck { id }` | Request a user-defined healthcheck |
+| `Shutdown` | Graceful shutdown |
+
+**Worker → Parent:**
+
+| Message | Purpose |
+|---------|---------|
+| `Ready { slots, schema }` | Worker initialized, here are the slot IDs and OpenAPI schema |
+| `Log { source, data }` | Setup-time log line (stdout or stderr) |
+| `WorkerLog { target, level, message }` | Structured log from the worker runtime itself (not user code) |
+| `Idle { slot }` | Slot finished a prediction and is available |
+| `Cancelled { slot }` | Prediction on slot was cancelled |
+| `Failed { slot, error }` | Prediction on slot failed |
+| `Fatal { reason }` | Unrecoverable error -- worker is shutting down |
+| `DroppedLogs { count, interval_millis }` | Worker dropped log messages due to backpressure |
+| `HealthcheckResult { id, status, error }` | Result of a user-defined healthcheck |
+| `ShuttingDown` | Worker is shutting down |
+
+### Slot Channel (Unix socket per slot)
+
+Per-prediction data. Using separate sockets per slot avoids head-of-line blocking between concurrent predictions.
+
+**Parent → Worker:**
+
+| Message | Purpose |
+|---------|---------|
+| `Predict { id, input, input_file, output_dir }` | Run a prediction. `input` is inline JSON; for large payloads (>6MiB) it's `null` and `input_file` points to a spill file on disk |
+
+**Worker → Parent:**
+
+| Message | Purpose |
+|---------|---------|
+| `Log { source, data }` | Log line from predict() |
+| `Output { output }` | Yielded output value (for generators/streaming) |
+| `FileOutput { filename, kind, mime_type }` | File produced by predict() -- referenced by path, uploaded by parent |
+| `Metric { name, value, mode }` | Custom metric (mode: `replace`, `increment`, or `append`) |
+| `Done { id, output, predict_time, is_stream }` | Prediction completed successfully |
+| `Failed { id, error }` | Prediction failed |
+| `Cancelled { id }` | Prediction was cancelled |
 
 ## Health State Machine
 
 ```mermaid
 stateDiagram-v2
-    [*] --> STARTING: Container start
+    [*] --> UNKNOWN: Process starts
+    note right of UNKNOWN: Predictions return 503
+    
+    UNKNOWN --> STARTING: serve() called
     note right of STARTING: Predictions return 503
     
     STARTING --> READY: setup() succeeds
@@ -157,6 +190,8 @@ stateDiagram-v2
     DEFUNCT --> [*]
 ```
 
+There's a distinction between internal health state (`Health` enum) and what the HTTP response returns (`HealthResponse`). The HTTP response adds one extra state: `UNHEALTHY`, which is transient -- it's returned when a user-defined healthcheck fails but doesn't change the internal health state. See [User-Defined Healthchecks](#user-defined-healthchecks) below.
+
 ## Prediction Flow
 
 ### Sync Request (POST /predictions)
@@ -302,13 +337,42 @@ How coglet gets invoked when running a Cog container:
 - **Observable**: Easy to monitor slot usage
 - **Simple**: No async complexity in worker subprocess
 
+## Input Spilling
+
+When a prediction input exceeds 6MiB, it's too large to send inline through the IPC socket. Instead, the parent writes it to a temporary file and sends the file path in `input_file` (with `input` set to null). The worker reads the file, deletes it, and proceeds normally. This is transparent to the predictor code.
+
+## File Outputs
+
+When predict() produces file outputs (`cog.Path`), the worker sends a `FileOutput` message with the filename and MIME type. The parent handles uploading the file (or base64-encoding it for inline responses). The `output_dir` field in the `Predict` request tells the worker where to write output files.
+
+`FileOutputKind` distinguishes between normal file outputs (`FileType`) and oversized outputs (`Oversized`) that exceeded an inline size limit.
+
+## Custom Metrics
+
+Models can record custom metrics via `self.record_metric(name, value, mode)` in their predict method. These are sent as `Metric` messages on the slot channel. The `mode` controls how metrics aggregate:
+
+- `replace` -- overwrite any existing value
+- `increment` -- add to the current value (numeric)
+- `append` -- append to a list
+
+Metrics appear in the prediction response's `metrics` object alongside the built-in `predict_time`.
+
+## User-Defined Healthchecks
+
+Models can implement a custom healthcheck that runs alongside the built-in health state machine. The parent sends `Healthcheck { id }` on the control channel; the worker runs the user's healthcheck and responds with `HealthcheckResult { id, status, error }`.
+
+If the healthcheck fails, the HTTP `/health-check` endpoint returns `UNHEALTHY` -- but this is transient and doesn't change the internal `Health` state. The model stays `READY` and continues accepting predictions.
+
 ## Environment Variables
 
 | Variable | Default | Purpose |
 |----------|---------|---------|
 | `PORT` | 5000 | HTTP server port |
-| `COG_LOG_LEVEL` | INFO | Logging verbosity |
+| `COG_LOG_LEVEL` | INFO | Logging verbosity (ignored if `RUST_LOG` is set) |
 | `COG_MAX_CONCURRENCY` | 1 | Number of concurrent prediction slots |
+| `COG_SETUP_TIMEOUT` | none | Setup timeout in seconds (0 is ignored) |
+| `COG_THROTTLE_RESPONSE_INTERVAL` | 0.5s | Webhook response throttling interval |
+| `LOG_FORMAT` | json | Set to `console` for human-readable log output |
 
 ## Where to Look
 
diff --git a/architecture/06-cli.md b/architecture/06-cli.md
@@ -169,6 +169,18 @@ Stores credentials for `cog push`.
 
 **Code**: `pkg/cli/login.go`
 
+---
+
+### Hidden / Internal Commands
+
+These commands exist but are hidden from `cog --help`:
+
+- **`cog debug`** -- Generates the Dockerfile from cog.yaml without building (useful for debugging build issues)
+- **`cog inspect`** -- Inspects model images and OCI indices
+- **`cog weights`** -- Parent command for `weights build`, `weights push`, `weights inspect`
+
+There's also a separate `base-image` binary (`cmd/base-image/`) with subcommands for managing Cog base images (`dockerfile`, `build`, `generate-matrix`). This isn't a `cog` subcommand.
+
 ## How CLI Commands Interact with Containers
 
 Commands like `predict`, `train`, and `serve` follow the same pattern: build an image, start a container, communicate via HTTP. The CLI never runs model code directly.
@@ -221,21 +233,31 @@ pkg/cli/
 └── init.go         # cog init
 ```
 
-Commands delegate to packages:
-- `pkg/image/` - Image building
-- `pkg/dockerfile/` - Dockerfile generation
-- `pkg/docker/` - Docker client operations
-- `pkg/config/` - cog.yaml parsing
-- `pkg/web/` - Replicate API client
-- `pkg/predict/` - Local prediction execution
-
-## Code References
-
-| File | Purpose |
-|------|---------|
-| `pkg/cli/root.go` | Command registration |
-| `pkg/cli/build.go` | Build command |
-| `pkg/cli/predict.go` | Predict command, input parsing |
-| `pkg/cli/push.go` | Push command |
-| `pkg/image/build.go` | Build orchestration |
-| `pkg/predict/predictor.go` | Local prediction client |
+Commands delegate to packages under `pkg/`:
+
+**Core:**
+- `pkg/cli/` -- Cobra command definitions
+- `pkg/config/` -- cog.yaml parsing and validation, compatibility matrices
+- `pkg/image/` -- Build orchestration (ties together config, Dockerfile generation, schema gen)
+- `pkg/dockerfile/` -- Dockerfile generation and base image selection
+- `pkg/docker/` -- Docker client operations
+- `pkg/predict/` -- Local prediction execution (talks to container's HTTP API)
+- `pkg/schema/` -- Static schema generator (tree-sitter, experimental)
+- `pkg/wheels/` -- SDK and coglet wheel resolution
+
+**Infrastructure:**
+- `pkg/web/` -- Replicate API client (for `cog push`)
+- `pkg/http/` -- Authenticated HTTP transport
+- `pkg/registry/` -- OCI/Docker registry client
+- `pkg/model/` -- OCI artifact domain model
+- `pkg/weights/` -- Weight file discovery and checksums
+- `pkg/errors/` -- `CodedError` for user-facing errors with error codes
+
+**Utilities:**
+- `pkg/dockercontext/` -- Docker build context directory management
+- `pkg/dockerignore/` -- `.dockerignore` parsing
+- `pkg/requirements/` -- `requirements.txt` parsing
+- `pkg/env/` -- `R8_*` environment variable config
+- `pkg/update/` -- CLI version update checker
+- `pkg/global/` -- Build-time metadata, process-wide config
+- `pkg/provider/` -- Abstracts registry-specific behavior for push workflows
diff --git a/crates/README.md b/crates/README.md
@@ -161,8 +161,11 @@ crates/
 │       ├── service.rs      # PredictionService
 │       ├── webhook.rs      # WebhookSender (retry, trace context)
 │       ├── version.rs      # Version info
-│       ├── webhook.rs      # Webhook sender
 │       ├── orchestrator.rs # Worker lifecycle, event loop (parent)
+│       ├── fd_redirect.rs  # File descriptor redirection
+│       ├── input_validation.rs # Input validation against schema
+│       ├── setup_log_accumulator.rs # Accumulates logs during setup()
+│       ├── worker_tracing_layer.rs  # Tracing layer for worker process
 │       ├── worker.rs       # Worker event loop (child)
 │       ├── bridge/
 │       │   ├── mod.rs
@@ -191,7 +194,9 @@ crates/
         ├── output.rs       # Output serialization
         ├── log_writer.rs   # SlotLogWriter, ContextVar routing
         ├── audit.rs        # Audit hook, TeeWriter
-        └── cancel.rs       # Cancellation support
+        ├── cancel.rs       # Cancellation support
+        ├── metric_scope.rs # Scope and MetricRecorder for record_metric()
+        └── bin/stub_gen.rs # Type stub generator
 ```
 
 ## Bridge Protocol
diff --git a/skills/updating-architecture-docs/SKILL.md b/skills/updating-architecture-docs/SKILL.md
@@ -40,6 +40,8 @@ The first gives you the mental model. The second just restates the code. A reade
 
 Reference source locations at the **package/directory level** with a description of what that package owns. Specific file paths and line numbers rot as code moves around. A pointer like "`crates/coglet/src/bridge/` -- IPC protocol and transport" stays accurate through refactors. "`bridge/protocol.rs:69` -- ControlRequest enum" doesn't.
 
+Only document packages that matter for understanding the system's shape. Generic utility packages (`pkg/util/`, `pkg/path/`, etc.) don't need a mention -- their existence is obvious and they don't help a reader build a mental model. If someone needs them, they'll find them.
+
 When a specific file reference is genuinely useful (a key entry point, a non-obvious starting point for understanding a subsystem), include it -- but prefer "the `PredictionService` in `service.rs`" over a line number.
 
 ### Document boundaries, not internals