Commit df1fd7d

docs(architecture): fix health states, complete IPC protocol, document missing features
Fix health state machine to start at UNKNOWN (not STARTING), add UNHEALTHY state for user-defined healthchecks. Complete IPC protocol tables with all message variants (~8 were missing). Document four previously undocumented features: input spilling, file outputs, custom metrics, user-defined healthchecks. Expand env var table and Go package listing. Add hidden CLI commands. Fix crates/README.md duplicate line. Update skill to prefer important packages over exhaustive listings.
1 parent 60f4a70 commit df1fd7d

5 files changed

+135
-40
lines changed

architecture/03-prediction-api.md

Lines changed: 3 additions & 1 deletion
@@ -137,11 +137,13 @@ The `/health-check` endpoint always returns HTTP 200 with the status in the JSON
137137

138138
| State | JSON `status` | Condition |
139139
|-------|---------------|-----------|
140-
| `STARTING` | `"STARTING"` | Worker subprocess initializing |
140+
| `UNKNOWN` | `"UNKNOWN"` | Process just started, not yet serving |
141+
| `STARTING` | `"STARTING"` | Worker subprocess initializing, running setup() |
141142
| `READY` | `"READY"` | Worker ready, slots available |
142143
| `BUSY` | `"BUSY"` | All slots occupied (backpressure) |
143144
| `SETUP_FAILED` | `"SETUP_FAILED"` | `setup()` threw an exception |
144145
| `DEFUNCT` | `"DEFUNCT"` | Fatal error, worker crashed |
146+
| `UNHEALTHY` | `"UNHEALTHY"` | User-defined healthcheck failed (transient) |
145147

146148
When all concurrency slots are occupied, new predictions receive `409 Conflict` instead of queuing. Clients should implement retry with backoff.
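
The retry advice above can be sketched on the client side as follows. This is a minimal illustration, not part of Cog: `post_with_backoff`, the `send` callable, and the delay values are all hypothetical, and `send` stands in for whatever HTTP client issues the `POST /predictions` request.

```python
import random
import time
from typing import Callable, Tuple


def post_with_backoff(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Call send() until it returns a status other than 409 or attempts run out.

    send() is a hypothetical callable returning (http_status, json_body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 409:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
    # All attempts saw 409; hand the last response back to the caller.
    return status, body
```

Jitter spreads retries out so that several clients rejected at the same moment do not all return at once.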
147149

architecture/04-container-runtime.md

Lines changed: 83 additions & 19 deletions
@@ -112,33 +112,66 @@ The `Prediction` struct is itself a state machine -- its mutation methods (`set_
112112

113113
## Worker Subprocess Protocol
114114

115-
Communication between the Rust server and Python worker uses two channels:
115+
Communication between the Rust server and Python worker uses two channels. All messages are JSON, one per line.
116116

117-
### Control Channel (stdin/stdout -- JSON framed)
117+
### Control Channel (stdin/stdout)
118118

119-
| Parent → Child | Child → Parent |
120-
|----------------|----------------|
121-
| `Init { predictor_ref, num_slots, ... }` | `Ready { slots, schema }` |
122-
| `Cancel { slot }` | `Log { source, data }` |
123-
| `Shutdown` | `Idle { slot }` |
124-
| | `Failed { slot, error }` |
125-
| | `ShuttingDown` |
119+
Lifecycle messages for the worker as a whole.
126120

127-
### Slot Channel (Unix socket per slot -- JSON framed)
121+
**Parent → Worker:**
128122

129-
| Parent → Child | Child → Parent |
130-
|----------------|----------------|
131-
| `Predict { id, input }` | `Log { data }` |
132-
| | `Output { value }` (streaming) |
133-
| | `Done { output }` |
134-
| | `Failed { error }` |
135-
| | `Cancelled` |
123+
| Message | Purpose |
124+
|---------|---------|
125+
| `Init { predictor_ref, num_slots, is_train, is_async, ... }` | Bootstrap worker -- load predictor, create slots |
126+
| `Cancel { slot }` | Cancel a running prediction on a slot |
127+
| `Healthcheck { id }` | Request a user-defined healthcheck |
128+
| `Shutdown` | Graceful shutdown |
129+
130+
**Worker → Parent:**
131+
132+
| Message | Purpose |
133+
|---------|---------|
134+
| `Ready { slots, schema }` | Worker initialized, here are the slot IDs and OpenAPI schema |
135+
| `Log { source, data }` | Setup-time log line (stdout or stderr) |
136+
| `WorkerLog { target, level, message }` | Structured log from the worker runtime itself (not user code) |
137+
| `Idle { slot }` | Slot finished a prediction and is available |
138+
| `Cancelled { slot }` | Prediction on slot was cancelled |
139+
| `Failed { slot, error }` | Prediction on slot failed |
140+
| `Fatal { reason }` | Unrecoverable error -- worker is shutting down |
141+
| `DroppedLogs { count, interval_millis }` | Worker dropped log messages due to backpressure |
142+
| `HealthcheckResult { id, status, error }` | Result of a user-defined healthcheck |
143+
| `ShuttingDown` | Worker is shutting down |
144+
145+
### Slot Channel (Unix socket per slot)
146+
147+
Per-prediction data. Using separate sockets per slot avoids head-of-line blocking between concurrent predictions.
148+
149+
**Parent → Worker:**
150+
151+
| Message | Purpose |
152+
|---------|---------|
153+
| `Predict { id, input, input_file, output_dir }` | Run a prediction. `input` is inline JSON; for large payloads (>6MiB) it's `null` and `input_file` points to a spill file on disk |
154+
155+
**Worker → Parent:**
156+
157+
| Message | Purpose |
158+
|---------|---------|
159+
| `Log { source, data }` | Log line from predict() |
160+
| `Output { output }` | Yielded output value (for generators/streaming) |
161+
| `FileOutput { filename, kind, mime_type }` | File produced by predict() -- referenced by path, uploaded by parent |
162+
| `Metric { name, value, mode }` | Custom metric (mode: `replace`, `increment`, or `append`) |
163+
| `Done { id, output, predict_time, is_stream }` | Prediction completed successfully |
164+
| `Failed { id, error }` | Prediction failed |
165+
| `Cancelled { id }` | Prediction was cancelled |
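
To make the "JSON, one per line" framing concrete, here is a small sketch of writing and reading such messages. It is only an illustration: the real implementation lives in the Rust server and Python worker, and the `"type"` tag key used to name the message variant is an assumption, since the tables above do not specify the wire-level field layout.

```python
import io
import json
from typing import Iterator


def write_message(stream, kind: str, **fields) -> None:
    """Serialize one message as a single JSON line.

    The "type" key is a hypothetical tag; the actual tagging scheme may differ.
    """
    # json.dumps escapes newlines inside strings, so each message stays on one line.
    stream.write(json.dumps({"type": kind, **fields}, separators=(",", ":")) + "\n")
    stream.flush()


def read_messages(stream) -> Iterator[dict]:
    """Yield one decoded message per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Usage with an in-memory stream standing in for a slot socket.
buf = io.StringIO()
write_message(buf, "Predict", id="p1", input={"prompt": "a\nb"},
              input_file=None, output_dir="/tmp/out")
write_message(buf, "Done", id="p1", output="ok", predict_time=0.1, is_stream=False)
buf.seek(0)
messages = list(read_messages(buf))
```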
136166

137167
## Health State Machine
138168

139169
```mermaid
140170
stateDiagram-v2
141-
[*] --> STARTING: Container start
171+
[*] --> UNKNOWN: Process starts
172+
note right of UNKNOWN: Predictions return 503
173+
174+
UNKNOWN --> STARTING: serve() called
142175
note right of STARTING: Predictions return 503
143176
144177
STARTING --> READY: setup() succeeds
@@ -157,6 +190,8 @@ stateDiagram-v2
157190
DEFUNCT --> [*]
158191
```
159192

193+
The internal health state (`Health` enum) is distinct from what the HTTP response reports (`HealthResponse`). The HTTP response adds one extra state, `UNHEALTHY`, which is transient: it is returned when a user-defined healthcheck fails, but it does not change the internal health state. See [User-Defined Healthchecks](#user-defined-healthchecks) below.
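
The relationship between the two can be sketched as a pure function from internal state to reported status. This is a Python illustration of the idea only; the actual `Health` and `HealthResponse` types are in the Rust server. The function name and the choice to consult the user healthcheck only while `READY` are assumptions.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    """Internal states, mirroring the table in the prediction API doc."""
    UNKNOWN = "UNKNOWN"
    STARTING = "STARTING"
    READY = "READY"
    BUSY = "BUSY"
    SETUP_FAILED = "SETUP_FAILED"
    DEFUNCT = "DEFUNCT"


def health_response_status(health: Health, user_check_ok: Optional[bool]) -> str:
    """Derive the JSON `status` without mutating the internal state.

    user_check_ok is None when no user-defined healthcheck exists or none ran.
    Assumption: a failed user check only overrides the READY state.
    """
    if health is Health.READY and user_check_ok is False:
        return "UNHEALTHY"
    return health.value
```

Because `UNHEALTHY` is computed at response time rather than stored, the next passing healthcheck reports `READY` again without any state transition.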
194+
160195
## Prediction Flow
161196

162197
### Sync Request (POST /predictions)
@@ -302,13 +337,42 @@ How coglet gets invoked when running a Cog container:
302337
- **Observable**: Easy to monitor slot usage
303338
- **Simple**: No async complexity in worker subprocess
304339

340+
## Input Spilling
341+
342+
When a prediction input exceeds 6MiB, it's too large to send inline through the IPC socket. Instead, the parent writes it to a temporary file and sends the file path in `input_file` (with `input` set to null). The worker reads the file, deletes it, and proceeds normally. This is transparent to the predictor code.
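
A rough sketch of both sides of the spill path is shown below. It is illustrative only: the parent is actually written in Rust, the function names and `spill_dir` parameter are hypothetical, and measuring the threshold against the JSON-encoded size is an assumption.

```python
import json
import os
import tempfile
from typing import Optional, Tuple

SPILL_THRESHOLD = 6 * 1024 * 1024  # 6 MiB, per the paragraph above


def prepare_input(
    payload: dict, spill_dir: str, threshold: int = SPILL_THRESHOLD
) -> Tuple[Optional[dict], Optional[str]]:
    """Parent side: return (input, input_file) for the Predict message."""
    encoded = json.dumps(payload).encode("utf-8")
    if len(encoded) <= threshold:
        return payload, None
    # Too large to send inline: write a spill file and send only its path.
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".json")
    with os.fdopen(fd, "wb") as f:
        f.write(encoded)
    return None, path


def load_input(inline: Optional[dict], input_file: Optional[str]) -> dict:
    """Worker side: recover the input, deleting the spill file after reading."""
    if input_file is None:
        if inline is None:
            raise ValueError("Predict message carried neither input nor input_file")
        return inline
    with open(input_file, "rb") as f:
        payload = json.loads(f.read())
    os.remove(input_file)
    return payload
```

Either way, the predictor receives the same dictionary, which is what keeps spilling invisible to model code.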
343+
344+
## File Outputs
345+
346+
When predict() produces file outputs (`cog.Path`), the worker sends a `FileOutput` message with the filename and MIME type. The parent handles uploading the file (or base64-encoding it for inline responses). The `output_dir` field in the `Predict` request tells the worker where to write output files.
347+
348+
`FileOutputKind` distinguishes between normal file outputs (`FileType`) and oversized outputs (`Oversized`) that exceeded an inline size limit.
349+
350+
## Custom Metrics
351+
352+
Models can record custom metrics via `self.record_metric(name, value, mode)` in their predict method. These are sent as `Metric` messages on the slot channel. The `mode` controls how metrics aggregate:
353+
354+
- `replace` -- overwrite any existing value
355+
- `increment` -- add to the current value (numeric)
356+
- `append` -- append to a list
357+
358+
Metrics appear in the prediction response's `metrics` object alongside the built-in `predict_time`.
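
The three modes can be read as the following aggregation rule, sketched here over a plain dictionary. The helper name `apply_metric` is hypothetical, and the behavior for a metric that does not exist yet (start from 0 for `increment`, start a new list for `append`) is an assumption rather than documented behavior.

```python
from typing import Any, Dict


def apply_metric(metrics: Dict[str, Any], name: str, value: Any, mode: str = "replace") -> None:
    """Fold one Metric message into the accumulated metrics object."""
    if mode == "replace":
        metrics[name] = value
    elif mode == "increment":
        # Assumption: a missing metric counts as 0.
        metrics[name] = metrics.get(name, 0) + value
    elif mode == "append":
        # Assumption: a missing metric starts as an empty list.
        metrics.setdefault(name, []).append(value)
    else:
        raise ValueError(f"unknown metric mode: {mode!r}")


# Usage: two increments, two appends, then an overwrite.
metrics: Dict[str, Any] = {}
apply_metric(metrics, "tokens", 5, "increment")
apply_metric(metrics, "tokens", 3, "increment")
apply_metric(metrics, "stages", "load", "append")
apply_metric(metrics, "stages", "run", "append")
apply_metric(metrics, "model", "v2", "replace")
```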
359+
360+
## User-Defined Healthchecks
361+
362+
Models can implement a custom healthcheck that runs alongside the built-in health state machine. The parent sends `Healthcheck { id }` on the control channel; the worker runs the user's healthcheck and responds with `HealthcheckResult { id, status, error }`.
363+
364+
If the healthcheck fails, the HTTP `/health-check` endpoint returns `UNHEALTHY` -- but this is transient and doesn't change the internal `Health` state. The model stays `READY` and continues accepting predictions.
365+
305366
## Environment Variables
306367

307368
| Variable | Default | Purpose |
308369
|----------|---------|---------|
309370
| `PORT` | 5000 | HTTP server port |
310-
| `COG_LOG_LEVEL` | INFO | Logging verbosity |
371+
| `COG_LOG_LEVEL` | INFO | Logging verbosity (ignored if `RUST_LOG` is set) |
311372
| `COG_MAX_CONCURRENCY` | 1 | Number of concurrent prediction slots |
373+
| `COG_SETUP_TIMEOUT` | none | Setup timeout in seconds (0 is ignored) |
374+
| `COG_THROTTLE_RESPONSE_INTERVAL` | 0.5s | Webhook response throttling interval |
375+
| `LOG_FORMAT` | json | Set to `console` for human-readable log output |
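
As a reading aid, the defaults and precedence rules in this table can be written out as a small loader. This is a sketch, not the runtime's actual parsing code: `RuntimeConfig` and `load_config` are hypothetical names, and `COG_THROTTLE_RESPONSE_INTERVAL` is left out because the table does not specify its value format.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RuntimeConfig:
    port: int
    log_level: Optional[str]        # None when RUST_LOG takes precedence
    max_concurrency: int
    setup_timeout: Optional[float]  # None means no timeout
    log_format: str


def load_config(env: Mapping[str, str] = os.environ) -> RuntimeConfig:
    """Apply the documented defaults to a mapping of environment variables."""
    timeout = float(env.get("COG_SETUP_TIMEOUT", "0") or 0)
    return RuntimeConfig(
        port=int(env.get("PORT", "5000")),
        # COG_LOG_LEVEL is ignored if RUST_LOG is set.
        log_level=None if env.get("RUST_LOG") else env.get("COG_LOG_LEVEL", "INFO"),
        max_concurrency=int(env.get("COG_MAX_CONCURRENCY", "1")),
        # A value of 0 is ignored, i.e. treated the same as unset.
        setup_timeout=timeout if timeout > 0 else None,
        log_format=env.get("LOG_FORMAT", "json"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the precedence rules easy to check in isolation.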
312376

313377
## Where to Look
314378

architecture/06-cli.md

Lines changed: 40 additions & 18 deletions
@@ -169,6 +169,18 @@ Stores credentials for `cog push`.
169169

170170
**Code**: `pkg/cli/login.go`
171171

172+
---
173+
174+
### Hidden / Internal Commands
175+
176+
These commands exist but are hidden from `cog --help`:
177+
178+
- **`cog debug`** -- Generates the Dockerfile from cog.yaml without building (useful for debugging build issues)
179+
- **`cog inspect`** -- Inspects model images and OCI indices
180+
- **`cog weights`** -- Parent command for `weights build`, `weights push`, `weights inspect`
181+
182+
There's also a separate `base-image` binary (`cmd/base-image/`) with subcommands for managing Cog base images (`dockerfile`, `build`, `generate-matrix`). This isn't a `cog` subcommand.
183+
172184
## How CLI Commands Interact with Containers
173185

174186
Commands like `predict`, `train`, and `serve` follow the same pattern: build an image, start a container, communicate via HTTP. The CLI never runs model code directly.
@@ -221,21 +233,31 @@ pkg/cli/
221233
└── init.go # cog init
222234
```
223235

224-
Commands delegate to packages:
225-
- `pkg/image/` - Image building
226-
- `pkg/dockerfile/` - Dockerfile generation
227-
- `pkg/docker/` - Docker client operations
228-
- `pkg/config/` - cog.yaml parsing
229-
- `pkg/web/` - Replicate API client
230-
- `pkg/predict/` - Local prediction execution
231-
232-
## Code References
233-
234-
| File | Purpose |
235-
|------|---------|
236-
| `pkg/cli/root.go` | Command registration |
237-
| `pkg/cli/build.go` | Build command |
238-
| `pkg/cli/predict.go` | Predict command, input parsing |
239-
| `pkg/cli/push.go` | Push command |
240-
| `pkg/image/build.go` | Build orchestration |
241-
| `pkg/predict/predictor.go` | Local prediction client |
236+
Commands delegate to packages under `pkg/`:
237+
238+
**Core:**
239+
- `pkg/cli/` -- Cobra command definitions
240+
- `pkg/config/` -- cog.yaml parsing and validation, compatibility matrices
241+
- `pkg/image/` -- Build orchestration (ties together config, Dockerfile generation, schema gen)
242+
- `pkg/dockerfile/` -- Dockerfile generation and base image selection
243+
- `pkg/docker/` -- Docker client operations
244+
- `pkg/predict/` -- Local prediction execution (talks to container's HTTP API)
245+
- `pkg/schema/` -- Static schema generator (tree-sitter, experimental)
246+
- `pkg/wheels/` -- SDK and coglet wheel resolution
247+
248+
**Infrastructure:**
249+
- `pkg/web/` -- Replicate API client (for `cog push`)
250+
- `pkg/http/` -- Authenticated HTTP transport
251+
- `pkg/registry/` -- OCI/Docker registry client
252+
- `pkg/model/` -- OCI artifact domain model
253+
- `pkg/weights/` -- Weight file discovery and checksums
254+
- `pkg/errors/` -- `CodedError` for user-facing errors with error codes
255+
256+
**Utilities:**
257+
- `pkg/dockercontext/` -- Docker build context directory management
258+
- `pkg/dockerignore/` -- `.dockerignore` parsing
259+
- `pkg/requirements/` -- `requirements.txt` parsing
260+
- `pkg/env/` -- `R8_*` environment variable config
261+
- `pkg/update/` -- CLI version update checker
262+
- `pkg/global/` -- Build-time metadata, process-wide config
263+
- `pkg/provider/` -- Abstracts registry-specific behavior for push workflows

crates/README.md

Lines changed: 7 additions & 2 deletions
@@ -161,8 +161,11 @@ crates/
161161
│ ├── service.rs # PredictionService
162162
│ ├── webhook.rs # WebhookSender (retry, trace context)
163163
│ ├── version.rs # Version info
164-
│ ├── webhook.rs # Webhook sender
165164
│ ├── orchestrator.rs # Worker lifecycle, event loop (parent)
165+
│ ├── fd_redirect.rs # File descriptor redirection
166+
│ ├── input_validation.rs # Input validation against schema
167+
│ ├── setup_log_accumulator.rs # Accumulates logs during setup()
168+
│ ├── worker_tracing_layer.rs # Tracing layer for worker process
166169
│ ├── worker.rs # Worker event loop (child)
167170
│ ├── bridge/
168171
│ │ ├── mod.rs
@@ -191,7 +194,9 @@ crates/
191194
├── output.rs # Output serialization
192195
├── log_writer.rs # SlotLogWriter, ContextVar routing
193196
├── audit.rs # Audit hook, TeeWriter
194-
└── cancel.rs # Cancellation support
197+
├── cancel.rs # Cancellation support
198+
├── metric_scope.rs # Scope and MetricRecorder for record_metric()
199+
└── bin/stub_gen.rs # Type stub generator
195200
```
196201

197202
## Bridge Protocol

skills/updating-architecture-docs/SKILL.md

Lines changed: 2 additions & 0 deletions
@@ -40,6 +40,8 @@ The first gives you the mental model. The second just restates the code. A reade
4040

4141
Reference source locations at the **package/directory level** with a description of what that package owns. Specific file paths and line numbers rot as code moves around. A pointer like "`crates/coglet/src/bridge/` -- IPC protocol and transport" stays accurate through refactors. "`bridge/protocol.rs:69` -- ControlRequest enum" doesn't.
4242

43+
Only document packages that matter for understanding the system's shape. Generic utility packages (`pkg/util/`, `pkg/path/`, etc.) don't need a mention -- their existence is obvious and they don't help a reader build a mental model. If someone needs them, they'll find them.
44+
4345
When a specific file reference is genuinely useful (a key entry point, a non-obvious starting point for understanding a subsystem), include it -- but prefer "the `PredictionService` in `service.rs`" over a line number.
4446

4547
### Document boundaries, not internals
