The response includes findings (what's wrong), raw probe data (full picture), per-check health, and coverage info. See [Architecture](docs/architecture.md) for the full JSON schema.
+**Finding history** (persistent, transient, and flapping classification across cycles):
+```bash
+curl -s http://localhost:9090/api/history | jq .
+```
+
+**On-demand targeted check** (immediate, does not wait for next cycle):
-The observer is **stateless** — no PVC, no persistent storage. Every cycle re-evaluates from scratch. In-memory trackers (restart counts, phase timestamps, drain state) reset when the pod restarts.
+The observer is **mostly stateless** — no PVC, no persistent storage. Every cycle re-evaluates from scratch. In-memory trackers (restart counts, phase timestamps, drain state, finding history) reset when the pod restarts.
## What It Detects
@@ -147,7 +161,9 @@ tools/observer/
│ └── main.go # Entrypoint, flags, metrics server
├── pkg/
│ ├── observer/ # 10 check categories + API
-│ │ ├── observer.go # Main loop, cycle orchestration, StatusHandler
@@ -175,7 +178,7 @@ Returns the latest cycle's complete diagnostic snapshot as JSON. Returns `503 Se
### Data Flow

-Each `runCycle()` creates a fresh `ProbeCollector`. Check functions append raw probe data to the collector alongside reporting findings. At the end of the cycle, the reporter returns findings via `SummaryWithFindings()`, and the observer atomically stores a `StatusResponse` snapshot. The HTTP handler reads the latest snapshot under a `sync.RWMutex`.
+Each `runCycle()` creates a fresh `ProbeCollector`. Check functions append raw probe data to the collector alongside reporting findings. At the end of the cycle, the reporter returns findings via `SummaryWithFindings()`, the observer atomically stores a `StatusResponse` snapshot, and records the cycle's findings into the history ring buffer.
On-demand checks (`/api/check`) are dispatched through a channel to the `Run` goroutine's select loop, ensuring check functions never race with the ticker cycle. A temporary reporter isolates findings from the main cycle.
+
+### `GET /api/history`
+
+Returns finding history across recent observer cycles. Findings are classified by their behavior over time. Returns `200` immediately with the current history window.
+|`persistent`| Active, appeared in 75%+ of cycles (or <3 cycles total) | Consistently present — likely a real issue |
+|`transient`| Resolved (no longer active) | Appeared then went away — may be expected during operations |
+|`flapping`| Active, 3+ appearances but <75% of cycles | Intermittent — possible race condition or instability |
+
+The history uses a ring buffer sized by `--history-capacity` (default 30 cycles). Resolved occurrences older than the oldest cycle in the buffer are automatically pruned. Finding identity is based on a truncated SHA-256 of `check|component|message`.
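The classification rules and the identity hash can be approximated in a few lines. This is a sketch under stated assumptions, not the shipped logic: it reads "<3 cycles total" as "the finding has appeared in fewer than three cycles", which makes the three classes exhaustive, and it truncates the hash to 16 hex characters because the text only says "truncated". The function names and sample values are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// findingID hashes "check|component|message" and keeps a prefix.
// The 16-hex-character length is an assumption; the source only says "truncated".
func findingID(check, component, message string) string {
	sum := sha256.Sum256([]byte(check + "|" + component + "|" + message))
	return hex.EncodeToString(sum[:])[:16]
}

// classify applies the table's rules. seen is the number of cycles in the
// window that contained the finding; total is the number of cycles in the window.
func classify(active bool, seen, total int) string {
	switch {
	case !active:
		return "transient" // resolved, whatever its past frequency
	case seen < 3 || seen*4 >= total*3:
		return "persistent" // too few appearances to judge, or present in 75%+ of cycles
	default:
		return "flapping" // active, 3+ appearances, under 75% of cycles
	}
}

func main() {
	fmt.Println(len(findingID("pods", "api-server", "CrashLoopBackOff"))) // 16
	fmt.Println(classify(true, 28, 30))                                   // persistent
	fmt.Println(classify(true, 5, 30))                                    // flapping
	fmt.Println(classify(false, 5, 30))                                   // transient
	fmt.Println(classify(true, 2, 30))                                    // persistent
}
```

Comparing `seen*4 >= total*3` keeps the 75% threshold in integer arithmetic and avoids floating-point rounding at the boundary.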
+
+### `GET /api/check`
+
+Triggers an immediate on-demand check without waiting for the next ticker cycle. Returns a `StatusResponse` (same schema as `/api/status`) scoped to the requested check categories.

-StatusHandler()
-└── snap.Load() → JSON response
+**Query parameters:**
+
+| Parameter | Required | Description |
+|-----------|----------|-------------|
+|`categories`| No | Comma-separated list of check categories to run. If omitted, all checks are run. |
+|`400 Bad Request`| No valid categories in the `categories` parameter |
+|`429 Too Many Requests`| Another on-demand check is already in progress |
+|`504 Gateway Timeout`| Check did not complete within 30 seconds |
+
+**Design note:** On-demand checks execute within the observer's main `Run` goroutine via a channel-based request/response. This is necessary because check functions mutate shared state (`podStartup`, `knownPodNames`, `prevRestarts`). The handler creates a temporary `Reporter` so findings do not leak into the main cycle's reporter. The on-demand check does NOT update the `/api/status` snapshot or the finding history.
+
### Other Endpoints

| Path | Description |
@@ -283,6 +364,8 @@ The observer maintains several tracking maps that persist across cycles but rese
| `podStartup` | `pod-name` | Pool pod creation time + readiness for startup grace period |