Skip to content

Commit 792fd3d

Browse files
Merge pull request #377 from numtide/feat/observer-history-and-check-endpoints
feat(observer): add history and on-demand check endpoints
2 parents 4f52429 + 63c3568 commit 792fd3d

File tree

12 files changed

+711
-15
lines changed

12 files changed

+711
-15
lines changed

config/samples/default-templates/shard.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ spec:
1616
pools:
1717
default:
1818
type: "readWrite"
19-
replicasPerCell: 3
19+
replicasPerCell: 4
2020
storage:
2121
size: "2Gi"
2222
class: "standard"

config/samples/overrides.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,6 @@ spec:
4444
# Define a new pool not in template. This will produce a warning in case it's a typo.
4545
"analytics":
4646
type: "readOnly"
47-
replicasPerCell: 3
47+
replicasPerCell: 4
4848
storage:
4949
size: "50Gi"

config/samples/templates/shard.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ spec:
1616
pools:
1717
main-app:
1818
type: "readWrite"
19-
replicasPerCell: 3
19+
replicasPerCell: 4
2020
storage:
2121
class: "standard"
2222
size: "100Gi"

tools/observer/README.md

Lines changed: 20 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -48,14 +48,28 @@ Findings are structured JSON — one line per finding with severity, check name,
4848

4949
### Diagnostic API
5050

51-
The observer exposes a structured JSON endpoint with the complete diagnostic snapshot from the latest cycle:
51+
The observer exposes structured JSON endpoints for diagnostics:
5252

5353
```bash
5454
KUBECONFIG=kubeconfig.yaml kubectl port-forward svc/multigres-observer -n multigres-operator 9090:9090
55+
```
56+
57+
**Latest cycle snapshot:**
58+
```bash
5559
curl -s http://localhost:9090/api/status | jq .
5660
```
5761

58-
The response includes findings (what's wrong), raw probe data (full picture), per-check health, and coverage info. See [Architecture](docs/architecture.md) for the full JSON schema.
62+
**Finding history** (persistent, transient, and flapping classification across cycles):
63+
```bash
64+
curl -s http://localhost:9090/api/history | jq .
65+
```
66+
67+
**On-demand targeted check** (immediate, does not wait for next cycle):
68+
```bash
69+
curl -s 'http://localhost:9090/api/check?categories=pod-health,connectivity' | jq .
70+
```
71+
72+
See [Architecture](docs/architecture.md) for the full JSON schemas and endpoint details.
5973

6074
### Prometheus Metrics
6175

@@ -88,7 +102,7 @@ curl http://localhost:9090/metrics
88102
CRDs, Pods, Events, Logs PostgreSQL on pool pods
89103
```
90104

91-
The observer is **stateless** — no PVC, no persistent storage. Every cycle re-evaluates from scratch. In-memory trackers (restart counts, phase timestamps, drain state) reset when the pod restarts.
105+
The observer is **mostly stateless** — no PVC, no persistent storage. Every cycle re-evaluates from scratch. In-memory trackers (restart counts, phase timestamps, drain state, finding history) reset when the pod restarts.
92106

93107
## What It Detects
94108

@@ -147,7 +161,9 @@ tools/observer/
147161
│ └── main.go # Entrypoint, flags, metrics server
148162
├── pkg/
149163
│ ├── observer/ # 10 check categories + API
150-
│ │ ├── observer.go # Main loop, cycle orchestration, StatusHandler
164+
│ │ ├── observer.go # Main loop, cycle orchestration, HTTP handlers
165+
│ │ ├── history.go # Finding history ring buffer + classification
166+
│ │ ├── history_test.go # Unit tests for history
151167
│ │ ├── probes.go # Per-cycle probe data collector
152168
│ │ ├── snapshot.go # Thread-safe latest-cycle snapshot store
153169
│ │ ├── pods.go # Pod health, restarts, OOM, counts

tools/observer/cmd/multigres-observer/main.go

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,7 @@ func main() {
3434
metricsAddr string
3535
logTailLines int
3636
enableSQLProbe bool
37+
historyCapacity int
3738
)
3839

3940
flag.StringVar(
@@ -69,6 +70,12 @@ func main() {
6970
true,
7071
"Enable SQL probes for replication health and connectivity checks",
7172
)
73+
flag.IntVar(
74+
&historyCapacity,
75+
"history-capacity",
76+
30,
77+
"Number of observer cycles to retain in finding history (default 30 = ~5 min at 10s interval)",
78+
)
7279
flag.Parse()
7380

7481
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelDebug}))
@@ -103,6 +110,7 @@ func main() {
103110
Logger: logger,
104111
LogTailLines: logTailLines,
105112
EnableSQLProbe: enableSQLProbe,
113+
HistoryCapacity: historyCapacity,
106114
})
107115

108116
ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
@@ -114,6 +122,8 @@ func main() {
114122
w.WriteHeader(http.StatusOK)
115123
})
116124
mux.HandleFunc("/api/status", obs.StatusHandler())
125+
mux.HandleFunc("/api/history", obs.HistoryHandler())
126+
mux.HandleFunc("/api/check", obs.CheckHandler())
117127

118128
srv := &http.Server{Addr: metricsAddr, Handler: mux, ReadHeaderTimeout: 5 * time.Second}
119129
go func() {

tools/observer/docs/architecture.md

Lines changed: 87 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -43,8 +43,11 @@ The observer runs a ticker loop. Each tick executes all 10 check categories sequ
4343
│ track("replication", checkReplication) │
4444
│ │
4545
│ → Summary: {findings: N, errors: M, fatals: K} │
46+
│ → history.Record(start, end, findings) │
4647
│ → Metric: multigres_observer_observer_cycle_duration_seconds│
4748
│ → Metric: multigres_observer_check_healthy{check=X} = 0|1 │
49+
│ │
50+
│ Run() select also listens on onDemandCh for /api/check │
4851
└──────────────────────────────────────────────────────────────┘
4952
```
5053

@@ -175,7 +178,7 @@ Returns the latest cycle's complete diagnostic snapshot as JSON. Returns `503 Se
175178

176179
### Data Flow
177180

178-
Each `runCycle()` creates a fresh `ProbeCollector`. Check functions append raw probe data to the collector alongside reporting findings. At the end of the cycle, the reporter returns findings via `SummaryWithFindings()`, and the observer atomically stores a `StatusResponse` snapshot. The HTTP handler reads the latest snapshot under a `sync.RWMutex`.
181+
Each `runCycle()` creates a fresh `ProbeCollector`. Check functions append raw probe data to the collector alongside reporting findings. At the end of the cycle, the reporter returns findings via `SummaryWithFindings()`, the observer atomically stores a `StatusResponse` snapshot, and records the cycle's findings into the history ring buffer.
179182

180183
```
181184
runCycle()
@@ -184,12 +187,90 @@ runCycle()
184187
├── track("connectivity", ...) → findings + probes.RecordProbe(...)
185188
├── ...
186189
├── reporter.SummaryWithFindings() → []Finding, Summary, healthy
187-
└── snap.Store(&StatusResponse{...})
190+
├── snap.Store(&StatusResponse{...})
191+
└── history.Record(start, end, findings)
192+
193+
StatusHandler() → snap.Load() → JSON response
194+
HistoryHandler() → history.Build() → classified occurrences
195+
CheckHandler() → onDemandCh → runOnDemand() → temporary StatusResponse
196+
```
197+
198+
On-demand checks (`/api/check`) are dispatched through a channel to the `Run` goroutine's select loop, ensuring check functions never race with the ticker cycle. A temporary reporter isolates findings from the main cycle.
199+
200+
### `GET /api/history`
201+
202+
Returns finding history across recent observer cycles. Findings are classified by their behavior over time. Returns `200` immediately with the current history window.
203+
204+
```json
205+
{
206+
"totalCycles": 30,
207+
"windowStart": "2026-03-06T09:55:00Z",
208+
"windowEnd": "2026-03-06T10:00:00Z",
209+
"persistent": [
210+
{
211+
"key": "a1b2c3d4e5f60718",
212+
"check": "connectivity",
213+
"component": "multigateway-svc",
214+
"message": "multigateway-pg: TCP probe failed ...",
215+
"severity": "error",
216+
"firstSeen": "2026-03-06T09:55:10Z",
217+
"lastSeen": "2026-03-06T10:00:00Z",
218+
"count": 30,
219+
"active": true
220+
}
221+
],
222+
"transient": [],
223+
"flapping": [],
224+
"cycles": [
225+
{
226+
"cycleStart": "2026-03-06T09:59:50Z",
227+
"cycleEnd": "2026-03-06T10:00:00Z",
228+
"findings": [...]
229+
}
230+
]
231+
}
232+
```
233+
234+
**Classification rules:**
235+
236+
| Category | Condition | Meaning |
237+
|----------|-----------|---------|
238+
| `persistent` | Active, appeared in 75%+ of cycles (or <3 cycles total) | Consistently present — likely a real issue |
239+
| `transient` | Resolved (no longer active) | Appeared then went away — may be expected during operations |
240+
| `flapping` | Active, 3+ appearances but <75% of cycles | Intermittent — possible race condition or instability |
241+
242+
The history uses a ring buffer sized by `--history-capacity` (default 30 cycles). Resolved occurrences older than the oldest cycle in the buffer are automatically pruned. Finding identity is based on a truncated SHA-256 of `check|component|message`.
243+
244+
### `GET /api/check`
245+
246+
Triggers an immediate on-demand check without waiting for the next ticker cycle. Returns a `StatusResponse` (same schema as `/api/status`) scoped to the requested check categories.
188247

189-
StatusHandler()
190-
└── snap.Load() → JSON response
248+
**Query parameters:**
249+
250+
| Parameter | Required | Description |
251+
|-----------|----------|-------------|
252+
| `categories` | No | Comma-separated list of check categories to run. If omitted, all checks are run. |
253+
254+
**Valid categories:** `pod-health`, `resource-validation`, `crd-status`, `drain-state`, `connectivity`, `logs`, `events`, `topology`, `replication`
255+
256+
```bash
257+
# Run only pod-health and connectivity checks
258+
curl -s 'http://localhost:9090/api/check?categories=pod-health,connectivity' | jq .
259+
260+
# Run all checks on demand
261+
curl -s http://localhost:9090/api/check | jq .
191262
```
192263

264+
**Error responses:**
265+
266+
| Status | Condition |
267+
|--------|-----------|
268+
| `400 Bad Request` | No valid categories in the `categories` parameter |
269+
| `429 Too Many Requests` | Another on-demand check is already in progress |
270+
| `504 Gateway Timeout` | Check did not complete within 30 seconds |
271+
272+
**Design note:** On-demand checks execute within the observer's main `Run` goroutine via a channel-based request/response. This is necessary because check functions mutate shared state (`podStartup`, `knownPodNames`, `prevRestarts`). The handler creates a temporary `Reporter` so findings do not leak into the main cycle's reporter. The on-demand check does NOT update the `/api/status` snapshot or the finding history.
273+
193274
### Other Endpoints
194275

195276
| Path | Description |
@@ -283,6 +364,8 @@ The observer maintains several tracking maps that persist across cycles but rese
283364
| `podStartup` | `pod-name` | Pool pod creation time + readiness for startup grace period |
284365
| `lastLogCheck` | single timestamp | Avoid re-tailing already-checked logs |
285366
| `lastEventResourceVersion` | single string | Only process new events each cycle |
367+
| `history` | `*findingHistory` | Ring buffer of cycle records + finding occurrence tracking |
368+
| `onDemandCh` | `chan checkRequest` | Channel for on-demand check requests from `/api/check` |
286369

287370
This state is purely observational — losing it on restart is safe. The observer re-converges within 1-2 cycles.
288371

tools/observer/docs/configuration.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,9 +13,10 @@ All flags, environment variables, thresholds, and tunable parameters for the obs
1313
| `--interval` | `10s` || Time between observer cycles |
1414
| `--kubeconfig` | `""` | `KUBECONFIG` | Path to kubeconfig. Empty = in-cluster config |
1515
| `--once` | `false` || Run one cycle and exit (useful for CI) |
16-
| `--metrics-addr` | `:9090` || Address for Prometheus metrics, health, and `/api/status` endpoint |
16+
| `--metrics-addr` | `:9090` || Address for Prometheus metrics, health, and API endpoints (`/api/status`, `/api/history`, `/api/check`) |
1717
| `--log-tail-lines` | `100` || Lines to tail from each container per cycle |
1818
| `--enable-sql-probe` | `true` || Enable SQL probes for replication health and gateway connectivity |
19+
| `--history-capacity` | `30` || Number of observer cycles to retain in finding history (30 = ~5 min at 10s interval) |
1920

2021
Environment variables override the corresponding flag only when the flag is at its default value. The flag always takes precedence.
2122

0 commit comments

Comments
 (0)