diff --git a/config/samples/default-templates/shard.yaml b/config/samples/default-templates/shard.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/config/samples/overrides.yaml b/config/samples/overrides.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/config/samples/templates/shard.yaml b/config/samples/templates/shard.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/tools/observer/README.md b/tools/observer/README.md
Lines changed: 20 additions & 4 deletions
diff --git a/tools/observer/cmd/multigres-observer/main.go b/tools/observer/cmd/multigres-observer/main.go
Lines changed: 10 additions & 0 deletions
diff --git a/tools/observer/docs/architecture.md b/tools/observer/docs/architecture.md
Lines changed: 87 additions & 4 deletions
diff --git a/tools/observer/docs/configuration.md b/tools/observer/docs/configuration.md
Lines changed: 2 additions & 1 deletion
@@ -16,7 +16,7 @@ spec:
   pools:
     default:
       type: "readWrite"
-      replicasPerCell: 3
+      replicasPerCell: 4
       storage:
         size: "2Gi"
         class: "standard"
 
@@ -44,6 +44,6 @@ spec:
                     # Define a new pool not in template. This will produce a warning in case it's a typo.
                   "analytics":
                     type: "readOnly"
-                    replicasPerCell: 3
+                    replicasPerCell: 4
                     storage:
                       size: "50Gi"
@@ -16,7 +16,7 @@ spec:
   pools:
     main-app:
       type: "readWrite"
-      replicasPerCell: 3
+      replicasPerCell: 4
       storage:
         class: "standard"
         size: "100Gi"
 
@@ -48,14 +48,28 @@ Findings are structured JSON — one line per finding with severity, check name,
 
 ### Diagnostic API
 
-The observer exposes a structured JSON endpoint with the complete diagnostic snapshot from the latest cycle:
+The observer exposes structured JSON endpoints for diagnostics:
 
 ```bash
 KUBECONFIG=kubeconfig.yaml kubectl port-forward svc/multigres-observer -n multigres-operator 9090:9090
+```
+
+**Latest cycle snapshot:**
+```bash
 curl -s http://localhost:9090/api/status | jq .
 ```
 
-The response includes findings (what's wrong), raw probe data (full picture), per-check health, and coverage info. See [Architecture](docs/architecture.md) for the full JSON schema.
+**Finding history** (persistent, transient, and flapping classification across cycles):
+```bash
+curl -s http://localhost:9090/api/history | jq .
+```
+
+**On-demand targeted check** (immediate, does not wait for next cycle):
+```bash
+curl -s 'http://localhost:9090/api/check?categories=pod-health,connectivity' | jq .
+```
+
+See [Architecture](docs/architecture.md) for the full JSON schemas and endpoint details.
 
 ### Prometheus Metrics
 
@@ -88,7 +102,7 @@ curl http://localhost:9090/metrics
    CRDs, Pods, Events, Logs      PostgreSQL on pool pods
 ```
 
-The observer is **stateless** — no PVC, no persistent storage. Every cycle re-evaluates from scratch. In-memory trackers (restart counts, phase timestamps, drain state) reset when the pod restarts.
+The observer is **mostly stateless** — no PVC, no persistent storage. Every cycle re-evaluates from scratch. In-memory trackers (restart counts, phase timestamps, drain state, finding history) reset when the pod restarts.
 
 ## What It Detects
 
@@ -147,7 +161,9 @@ tools/observer/
 │   └── main.go                    # Entrypoint, flags, metrics server
 ├── pkg/
 │   ├── observer/                  # 10 check categories + API
-│   │   ├── observer.go            # Main loop, cycle orchestration, StatusHandler
+│   │   ├── observer.go            # Main loop, cycle orchestration, HTTP handlers
+│   │   ├── history.go             # Finding history ring buffer + classification
+│   │   ├── history_test.go        # Unit tests for history
 │   │   ├── probes.go              # Per-cycle probe data collector
 │   │   ├── snapshot.go            # Thread-safe latest-cycle snapshot store
 │   │   ├── pods.go                # Pod health, restarts, OOM, counts
 
@@ -34,6 +34,7 @@ func main() {
 		metricsAddr       string
 		logTailLines      int
 		enableSQLProbe    bool
+		historyCapacity   int
 	)
 
 	flag.StringVar(
@@ -69,6 +70,12 @@ func main() {
 		true,
 		"Enable SQL probes for replication health and connectivity checks",
 	)
+	flag.IntVar(
+		&historyCapacity,
+		"history-capacity",
+		30,
+		"Number of observer cycles to retain in finding history (30 cycles is about 5 min at a 10s interval)",
+	)
 	flag.Parse()
 
 	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelDebug}))
@@ -103,6 +110,7 @@ func main() {
 		Logger:            logger,
 		LogTailLines:      logTailLines,
 		EnableSQLProbe:    enableSQLProbe,
+		HistoryCapacity:   historyCapacity,
 	})
 
 	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
@@ -114,6 +122,8 @@ func main() {
 		w.WriteHeader(http.StatusOK)
 	})
 	mux.HandleFunc("/api/status", obs.StatusHandler())
+	mux.HandleFunc("/api/history", obs.HistoryHandler())
+	mux.HandleFunc("/api/check", obs.CheckHandler())
 
 	srv := &http.Server{Addr: metricsAddr, Handler: mux, ReadHeaderTimeout: 5 * time.Second}
 	go func() {
 
@@ -43,8 +43,11 @@ The observer runs a ticker loop. Each tick executes all 10 check categories sequ
 │  track("replication",         checkReplication)              │
 │                                                              │
 │  → Summary: {findings: N, errors: M, fatals: K}              │
+│  → history.Record(start, end, findings)                      │
 │  → Metric: multigres_observer_observer_cycle_duration_seconds│
 │  → Metric: multigres_observer_check_healthy{check=X} = 0|1   │
+│                                                              │
+│  Run() select also listens on onDemandCh for /api/check      │
 └──────────────────────────────────────────────────────────────┘
 ```
 
@@ -175,7 +178,7 @@ Returns the latest cycle's complete diagnostic snapshot as JSON. Returns `503 Se
 
 ### Data Flow
 
-Each `runCycle()` creates a fresh `ProbeCollector`. Check functions append raw probe data to the collector alongside reporting findings. At the end of the cycle, the reporter returns findings via `SummaryWithFindings()`, and the observer atomically stores a `StatusResponse` snapshot. The HTTP handler reads the latest snapshot under a `sync.RWMutex`.
+Each `runCycle()` creates a fresh `ProbeCollector`. Check functions append raw probe data to the collector alongside reporting findings. At the end of the cycle, the reporter returns findings via `SummaryWithFindings()`; the observer then atomically stores a `StatusResponse` snapshot and records the cycle's findings in the history ring buffer.
 
 ```
 runCycle()
@@ -184,12 +187,90 @@ runCycle()
   ├── track("connectivity", ...) → findings + probes.RecordProbe(...)
   ├── ...
   ├── reporter.SummaryWithFindings() → []Finding, Summary, healthy
-  └── snap.Store(&StatusResponse{...})
+  ├── snap.Store(&StatusResponse{...})
+  └── history.Record(start, end, findings)
+
+StatusHandler()   → snap.Load() → JSON response
+HistoryHandler()  → history.Build() → classified occurrences
+CheckHandler()    → onDemandCh → runOnDemand() → temporary StatusResponse
+```
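
The `history.Record(...)` step in the flow above writes into a fixed-capacity ring buffer. The following is a minimal sketch of that idea, not the code in `history.go`: the `cycleRecord` and `ring` types, their fields, and the `Snapshot` method are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// cycleRecord is a hypothetical, trimmed-down record of one observer cycle.
type cycleRecord struct {
	seq      int
	findings []string
}

// ring keeps the most recent len(buf) records and overwrites the oldest one
// once it is full.
type ring struct {
	buf   []cycleRecord
	next  int // index the next Record call writes to
	count int // number of valid records, at most len(buf)
}

func newRing(capacity int) *ring {
	if capacity < 1 {
		capacity = 1 // guard against a zero or negative capacity
	}
	return &ring{buf: make([]cycleRecord, capacity)}
}

// Record stores rec, replacing the oldest record when the buffer is full.
func (r *ring) Record(rec cycleRecord) {
	r.buf[r.next] = rec
	r.next = (r.next + 1) % len(r.buf)
	if r.count < len(r.buf) {
		r.count++
	}
}

// Snapshot returns the retained records, oldest first.
func (r *ring) Snapshot() []cycleRecord {
	out := make([]cycleRecord, 0, r.count)
	start := (r.next - r.count + len(r.buf)) % len(r.buf)
	for i := 0; i < r.count; i++ {
		out = append(out, r.buf[(start+i)%len(r.buf)])
	}
	return out
}

func main() {
	r := newRing(3)
	for i := 1; i <= 5; i++ {
		r.Record(cycleRecord{seq: i})
	}
	for _, rec := range r.Snapshot() {
		fmt.Println(rec.seq) // cycles 1 and 2 have been overwritten
	}
}
```

A fixed-size slice keeps memory bounded regardless of how long the observer runs, which fits the "no persistent storage" design described in the README.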
+
+On-demand checks (`/api/check`) are dispatched through a channel to the `Run` goroutine's select loop, ensuring check functions never race with the ticker cycle. A temporary reporter isolates findings from the main cycle.
+
+### `GET /api/history`
+
+Returns finding history across recent observer cycles. Findings are classified by their behavior over time. Returns `200` immediately with the current history window.
+
+```json
+{
+  "totalCycles": 30,
+  "windowStart": "2026-03-06T09:55:00Z",
+  "windowEnd": "2026-03-06T10:00:00Z",
+  "persistent": [
+    {
+      "key": "a1b2c3d4e5f60718",
+      "check": "connectivity",
+      "component": "multigateway-svc",
+      "message": "multigateway-pg: TCP probe failed ...",
+      "severity": "error",
+      "firstSeen": "2026-03-06T09:55:10Z",
+      "lastSeen": "2026-03-06T10:00:00Z",
+      "count": 30,
+      "active": true
+    }
+  ],
+  "transient": [],
+  "flapping": [],
+  "cycles": [
+    {
+      "cycleStart": "2026-03-06T09:59:50Z",
+      "cycleEnd": "2026-03-06T10:00:00Z",
+      "findings": [...]
+    }
+  ]
+}
+```
+
+**Classification rules:**
+
+| Category | Condition | Meaning |
+|----------|-----------|---------|
+| `persistent` | Active, appeared in 75%+ of cycles (or <3 cycles total) | Consistently present — likely a real issue |
+| `transient` | Resolved (no longer active) | Appeared then went away — may be expected during operations |
+| `flapping` | Active, 3+ appearances but <75% of cycles | Intermittent — possible race condition or instability |
+
+The history uses a ring buffer sized by `--history-capacity` (default 30 cycles). Resolved occurrences older than the oldest cycle in the buffer are automatically pruned. Finding identity is based on a truncated SHA-256 of `check|component|message`.
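
The identity and classification rules above can be sketched in a few lines of Go. This is an illustration under stated assumptions rather than the code in `history.go`: the function names are hypothetical, the 8-byte truncation is inferred from the 16-hex-character key in the example response, and "<3 cycles total" is read as "the finding appeared in fewer than 3 cycles".

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// findingKey derives a stable identity from check|component|message.
// Assumption: the digest is truncated to 8 bytes (16 hex characters).
func findingKey(check, component, message string) string {
	sum := sha256.Sum256([]byte(check + "|" + component + "|" + message))
	return hex.EncodeToString(sum[:8])
}

// classify applies the table's rules to one finding occurrence.
// count is the number of cycles in which the finding appeared and
// totalCycles is the number of cycles currently in the window.
func classify(active bool, count, totalCycles int) string {
	switch {
	case !active:
		return "transient" // resolved, regardless of how often it appeared
	case count < 3:
		return "persistent" // assumption: too few appearances to call it flapping
	case count*4 >= totalCycles*3:
		return "persistent" // present in at least 75% of cycles
	default:
		return "flapping" // 3+ appearances but under 75% of cycles
	}
}

func main() {
	key := findingKey("connectivity", "multigateway-svc", "TCP probe failed")
	fmt.Println(key, classify(true, 30, 30))
}
```

The 75% comparison uses integer arithmetic (`count*4 >= totalCycles*3`) to avoid floating-point rounding at the boundary.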
+
+### `GET /api/check`
+
+Triggers an immediate on-demand check without waiting for the next ticker cycle. Returns a `StatusResponse` (same schema as `/api/status`) scoped to the requested check categories.
 
-StatusHandler()
-  └── snap.Load() → JSON response
+**Query parameters:**
+
+| Parameter | Required | Description |
+|-----------|----------|-------------|
+| `categories` | No | Comma-separated list of check categories to run. If omitted, all checks are run. |
+
+**Valid categories:** `pod-health`, `resource-validation`, `crd-status`, `drain-state`, `connectivity`, `logs`, `events`, `topology`, `replication`
+
+```bash
+# Run only pod-health and connectivity checks
+curl -s 'http://localhost:9090/api/check?categories=pod-health,connectivity' | jq .
+
+# Run all checks on demand
+curl -s http://localhost:9090/api/check | jq .
 ```
 
+**Error responses:**
+
+| Status | Condition |
+|--------|-----------|
+| `400 Bad Request` | No valid categories in the `categories` parameter |
+| `429 Too Many Requests` | Another on-demand check is already in progress |
+| `504 Gateway Timeout` | Check did not complete within 30 seconds |
+
+**Design note:** On-demand checks execute within the observer's main `Run` goroutine via a channel-based request/response. This is necessary because check functions mutate shared state (`podStartup`, `knownPodNames`, `prevRestarts`). The handler creates a temporary `Reporter` so findings do not leak into the main cycle's reporter. The on-demand check does NOT update the `/api/status` snapshot or the finding history.
+
 ### Other Endpoints
 
 | Path | Description |
@@ -283,6 +364,8 @@ The observer maintains several tracking maps that persist across cycles but rese
 | `podStartup` | `pod-name` | Pool pod creation time + readiness for startup grace period |
 | `lastLogCheck` | single timestamp | Avoid re-tailing already-checked logs |
 | `lastEventResourceVersion` | single string | Only process new events each cycle |
+| `history` | `*findingHistory` | Ring buffer of cycle records + finding occurrence tracking |
+| `onDemandCh` | `chan checkRequest` | Channel for on-demand check requests from `/api/check` |
 
 This state is purely observational — losing it on restart is safe. The observer re-converges within 1-2 cycles.
 
 
@@ -13,9 +13,10 @@ All flags, environment variables, thresholds, and tunable parameters for the obs
 | `--interval` | `10s` | — | Time between observer cycles |
 | `--kubeconfig` | `""` | `KUBECONFIG` | Path to kubeconfig. Empty = in-cluster config |
 | `--once` | `false` | — | Run one cycle and exit (useful for CI) |
-| `--metrics-addr` | `:9090` | — | Address for Prometheus metrics, health, and `/api/status` endpoint |
+| `--metrics-addr` | `:9090` | — | Address for Prometheus metrics, health, and API endpoints (`/api/status`, `/api/history`, `/api/check`) |
 | `--log-tail-lines` | `100` | — | Lines to tail from each container per cycle |
 | `--enable-sql-probe` | `true` | — | Enable SQL probes for replication health and gateway connectivity |
+| `--history-capacity` | `30` | — | Number of observer cycles to retain in finding history (30 = ~5 min at 10s interval) |
 
 Environment variables override the corresponding flag only when the flag is at its default value. The flag always takes precedence.
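
The precedence rule can be expressed as a small helper. This is a sketch only: `resolve` is a hypothetical function, and it assumes string-valued settings such as `--kubeconfig` / `KUBECONFIG`.

```go
package main

import (
	"fmt"
	"os"
)

// resolve returns the effective setting. A flag that differs from its default
// wins; otherwise a non-empty environment value is used; otherwise the default.
func resolve(flagValue, defaultValue, envValue string) string {
	if flagValue != defaultValue {
		return flagValue
	}
	if envValue != "" {
		return envValue
	}
	return defaultValue
}

func main() {
	// With --kubeconfig left at its default (""), KUBECONFIG decides.
	fmt.Println(resolve("", "", os.Getenv("KUBECONFIG")))
}
```

One consequence of comparing against the default is that passing the default value explicitly on the command line is indistinguishable from not passing the flag at all.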