Open
Description
Summary
If the collector DaemonSet gmp-system/collector is deleted or never created, the operator logs a warning and exits reconcile successfully without recreating it. Delete events are also filtered out by the controller’s watch predicates. This leaves the cluster without collectors indefinitely, and no status condition surfaces the degraded state.
Steps to Reproduce
- Deploy the operator normally so collectors are managed by the operator.
- Delete the collector DaemonSet:
  kubectl -n gmp-system delete ds collector
- Wait for the operator to reconcile.
Observed Behavior
- Operator logs a warning that the DaemonSet does not exist, returns success, and does not recreate it. See ensureCollectorDaemonSet handling of NotFound in pkg/operator/collection.go#L219-L231.
- Delete events are not enqueued because the controller watches the DaemonSet with GenerationChangedPredicate, which ignores deletes; no reconcile is triggered on deletion (same file around pkg/operator/collection.go#L106-L113).
- Collection remains down permanently; no status condition reflects the loss of collectors.
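To make the first point concrete, here is a minimal, self-contained sketch of the pattern as described above. It is not the operator's actual code: fakeCluster, errNotFound, and ensureCollectorSilently are hypothetical stand-ins for the API client, the Kubernetes NotFound error, and ensureCollectorDaemonSet.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var errNotFound = errors.New("daemonset not found")

// fakeCluster is a hypothetical in-memory stand-in for the API server.
type fakeCluster struct {
	daemonSets map[string]bool
}

// get returns errNotFound when the named DaemonSet is absent.
func (c *fakeCluster) get(name string) error {
	if !c.daemonSets[name] {
		return errNotFound
	}
	return nil
}

// ensureCollectorSilently models the reported behavior: NotFound is logged
// and swallowed, so the reconcile reports success and nothing is recreated.
func ensureCollectorSilently(c *fakeCluster, logf func(msg string)) error {
	err := c.get("collector")
	if errors.Is(err, errNotFound) {
		logf("collector DaemonSet does not exist")
		return nil
	}
	return err
}

func main() {
	c := &fakeCluster{daemonSets: map[string]bool{}}
	err := ensureCollectorSilently(c, func(msg string) { fmt.Println("warning:", msg) })
	fmt.Println("reconcile error:", err, "collector present:", c.daemonSets["collector"])
}
```

In this model the reconcile returns nil while the collector stays absent, which is the silent failure mode this issue describes.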
Expected Behavior
- The operator should reconcile on DaemonSet deletion and either recreate the DaemonSet or at least report a degraded status, so that collectors come back automatically and the outage is visible to users.
Impact
- Cluster-wide metrics ingestion halts until a human notices and re-applies the DaemonSet. This is a silent data-loss mode: no collectors, no samples, and no operator status signal. In busy clusters, this can wipe out minutes to hours of data and break alerting without clear diagnosis.
Proposed Fix (high-level)
- Watch DaemonSet delete events (do not filter them out) and enqueue reconcile on delete.
- In ensureCollectorDaemonSet, treat NotFound as a reconcile error or recreate the DaemonSet; alternatively set a Degraded condition on the OperatorConfig status so the outage is surfaced.
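The two proposals above could be sketched roughly as follows. This is a standard-library-only model under stated assumptions, not a patch against the operator: event, shouldEnqueue, fakeCluster, status, and ensureCollector are all hypothetical names, and a real fix would go through the operator's existing client and predicate setup.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var errNotFound = errors.New("daemonset not found")

type eventKind int

const (
	eventCreate eventKind = iota
	eventUpdate
	eventDelete
)

// event is a simplified, hypothetical watch event.
type event struct {
	kind              eventKind
	generationChanged bool
}

// shouldEnqueue filters no-op updates but always lets creates and deletes
// through, so a deleted DaemonSet triggers a reconcile.
func shouldEnqueue(e event) bool {
	if e.kind == eventUpdate {
		return e.generationChanged
	}
	return e.kind == eventCreate || e.kind == eventDelete
}

// fakeCluster is a hypothetical in-memory stand-in for the API server.
type fakeCluster struct {
	daemonSets map[string]bool
	failCreate bool // simulates a failing create call
}

func (c *fakeCluster) get(name string) error {
	if !c.daemonSets[name] {
		return errNotFound
	}
	return nil
}

func (c *fakeCluster) create(name string) error {
	if c.failCreate {
		return errors.New("create rejected")
	}
	c.daemonSets[name] = true
	return nil
}

// status models a Degraded condition on the OperatorConfig.
type status struct {
	degraded bool
	reason   string
}

// ensureCollector recreates a missing DaemonSet. If recreation fails it
// marks the status degraded and returns an error so the reconcile is retried.
func ensureCollector(c *fakeCluster, st *status) error {
	err := c.get("collector")
	if err == nil {
		*st = status{}
		return nil
	}
	if !errors.Is(err, errNotFound) {
		return err
	}
	if cerr := c.create("collector"); cerr != nil {
		*st = status{degraded: true, reason: "collector DaemonSet missing"}
		return fmt.Errorf("recreate collector DaemonSet: %w", cerr)
	}
	*st = status{}
	return nil
}

func main() {
	c := &fakeCluster{daemonSets: map[string]bool{"collector": true}}
	st := &status{}

	// Simulate the deletion from the reproduction steps.
	delete(c.daemonSets, "collector")
	if shouldEnqueue(event{kind: eventDelete}) {
		err := ensureCollector(c, st)
		fmt.Println("reconcile error:", err, "collector present:", c.daemonSets["collector"])
	}
}
```

Returning an error when recreation fails keeps the item in the retry path, and the degraded status gives users a signal even while retries are in progress.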
Discussion
Please confirm whether this behavior is intentional or a bug.
Happy to adjust my understanding or help with a fix/iteration.