Operator does not self-heal collector DaemonSet deletion, causing silent collection outage #1830

@AnkanMisra

Summary

If the collector DaemonSet gmp-system/collector is deleted or never created, the operator logs a warning and exits reconcile successfully without recreating it. Delete events are also filtered out by the controller’s watch predicates. The cluster is left without collectors indefinitely, and no status condition surfaces the degraded state.

Steps to Reproduce

  1. Deploy the operator normally so collectors are managed by the operator.
  2. Delete the collector DaemonSet: kubectl -n gmp-system delete ds collector.
  3. Wait for the operator to reconcile.

Observed Behavior

  • Operator logs a warning that the DaemonSet does not exist, returns success, and does not recreate it. See ensureCollectorDaemonSet handling of NotFound in pkg/operator/collection.go#L219-L231.
  • Delete events are not enqueued because the controller watches the DaemonSet with GenerationChangedPredicate, which ignores deletes, so no reconcile is triggered on deletion (see pkg/operator/collection.go#L106-L113).
  • Collection remains down permanently; no status condition reflects the loss of collectors.

Expected Behavior

  • The operator should reconcile on DaemonSet deletion and either recreate the DaemonSet or at least mark the status as degraded, so that collectors come back automatically or the outage is visible to users.

Impact

  • Cluster-wide metrics ingestion halts until a human notices and re-applies the DaemonSet. This is a silent data-loss mode: no collectors, no samples, and no operator status signal. In busy clusters, this can lose minutes to hours of data and break alerting with no clear indication of the cause.

Proposed Fix (high-level)

  • Watch DaemonSet delete events (do not filter them out) and enqueue reconcile on delete.
  • In ensureCollectorDaemonSet, treat NotFound as a reconcile error or recreate the DaemonSet; alternatively set a Degraded condition on the OperatorConfig status so the outage is surfaced.
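To make the two points above concrete, here is a minimal, stdlib-only sketch of the intended behavior. It is not the operator's code and does not use the controller-runtime API: fakeCluster, errNotFound, shouldEnqueue, ensureCollector, and ensureResult are all hypothetical names invented for illustration, and the in-memory map merely stands in for the API server.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes NotFound API error (hypothetical).
var errNotFound = errors.New("not found")

// fakeCluster is a hypothetical in-memory stand-in for the API server.
type fakeCluster struct {
	daemonSets map[string]bool // keyed by "namespace/name"
	failCreate bool            // when true, create calls fail
}

func newFakeCluster() *fakeCluster {
	return &fakeCluster{daemonSets: map[string]bool{}}
}

func (c *fakeCluster) get(key string) error {
	if !c.daemonSets[key] {
		return fmt.Errorf("get daemonset %q: %w", key, errNotFound)
	}
	return nil
}

func (c *fakeCluster) create(key string) error {
	if c.failCreate {
		return fmt.Errorf("create daemonset %q: simulated failure", key)
	}
	c.daemonSets[key] = true
	return nil
}

// eventKind models the watch event types a controller can receive.
type eventKind int

const (
	eventCreate eventKind = iota
	eventUpdate
	eventDelete
)

// shouldEnqueue models a watch filter that drops updates whose generation
// did not change but always lets create and delete events through, so a
// deletion triggers a reconcile.
func shouldEnqueue(kind eventKind, oldGeneration, newGeneration int64) bool {
	if kind == eventUpdate {
		return oldGeneration != newGeneration
	}
	return true
}

// ensureResult reports what one reconcile pass did.
type ensureResult struct {
	Recreated bool // the DaemonSet was missing and has been created
	Degraded  bool // the collector could not be confirmed or restored
}

// ensureCollector recreates a missing DaemonSet instead of logging a warning
// and returning success. On any failure it reports Degraded and returns the
// error so the caller can retry and surface a status condition.
func ensureCollector(c *fakeCluster, key string) (ensureResult, error) {
	err := c.get(key)
	if err == nil {
		return ensureResult{}, nil
	}
	if !errors.Is(err, errNotFound) {
		return ensureResult{Degraded: true}, err
	}
	if err := c.create(key); err != nil {
		return ensureResult{Degraded: true}, err
	}
	return ensureResult{Recreated: true}, nil
}
```

In this sketch, calling ensureCollector(newFakeCluster(), "gmp-system/collector") on an empty cluster recreates the entry and reports Recreated, while the same call with failCreate set returns an error together with Degraded. In the real operator, the Degraded flag would presumably map to a condition on the OperatorConfig status, and the filter would be expressed through whatever predicate mechanism the controller already uses.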

Discussion

Please confirm whether this behavior is intentional or a bug.

Happy to adjust my understanding or help with a fix.
