[ACM-27932] Remove offset 6h from ACMThanosCompactHasNotRun alert #2413
subbarao-meduri wants to merge 1 commit into main
Doesn't this just change the time it takes for the alert to recover? How does the alert work on newly installed systems? Without the offset, wouldn't it alert immediately on new installs, since it takes some time before the first blocks are uploaded?
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: subbarao-meduri, thibaultmg
3cb2e49 to 30725aa
New changes are detected. LGTM label has been removed.
7b99c27 to 7be3e38
The offset 6h in the PromQL expression creates a 6-hour blind spot that ignores recent successful uploads after compactor recovery, causing false positive alerts. This deviates from the upstream Thanos alert definition, which does not use an offset. The 24h lookback window in max_over_time already handles metric resets during pod restarts.
Signed-off-by: Subbarao Meduri <smeduri@redhat.com>
7be3e38 to ecf304c
@subbarao-meduri: The following test failed, say
Summary
Remove offset 6h from the ACMThanosCompactHasNotRun PromQL expression to align with the upstream Thanos alert definition and eliminate false positives after compactor recovery
Problem
The ACMThanosCompactHasNotRun alert expression uses offset 6h, which shifts the max_over_time lookback window 6 hours into the past (evaluating 6h–30h ago instead of 0h–24h ago). This creates a blind spot: after a compactor recovers from a crash loop or corrupted blocks, new successful uploads are invisible to the alert for up to 6 hours, causing false positive alerts even though the compactor is healthy.
Upstream references
Source of truth (jsonnet mixin):
https://github.com/thanos-io/thanos/blob/main/mixin/alerts/compact.libsonnet
Generated alerts:
https://github.com/thanos-io/thanos/blob/main/examples/alerts/alerts.yaml#L52
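The window arithmetic from the Problem description can be sketched in a few lines of Python. This is only an illustrative model, not the alert rule itself: the helper names `lookback_window` and `upload_is_visible` are hypothetical, times are expressed as "hours before evaluation time", and the exact boundary handling of PromQL range selectors is simplified.

```python
from typing import Tuple


def lookback_window(range_hours: float, offset_hours: float = 0.0) -> Tuple[float, float]:
    """Return (oldest_edge, newest_edge) of the window, in hours before evaluation time.

    Models a range selector of `range_hours` shifted into the past by `offset_hours`.
    """
    return (offset_hours + range_hours, offset_hours)


def upload_is_visible(upload_age_hours: float, range_hours: float,
                      offset_hours: float = 0.0) -> bool:
    """True if an upload that happened `upload_age_hours` ago falls inside the window.

    Boundary inclusivity is an assumption made for this sketch.
    """
    oldest, newest = lookback_window(range_hours, offset_hours)
    return newest <= upload_age_hours < oldest


if __name__ == "__main__":
    # With offset 6h, a 24h range covers 6h to 30h ago.
    print(lookback_window(24, 6))        # (30, 6)
    # Without the offset, it covers 0h to 24h ago.
    print(lookback_window(24, 0))        # (24, 0)
    # An upload made 1 hour ago, right after the compactor recovered:
    print(upload_is_visible(1, 24, 6))   # False: hidden in the 6h blind spot
    print(upload_is_visible(1, 24, 0))   # True: seen once the offset is removed
```

In this model, any upload younger than the offset stays invisible to the expression, which matches the "up to 6 hours" of false positives described above.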