Skip to content

[ACM-27932] Remove offset 6h from ACMThanosCompactHasNotRun alert#2413

Open
subbarao-meduri wants to merge 1 commit intomainfrom
ACM-27932-fix-compact-alert-offset
Open

[ACM-27932] Remove offset 6h from ACMThanosCompactHasNotRun alert#2413
subbarao-meduri wants to merge 1 commit intomainfrom
ACM-27932-fix-compact-alert-offset

Conversation

@subbarao-meduri
Copy link
Copy Markdown
Collaborator

@subbarao-meduri subbarao-meduri commented Apr 16, 2026

Summary

  • Remove offset 6h from the ACMThanosCompactHasNotRun PromQL expression to align with the upstream Thanos alert definition and eliminate false positives after compactor recovery

Problem

The ACMThanosCompactHasNotRun alert expression uses offset 6h, which shifts the max_over_time lookback window 6 hours into the past (evaluating 6h–30h ago instead of 0h–24h ago). This creates a blind spot: after a compactor recovers from a crash loop or corrupted blocks, new successful uploads are invisible to the alert for up to 6 hours, causing false positive alerts even though the compactor is healthy.

Upstream references

Source of truth (jsonnet mixin):
https://github.com/thanos-io/thanos/blob/main/mixin/alerts/compact.libsonnet

Generated alerts:
https://github.com/thanos-io/thanos/blob/main/examples/alerts/alerts.yaml#L52

Summary by CodeRabbit

  • Bug Fixes
    • Improved Thanos compact upload alert logic to more accurately compute the time since the last successful upload by removing an internal retrospective offset. Alerts now compare the last upload timestamp directly to current time, reducing false positives and improving timeliness and reliability of compact upload monitoring across clusters.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ec1213d8-110e-4149-ac72-bc58c82e2702

📥 Commits

Reviewing files that changed from the base of the PR and between 7be3e38 and ecf304c.

📒 Files selected for processing (1)
  • operators/multiclusterobservability/manifests/base/alertmanager/prometheusrule.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • operators/multiclusterobservability/manifests/base/alertmanager/prometheusrule.yaml

📝 Walkthrough

Walkthrough

The ACMThanosCompactHasNotRun PrometheusRule alert expression was modified to remove offset 6h from the max_over_time(...) subexpression; it now evaluates the last successful upload timestamp against current time. The alert threshold (>24 hours) remains unchanged.

Changes

Cohort / File(s) Summary
Alert Expression Modification
operators/multiclusterobservability/manifests/base/alertmanager/prometheusrule.yaml
Removed offset 6h from the max_over_time() subexpression in the Thanos Compact alert rule so the last upload time is compared to the current time (threshold still >24h).

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

A six-hour ghost was tucked away,
We nudged the clock to live today.
Metrics breathe in present air,
Alerts sound true — we caught the snare. 🔍🚨✅

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and specifically describes the core change: removing offset 6h from the ACMThanosCompactHasNotRun alert expression. It maps cleanly to the actual modification in the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ACM-27932-fix-compact-alert-offset

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jacobbaungard
Copy link
Copy Markdown
Contributor

Doesn't this just change the time it takes for the alert to recover?

How does the alert work on newly installed systems? Without the offset, wouldn't it instantly alert on new installs, since it takes some times before the first blocks are uploaded?

Copy link
Copy Markdown
Contributor

@thibaultmg thibaultmg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 20, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: subbarao-meduri, thibaultmg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [subbarao-meduri,thibaultmg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@subbarao-meduri subbarao-meduri force-pushed the ACM-27932-fix-compact-alert-offset branch from 3cb2e49 to 30725aa Compare April 20, 2026 12:47
@openshift-ci openshift-ci Bot removed the lgtm label Apr 20, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 20, 2026

New changes are detected. LGTM label has been removed.

@subbarao-meduri subbarao-meduri force-pushed the ACM-27932-fix-compact-alert-offset branch 2 times, most recently from 7b99c27 to 7be3e38 Compare April 22, 2026 10:43
The offset 6h in the PromQL expression creates a 6-hour blind spot
that ignores recent successful uploads after compactor recovery,
causing false positive alerts. This deviates from the upstream Thanos
alert definition which does not use an offset. The 24h lookback window
in max_over_time already handles metric resets during pod restarts.

Signed-off-by: Subbarao Meduri <smeduri@redhat.com>
@subbarao-meduri subbarao-meduri force-pushed the ACM-27932-fix-compact-alert-offset branch from 7be3e38 to ecf304c Compare April 22, 2026 15:20
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 22, 2026

@subbarao-meduri: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/test-e2e ecf304c link true /test test-e2e

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants