
fix: bluegreen analysis prematurely succeeds if new ReplicaSet becomes unsaturated #4604

Merged
zachaller merged 3 commits into argoproj:master from jessesuen:fix/premature-promotion on Jan 30, 2026

Conversation

jessesuen (Member) commented Jan 28, 2026

Resolves #3724.

During blue-green rollout reconciliation, we have two methods that determine whether to skip and cancel the pre- and post-promotion analysis (skipPrePromotionAnalysisRun, skipPostPromotionAnalysisRun).

One of the checks in those methods is whether the new ReplicaSet is fully saturated. If it is not, the method returns true, meaning analysis should be skipped or cancelled. The purpose of this check is to prevent analysis from starting until the newRS is fully up and saturated. In the happy path, where Pods never go down after the ReplicaSet becomes saturated, this is not a problem.

However, if the new ReplicaSet ever becomes unsaturated after pre- or post-promotion analysis has already started, and while that analysis is still running, these methods return true, causing:

  1. The in-flight pre/post analysis to be cancelled
  2. The rollout to progress to the next stage, which often means completing the update prematurely

The checks are performed during every Rollout reconciliation, so whenever the new ReplicaSet happens to become unsaturated (e.g., due to normal node/pod churn), the rollout is promoted prematurely.
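
To make the failure mode concrete, here is a minimal sketch of the problematic shape of the check. It is not the controller's actual code: `replicaSetState`, `isSaturated`, and `skipAnalysisBeforeFix` are hypothetical names, and saturation is simplified to "available replicas >= desired replicas".

```go
package main

import "fmt"

// replicaSetState is a hypothetical, simplified stand-in for the fields
// the controller inspects on the new ReplicaSet.
type replicaSetState struct {
	DesiredReplicas   int32
	AvailableReplicas int32
}

// isSaturated reports whether every desired replica is available
// (an assumed simplification of the real saturation check).
func isSaturated(rs replicaSetState) bool {
	return rs.AvailableReplicas >= rs.DesiredReplicas
}

// skipAnalysisBeforeFix models the problematic behaviour: the decision
// depends only on saturation and ignores whether analysis is in flight.
func skipAnalysisBeforeFix(newRS replicaSetState) bool {
	return !isSaturated(newRS)
}

func main() {
	// Reconcile 1: all pods are up, so analysis is allowed to start.
	fmt.Println(skipAnalysisBeforeFix(replicaSetState{DesiredReplicas: 3, AvailableReplicas: 3}))
	// Reconcile 2: a pod went away mid-analysis; the same check now says "skip".
	fmt.Println(skipAnalysisBeforeFix(replicaSetState{DesiredReplicas: 3, AvailableReplicas: 2}))
}
```

Because the function is re-evaluated on every reconciliation, the second call is what cancels an analysis that the first call allowed to start.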

To reproduce this:

  1. Create a blue-green rollout with either pre- or post-promotion analysis, e.g., a Job analysis that sleeps for 5m and then exits 1
  2. Update the rollout and wait until the analysis is running
  3. Before the analysis completes, delete a pod of the newRS
  4. The newRS becomes unsaturated, the analysis is terminated, and the rollout update completes even though the analysis never completed successfully.
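
The reproduction steps above can also be walked through with a toy state machine. This is only an illustrative model under stated assumptions: the `rollout` struct and `reconcileBeforeFix` are hypothetical, and real reconciliation involves many more conditions than are shown here.

```go
package main

import "fmt"

// rollout is a hypothetical, heavily simplified model of blue-green state.
type rollout struct {
	desired, available int32 // replica counts of the new ReplicaSet
	analysisRunning    bool
	analysisSucceeded  bool
	promoted           bool
}

// reconcileBeforeFix models one reconciliation pass with the pre-fix behaviour.
func reconcileBeforeFix(r *rollout) {
	if r.promoted {
		return
	}
	saturated := r.available >= r.desired
	if !saturated {
		if r.analysisRunning {
			// The skip check returns true: the in-flight run is cancelled
			// and the rollout moves on as if analysis were not required.
			r.analysisRunning = false
			r.promoted = true
		}
		// Modelling assumption: before analysis starts, just wait for pods.
		return
	}
	if !r.analysisRunning && !r.analysisSucceeded {
		r.analysisRunning = true // saturated: analysis may start
		return
	}
	if r.analysisSucceeded {
		r.promoted = true
	}
}

func main() {
	r := &rollout{desired: 3, available: 3}
	reconcileBeforeFix(r) // step 2: the update is in progress and analysis starts
	r.available = 2       // step 3: a pod of the newRS is deleted
	reconcileBeforeFix(r) // step 4: analysis is cancelled and the rollout promotes
	fmt.Printf("promoted=%t analysisSucceeded=%t\n", r.promoted, r.analysisSucceeded)
	// prints "promoted=true analysisSucceeded=false"
}
```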

This change fixes the issue by no longer considering pod saturation of the new ReplicaSet once analysis has already started.
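
A rough sketch of the idea behind the fix, assuming a simplified model rather than the actual diff: `analysisContext`, its fields, and `skipAnalysisAfterFix` are hypothetical names, and the other skip conditions the real methods evaluate are omitted.

```go
package main

import "fmt"

// analysisContext is a hypothetical, simplified view of what the skip
// check needs to know.
type analysisContext struct {
	DesiredReplicas   int32
	AvailableReplicas int32
	AnalysisStarted   bool // true once an analysis run exists for this step
}

// skipAnalysisAfterFix models the fixed behaviour: saturation only gates
// the start of analysis and is ignored once analysis has started.
func skipAnalysisAfterFix(c analysisContext) bool {
	if c.AnalysisStarted {
		// Pod churn must not cancel a run that is already in flight.
		return false
	}
	return c.AvailableReplicas < c.DesiredReplicas
}

func main() {
	// Not yet saturated and no analysis yet: keep waiting (skip).
	fmt.Println(skipAnalysisAfterFix(analysisContext{DesiredReplicas: 3, AvailableReplicas: 2}))
	// A pod was deleted while analysis is running: do not skip or cancel.
	fmt.Println(skipAnalysisAfterFix(analysisContext{DesiredReplicas: 3, AvailableReplicas: 2, AnalysisStarted: true}))
}
```

With this shape, the saturation check still prevents analysis from starting too early, but the outcome of a started analysis is decided by the analysis itself.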

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

jessesuen changed the title from "fix: bluegreen analysis might premematurely succeed due to new ReplicaSet unsaturation" to "fix: bluegreen analysis prematurely succeeds if new ReplicaSet becomes unsaturated" Jan 28, 2026
…s unsaturated

Signed-off-by: Jesse Suen <jesse@akuity.io>
github-actions bot (Contributor) commented Jan 28, 2026

Published E2E Test Results

  4 files    4 suites   3h 26m 37s ⏱️
117 tests 107 ✅  7 💤 3 ❌
474 runs  440 ✅ 28 💤 6 ❌

For more details on these failures, see this check.

Results for commit 4696fce.

♻️ This comment has been updated with latest results.

github-actions bot (Contributor) commented Jan 28, 2026

Published Unit Test Results

2 391 tests   2 391 ✅  3m 4s ⏱️
  129 suites      0 💤
    1 files        0 ❌

Results for commit 4696fce.

♻️ This comment has been updated with latest results.

codecov bot commented Jan 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.38%. Comparing base (c6dfed2) to head (4696fce).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4604      +/-   ##
==========================================
- Coverage   84.40%   84.38%   -0.02%     
==========================================
  Files         164      164              
  Lines       18849    18855       +6     
==========================================
+ Hits        15909    15911       +2     
- Misses       2077     2079       +2     
- Partials      863      865       +2     


Signed-off-by: Jesse Suen <jesse@akuity.io>
@jessesuen jessesuen force-pushed the fix/premature-promotion branch from 705f2cf to 6237800 Compare January 28, 2026 01:38
Signed-off-by: Jesse Suen <jesse@akuity.io>
@zachaller zachaller merged commit 1a406c2 into argoproj:master Jan 30, 2026
31 of 32 checks passed
zachaller pushed a commit that referenced this pull request Jan 30, 2026
…s unsaturated (#4604)

* fix: bluegreen analysis prematurely succeeds if new ReplicaSet becomes unsaturated

Signed-off-by: Jesse Suen <jesse@akuity.io>

* fix: add unit tests

Signed-off-by: Jesse Suen <jesse@akuity.io>

* fix: function comment was wrong

Signed-off-by: Jesse Suen <jesse@akuity.io>

---------

Signed-off-by: Jesse Suen <jesse@akuity.io>
zachaller added the cherry-pick-completed label (Used once we have cherry picked the PR to all requested releases) Jan 30, 2026
@jessesuen jessesuen deleted the fix/premature-promotion branch January 31, 2026 09:14
jessesuen added a commit that referenced this pull request Feb 13, 2026
…s unsaturated (#4604)

jessesuen (Member, Author) commented

Fixed in 1.8.4 and upcoming 1.9.0


Labels

cherry-pick/release-1.8, cherry-pick/release-1.9, cherry-pick-completed (Used once we have cherry picked the PR to all requested releases)

Development

Successfully merging this pull request may close these issues.

In Blue-Green, Pre-Promotion jobs are getting terminated prematurely whenever application pod crashes and the whole rollout is marked successfull
