test: wait for readiness instead of sleeping in TestTrivialERS #96

Draft: cursor[bot] wants to merge 1 commit into main from cursor/vitess-ci-performance-358f



cursor bot commented Mar 17, 2026

Description

I inspected the latest successful main runs for both unit_test.yml and cluster_endtoend.yml in vitessio/vitess and found the cluster workflow was the real wall-clock bottleneck. In run 23057103687, the slowest job was ers_prs_newfeatures_heavy at about 24m17s, noticeably slower than any job in the latest successful unit-test run.

Within that shard, go/test/endtoend/reparent/emergencyreparent was a major contributor at about 8m21s, and TestTrivialERS alone took about 62.97s. The biggest avoidable cost in that test was eight unconditional time.Sleep(5 * time.Second) calls between repeated ERS invocations, which add up to 40s of fixed idle time per run.

This PR replaces those fixed sleeps with a readiness check that waits for:

  • exactly one healthy primary,
  • a valid topology state, and
  • working replication to all replicas.

That keeps the same end-to-end assertion intact (repeated ERS operations must leave the shard healthy and replicating) while removing idle wait time.

I also checked open PRs before making this change. I did not find an open PR already addressing TestTrivialERS or the ers_prs_newfeatures_heavy shard specifically; the closest related open PR I saw was #19622, which focuses on e2e teardown/orphaned processes rather than this test’s fixed waits.

Local validation:

  • go test -count=1 -timeout 15m -run '^TestTrivialERS$' vitess.io/vitess/go/test/endtoend/reparent/emergencyreparent
    • before: 61.383s
    • after: 31.282s
  • go test -count=1 -timeout 25m vitess.io/vitess/go/test/endtoend/reparent/emergencyreparent
    • after: 453.887s (7m33.887s)
    • reference CI package time from the inspected run: about 8m21.062s

Related Issue(s)

None.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

None.

AI Disclosure

This PR was written primarily by GPT-5.


Signed-off-by: Cursor Agent <cursoragent@cursor.com>