ci: split the slowest reparent cluster e2e shard #95

Draft
cursor[bot] wants to merge 1 commit into main from cursor/vitess-ci-performance-2ea1

@cursor cursor bot commented Mar 16, 2026

Description

I looked at the latest non-skipped successful main runs for both CI workflows and found the slowest job across them in cluster_endtoend.yml: ers_prs_newfeatures_heavy from run 23057103687, which took about 24m17s. In that job, go run test.go -shard ers_prs_newfeatures_heavy was running three independent reparent e2e packages serially in a single matrix entry:

  • emergencyreparent (~574.6s from the CI log timestamps)
  • newfeaturetest (~266.4s)
  • plannedreparent (~342.7s)

The slowest unit-test job in the corresponding successful main run was only ~16m12s, so this cluster shard was the real outlier.
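
As a rough cross-check of the timings above, the sketch below adds up the three package durations and compares the total with the job's wall-clock time. The numbers are copied from this description; the name `fmt` is a hypothetical helper, and treating the remainder as job overhead (checkout, build, setup, teardown) is an assumption, since the log does not break that time down here.

```python
# Package durations in seconds, as quoted in the PR description.
package_seconds = {
    "emergencyreparent": 574.6,
    "newfeaturetest": 266.4,
    "plannedreparent": 342.7,
}

# Wall-clock duration of the ers_prs_newfeatures_heavy job (~24m17s).
job_seconds = 24 * 60 + 17


def fmt(seconds: float) -> str:
    """Format a duration in seconds as e.g. '19m44s' (hypothetical helper)."""
    minutes, rest = divmod(round(seconds), 60)
    return f"{minutes}m{rest:02d}s"


# Time spent running the three packages back-to-back.
serial_test_seconds = sum(package_seconds.values())

# Assumption: whatever is left over is non-test overhead of the job.
overhead_seconds = job_seconds - serial_test_seconds

print("serial test time:", fmt(serial_test_seconds))
print("remaining job time:", fmt(overhead_seconds))
```

Under these numbers, the three packages account for most of the job's duration, which is why splitting them is the lever this PR pulls.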

This PR splits that one shard into three separate cluster shards by updating test/config.json:

  • emergencyreparent_heavy
  • plannedreparent_heavy
  • newfeatures_heavy

That keeps the exact same test coverage, but lets GitHub Actions schedule the packages independently instead of forcing them to run back-to-back inside one long job. Based on the upstream timings above, this should remove the 24-minute outlier shard and bring the reparent coverage closer to the duration of each individual package. In the inspected workflow run, that would take the bottleneck from ~24m17s down to the next-slowest shard at ~22m42s, while also making future balancing simpler.
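
The bottleneck estimate above can be sketched as a small calculation. It assumes, purely for illustration, that each new shard pays the same fixed overhead as the original job (job time minus the summed package times); real per-shard overhead may differ. The per-package figures and the ~22m42s next-slowest shard come from this description, and `fmt` is a hypothetical helper.

```python
# Durations in seconds, as quoted in the PR description.
package_seconds = {
    "emergencyreparent": 574.6,
    "newfeaturetest": 266.4,
    "plannedreparent": 342.7,
}
job_seconds = 24 * 60 + 17            # current slowest shard (~24m17s)
next_slowest_seconds = 22 * 60 + 42   # next-slowest shard in the same run (~22m42s)


def fmt(seconds: float) -> str:
    """Format a duration in seconds as e.g. '14m08s' (hypothetical helper)."""
    minutes, rest = divmod(round(seconds), 60)
    return f"{minutes}m{rest:02d}s"


# Assumption: each split shard pays the same fixed overhead as the original job.
overhead_seconds = job_seconds - sum(package_seconds.values())
split_shard_seconds = {
    name: seconds + overhead_seconds for name, seconds in package_seconds.items()
}

# After the split, the workflow's bottleneck is the slowest remaining shard.
slowest_split_seconds = max(split_shard_seconds.values())
new_bottleneck_seconds = max(next_slowest_seconds, slowest_split_seconds)
saved_seconds = job_seconds - new_bottleneck_seconds

print("slowest split shard (estimated):", fmt(slowest_split_seconds))
print("new bottleneck:", fmt(new_bottleneck_seconds))
print("critical-path saving:", fmt(saved_seconds))
```

Under that assumption every split shard finishes well before the ~22m42s shard, so the critical-path saving is limited to the gap between the old and the next-slowest shard until that shard is rebalanced as well.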

I also checked open PRs before making this change. The active CI-related PRs I found were about downgrade binaries and e2e teardown behavior, not this shard imbalance.

Related Issue(s)

None.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

No deployment impact. This is a CI-only scheduling change.

AI Disclosure

This PR was written with AI assistance. I used AI to inspect the workflow timings, prepare the shard split, and draft this PR description.

Local validation

  • source build.env && go run test.go -docker=false -skip-build -dry-run -shard emergencyreparent_heavy
  • source build.env && go run test.go -docker=false -skip-build -dry-run -shard plannedreparent_heavy
  • source build.env && go run test.go -docker=false -skip-build -dry-run -shard newfeatures_heavy
  • source build.env && go run test.go -docker=false -skip-build -follow -shard plannedreparent_heavy ✅ (PASS Package ... plannedreparent in ~5m33s locally)
  • source build.env && go run test.go -docker=false -skip-build -follow -shard newfeatures_heavy failed locally in TestSemiSyncBlockDueToDisruption because of an existing MySQL 8.4 incompatibility (stop slave syntax); this failure is unrelated to the shard split itself.

Signed-off-by: Cursor Agent <cursoragent@cursor.com>