ci: split the slowest reparent cluster e2e shard by cursor[bot] · Pull Request #95 · planetscale/vitess

cursor · 2026-03-16T21:23:20Z

Description

I looked at the latest non-skipped successful main runs for both CI workflows and found the slowest job across them in cluster_endtoend.yml: ers_prs_newfeatures_heavy from run 23057103687, which took about 24m17s. In that job, go run test.go -shard ers_prs_newfeatures_heavy was running three independent reparent e2e packages serially in a single matrix entry:

emergencyreparent (~574.6s from the CI log timestamps)
newfeaturetest (~266.4s)
plannedreparent (~342.7s)

The slowest unit-test job in the corresponding successful main run was only ~16m12s, so this cluster shard was the real outlier.
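As a quick sanity check on those numbers, the arithmetic can be sketched as follows. The package durations and the job duration are the ones quoted above. The remainder is simply what is left after subtracting them, not a measured value, and the `fmt` helper is only for illustration.

```python
# Package durations in seconds, as quoted above from the CI log timestamps.
package_seconds = {
    "emergencyreparent": 574.6,
    "newfeaturetest": 266.4,
    "plannedreparent": 342.7,
}

# Duration of the whole ers_prs_newfeatures_heavy job (~24m17s).
job_seconds = 24 * 60 + 17

# The three packages ran serially, so their times add up.
serial_test_seconds = sum(package_seconds.values())

# Whatever remains is time the job spent outside these three packages.
# This is derived by subtraction only; it was not measured separately.
other_seconds = job_seconds - serial_test_seconds


def fmt(seconds):
    """Format a duration in seconds as e.g. 19m44s (rounded to whole seconds)."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m{secs:02d}s"


print("serial package time:", fmt(serial_test_seconds))
print("remaining job time: ", fmt(other_seconds))
```

Under these figures the three packages account for roughly 19m44s of the job, which leaves about 4m33s for everything else in the job.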

This PR splits that one shard into three separate cluster shards by updating test/config.json:

emergencyreparent_heavy
plannedreparent_heavy
newfeatures_heavy
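The shape of that change can be sketched as a shard reassignment. This is a minimal illustration, not the actual diff: the `Tests` layout, the entry names, the `Shard` field, and the package-to-shard mapping below are assumptions made for the example, and the real test/config.json entries carry more fields.

```python
import copy

# Hypothetical excerpt of test/config.json. Entry names and layout are
# assumptions for illustration; only the shard names come from this PR.
config = {
    "Tests": {
        "emergencyreparent": {"Shard": "ers_prs_newfeatures_heavy"},
        "plannedreparent": {"Shard": "ers_prs_newfeatures_heavy"},
        "newfeaturetest": {"Shard": "ers_prs_newfeatures_heavy"},
        "some_other_test": {"Shard": "some_other_shard"},
    }
}

# Assumed mapping from each test entry to its new dedicated shard.
new_shard_by_test = {
    "emergencyreparent": "emergencyreparent_heavy",
    "plannedreparent": "plannedreparent_heavy",
    "newfeaturetest": "newfeatures_heavy",
}


def split_shard(cfg, old_shard, mapping):
    """Return a copy of cfg where tests in old_shard move to their mapped shard."""
    result = copy.deepcopy(cfg)
    for name, entry in result["Tests"].items():
        if entry["Shard"] == old_shard:
            # Fail loudly if a test in the old shard has no new home,
            # so no coverage is silently dropped.
            entry["Shard"] = mapping[name]
    return result


updated = split_shard(config, "ers_prs_newfeatures_heavy", new_shard_by_test)
for name, entry in updated["Tests"].items():
    print(name, "->", entry["Shard"])
```

Raising a `KeyError` for an unmapped test is a deliberate choice in this sketch: a split should move every test out of the old shard rather than leave one behind.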

That keeps the exact same test coverage, but lets GitHub Actions schedule the packages independently instead of forcing them to run back-to-back inside one long job. Based on the upstream timings above, this should remove the 24-minute outlier shard and bring each reparent job's duration closer to that of the single package it runs. In the inspected workflow run, the bottleneck would drop from ~24m17s to the next-slowest shard at ~22m42s, and future rebalancing becomes simpler.
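A rough estimate of the effect, using only the durations quoted in this description, might look like the sketch below. It assumes, pessimistically, that each split job still pays the full non-package time of the original job. That is an assumption for the estimate, not something measured.

```python
# Durations quoted in this description, in seconds.
package_seconds = {
    "emergencyreparent": 574.6,
    "newfeaturetest": 266.4,
    "plannedreparent": 342.7,
}
old_job_seconds = 24 * 60 + 17       # ers_prs_newfeatures_heavy, ~24m17s
next_slowest_seconds = 22 * 60 + 42  # next-slowest shard, ~22m42s

# Assumption: every split job repeats all of the original job's
# non-package time (derived by subtraction, not measured).
overhead_seconds = old_job_seconds - sum(package_seconds.values())
split_estimates = {
    name: secs + overhead_seconds for name, secs in package_seconds.items()
}

# The workflow bottleneck is the slowest job that remains after the split.
new_bottleneck_seconds = max(next_slowest_seconds, *split_estimates.values())
saved_seconds = old_job_seconds - new_bottleneck_seconds

slowest_split = max(split_estimates, key=split_estimates.get)
print("slowest split job:", slowest_split, round(split_estimates[slowest_split]), "s")
print("new bottleneck:", new_bottleneck_seconds, "s")
print("saved on the critical path:", saved_seconds, "s")
```

Even under that pessimistic assumption, the slowest split job (emergencyreparent, roughly 848s) stays well below the ~22m42s shard, so the estimated critical-path saving is about 95 seconds.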

I also checked open PRs before making this change. The active CI-related PRs I found were about downgrade binaries and e2e teardown behavior, not this shard imbalance.

Related Issue(s)

None.

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

Deployment Notes

No deployment impact. This is a CI-only scheduling change.

AI Disclosure

This PR was written with AI assistance. I used AI to inspect the workflow timings, prepare the shard split, and draft this PR description.

Local validation

source build.env && go run test.go -docker=false -skip-build -dry-run -shard emergencyreparent_heavy
source build.env && go run test.go -docker=false -skip-build -dry-run -shard plannedreparent_heavy
source build.env && go run test.go -docker=false -skip-build -dry-run -shard newfeatures_heavy
source build.env && go run test.go -docker=false -skip-build -follow -shard plannedreparent_heavy ✅ (PASS Package ... plannedreparent in ~5m33s locally)
source build.env && go run test.go -docker=false -skip-build -follow -shard newfeatures_heavy hit a pre-existing local MySQL 8.4 compatibility failure in TestSemiSyncBlockDueToDisruption (it uses STOP SLAVE syntax, which MySQL 8.4 no longer accepts); this is unrelated to the shard split itself.

Signed-off-by: Cursor Agent <cursoragent@cursor.com>

ci: split slow reparent e2e shard

3c49f16

Signed-off-by: Cursor Agent <cursoragent@cursor.com>

cursor[bot] wants to merge 1 commit into main from cursor/vitess-ci-performance-2ea1
