ci: split the slowest reparent cluster e2e shard #95
Draft
cursor[bot] wants to merge 1 commit into main
Signed-off-by: Cursor Agent <cursoragent@cursor.com>
Description
I looked at the latest non-skipped successful `main` runs for both CI workflows and found the slowest job across them in `cluster_endtoend.yml`: `ers_prs_newfeatures_heavy` from run `23057103687`, which took about 24m17s. In that job, `go run test.go -shard ers_prs_newfeatures_heavy` was running three independent reparent e2e packages serially in a single matrix entry:

- `emergencyreparent` (~574.6s from the CI log timestamps)
- `newfeaturetest` (~266.4s)
- `plannedreparent` (~342.7s)

The slowest unit-test job in the corresponding successful `main` run was only ~16m12s, so this cluster shard was the real outlier.

This PR splits that one shard into three separate cluster shards by updating `test/config.json`:

- `emergencyreparent_heavy`
- `plannedreparent_heavy`
- `newfeatures_heavy`

That keeps the exact same test coverage, but lets GitHub Actions schedule the packages independently instead of forcing them to run back-to-back inside one long job. Based on the upstream timings above, this should remove the 24-minute outlier shard and bring the reparent coverage closer to the duration of each individual package. In the inspected workflow run, that would take the bottleneck from ~24m17s down to the next-slowest shard at ~22m42s, while also making future balancing simpler.
I also checked open PRs before making this change. The active CI-related PRs I found were about downgrade binaries and e2e teardown behavior, not this shard imbalance.
Related Issue(s)
None.
Checklist
Deployment Notes
No deployment impact. This is a CI-only scheduling change.
AI Disclosure
This PR was written with AI assistance. I used AI to inspect the workflow timings, prepare the shard split, and draft this PR description.
Local validation
- `source build.env && go run test.go -docker=false -skip-build -dry-run -shard emergencyreparent_heavy`
- `source build.env && go run test.go -docker=false -skip-build -dry-run -shard plannedreparent_heavy`
- `source build.env && go run test.go -docker=false -skip-build -dry-run -shard newfeatures_heavy`
- `source build.env && go run test.go -docker=false -skip-build -follow -shard plannedreparent_heavy` ✅ (`PASS Package ... plannedreparent` in ~5m33s locally)
- `source build.env && go run test.go -docker=false -skip-build -follow -shard newfeatures_heavy` hit an existing local MySQL 8.4 compatibility failure in `TestSemiSyncBlockDueToDisruption` (`stop slave` syntax), which is unrelated to the shard split itself.
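For reviewers who want to check the shard assignment without running `test.go`, a rough offline approximation is to group config entries by shard and confirm that each new shard holds exactly one package. The sketch below assumes a simplified config shape (a `Tests` map whose entries carry a `Shard` field). That shape, the test names in the sample, and the helper names are assumptions for illustration and may not match the real `test/config.json` schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// testEntry and config are an ASSUMED, simplified view of the config file;
// the real schema may carry more fields or differ.
type testEntry struct {
	Shard string `json:"Shard"`
}

type config struct {
	Tests map[string]testEntry `json:"Tests"`
}

// testsByShard groups test names by shard, with names sorted for stable output.
func testsByShard(raw []byte) (map[string][]string, error) {
	var cfg config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	out := make(map[string][]string)
	for name, t := range cfg.Tests {
		out[t.Shard] = append(out[t.Shard], name)
	}
	for _, names := range out {
		sort.Strings(names)
	}
	return out, nil
}

// sample is a hypothetical fragment; only the shard names come from this PR.
const sample = `{
  "Tests": {
    "emergencyreparent": {"Shard": "emergencyreparent_heavy"},
    "plannedreparent":   {"Shard": "plannedreparent_heavy"},
    "newfeaturetest":    {"Shard": "newfeatures_heavy"}
  }
}`

func main() {
	shards, err := testsByShard([]byte(sample))
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, len(shards))
	for s := range shards {
		names = append(names, s)
	}
	sort.Strings(names)
	for _, s := range names {
		fmt.Printf("%s: %v\n", s, shards[s])
	}
}
```

This only mirrors the intent of the dry-run commands above; the dry runs themselves remain the authoritative check.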