test: split Online DDL vreplication stress suite shard#98

Draft
cursor[bot] wants to merge 1 commit into main from cursor/vitess-ci-performance-eee7

cursor bot commented Mar 18, 2026

Description

I inspected the latest successful main runs in vitessio/vitess with gh:

  • unit_test.yml: run 23124032519
  • cluster_endtoend.yml: run 23246552588

Across those two workflows, the slowest cluster shard was ers_prs_newfeatures_heavy at 1434s, but open performance PRs in planetscale/vitess already target that exact hotspot (#95, #96, #97). To avoid duplicating active work, I moved to the next unresolved outlier in the same run: Run endtoend tests on Cluster (onlineddl_vrepl_stress_suite) at 1432s.

For that unresolved shard, the upstream CI timings show:

  • total job time: 1432s
  • Setup MySQL: 955s
  • Run cluster endtoend test: 406s
  • slowest test in the shard: go/test/endtoend/onlineddl/vrepl_stress_suite.TestVreplStressSchemaChanges at 375.60s

The core problem is that the shard runs one very large top-level e2e test with dozens of schema-change stress cases in a single matrix entry, so the whole package stays serialized inside one job.

This change keeps the exact same end-to-end assertions and package, but splits that long top-level test into two balanced top-level tests and maps them to two independent CI shards:

  • onlineddl_vrepl_stress_suite_group1
  • onlineddl_vrepl_stress_suite_group2

That preserves coverage while letting GitHub Actions schedule the two halves in parallel. Based on the upstream subtest timings from the inspected successful run, the two groups carry roughly 188s and 186s of test body time respectively, which should bring this shard below its current ~24m (1432s) total and remove it as the slowest unresolved cluster shard.
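As a rough check on the expected effect, the sketch below projects each new shard's wall time from the figures quoted above. It assumes that everything outside the stress test body (MySQL setup, cluster startup, and other job steps) is paid again in full by each shard; that assumption is not measured here, and `estimateShardSeconds` is a hypothetical helper.

```go
package main

import "fmt"

// estimateShardSeconds projects the wall time of a split shard, assuming
// every cost except the stress test body is unchanged from the original job.
func estimateShardSeconds(jobTotal, oldTestBody, newTestBody float64) float64 {
	fixedOverhead := jobTotal - oldTestBody
	return fixedOverhead + newTestBody
}

func main() {
	const (
		jobTotal    = 1432.0 // total job time of the original shard, in seconds
		oldTestBody = 375.60 // TestVreplStressSchemaChanges, in seconds
	)
	// Approximate test body time per group, as quoted in the description.
	for i, body := range []float64{188, 186} {
		est := estimateShardSeconds(jobTotal, oldTestBody, body)
		fmt.Printf("group%d: about %.0fs (%.1f min)\n", i+1, est, est/60)
	}
}
```

Under that assumption each shard lands near 1244s, or a little under 21 minutes, compared with roughly 24 minutes for the original shard.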

Related Issue(s)

None.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

No deployment impact. This only changes CI scheduling for an existing e2e package.

Validation

I rebuilt the Vitess binaries and then ran each new shard through test.go with local e2e dependencies installed:

source build.env
rm -rf "$VTDATAROOT"/vtroot_*
go run test.go -docker=false -skip-build -follow -shard onlineddl_vrepl_stress_suite_group1

source build.env
rm -rf "$VTDATAROOT"/vtroot_*
go run test.go -docker=false -skip-build -follow -shard onlineddl_vrepl_stress_suite_group2

Results:

  • onlineddl_vrepl_stress_suite_group1: PASS Package ... (4m52.575s)
  • onlineddl_vrepl_stress_suite_group2: PASS Package ... (4m25.16s)

AI Disclosure

This PR was authored with GPT-5 assistance, using upstream CI log analysis plus local validation.

Signed-off-by: Cursor Agent <cursoragent@cursor.com>

Co-authored-by: Mohamed Hamza <mhamza@fastmail.com>