test: split onlineddl vrepl suite shard#99

Draft
cursor[bot] wants to merge 1 commit into main from cursor/vitess-test-optimization-3933


cursor bot commented Mar 19, 2026

Description

I inspected the latest successful main runs in vitessio/vitess using gh:

  • unit_test.yml: run 23294293582
  • cluster_endtoend.yml: run 23294293585

Across those two workflows, the slowest job was Run endtoend tests on Cluster (onlineddl_vrepl_suite) at 34m22s. The slowest unit-test job in the same investigation window was only 15m38s, so this cluster shard was the real bottleneck.

From the upstream job log for onlineddl_vrepl_suite:

  • Setup MySQL: 15m27s
  • Run cluster endtoend test: 17m51s
  • package time: go/test/endtoend/onlineddl/vrepl_suite in 17m20.031s
  • slowest top-level test: TestVreplSuiteSchemaChanges in 1023.01s

The root cause is that go/test/endtoend/onlineddl/vrepl_suite puts 129 schema-change cases behind a single top-level e2e test, so all of that work is serialized inside one shard.

I also checked open PRs in both vitessio/vitess and planetscale/vitess before making this change. There is active draft work for onlineddl_vrepl_stress_suite, but nothing open for onlineddl_vrepl_suite itself.

This change keeps the same end-to-end coverage but splits TestVreplSuiteSchemaChanges into two balanced top-level tests and maps them to two independent CI shards:

  • onlineddl_vrepl_suite_group1
  • onlineddl_vrepl_suite_group2

Using the upstream per-subtest timings from the inspected successful run, the two groups balance almost exactly, at 510.94s and 511.03s of test body time. That should reduce the current ~34m outlier shard to roughly 24m per shard in CI, which brings it below the current vtorc hotspot while preserving the existing assertions.

Related Issue(s)

None.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

No deployment impact. This only changes CI scheduling for an existing end-to-end suite.

AI Disclosure

This PR was authored with GPT-5 assistance. I used AI to inspect the upstream CI timings, prepare the shard split, run the local validation, and draft this PR description.

Local validation

After building Vitess and installing the same local e2e dependencies the cluster workflow expects, I ran both new shards through the Vitess e2e harness with a clean VTDATAROOT before each run:

source build.env
rm -rf "$VTDATAROOT"/vtroot_*
go run test.go -docker=false -skip-build -follow -shard onlineddl_vrepl_suite_group1

source build.env
rm -rf "$VTDATAROOT"/vtroot_*
go run test.go -docker=false -skip-build -follow -shard onlineddl_vrepl_suite_group2

Results:

  • onlineddl_vrepl_suite_group1: PASS Package ... (8m45.028s) / local.onlineddl_vrepl_suite_group1: PASSED in 8m46.9s
  • onlineddl_vrepl_suite_group2: PASS Package ... (8m49.361s) / local.onlineddl_vrepl_suite_group2: PASSED in 8m51.6s

Signed-off-by: Cursor Agent <cursoragent@cursor.com>

Co-authored-by: Mohamed Hamza <mhamza@fastmail.com>
