Skip to content

test: system test load testing, short/extended CI split#26108

Open
aljo242 wants to merge 29 commits intomainfrom
test/load
Open

test: system test load testing, short/extended CI split#26108
aljo242 wants to merge 29 commits intomainfrom
test/load

Conversation

@aljo242
Copy link
Copy Markdown
Contributor

@aljo242 aljo242 commented Mar 17, 2026

  • Add LoadTestBroadcaster for in-process tx signing/broadcast (gRPC+RPC)
  • Add TestHeavyLoadLight (10k txs) and TestHeavyLoad (env-gated)
  • Split CI: short suite (~5-7min) on PRs, extended (~30min) on main
  • Add testing.Short() guards to slow tests (upgrade, stability, protocolpool)
  • Fix LoadTestBroadcaster gRPC connection leak (Close method)
  • Fix TPS division-by-zero when single block has txs
  • Add TestNodePauseResume short guard

- Add LoadTestBroadcaster for in-process tx signing/broadcast (gRPC+RPC)
- Add TestHeavyLoadLight (10k txs) and TestHeavyLoad (env-gated)
- Split CI: short suite (~5-7min) on PRs, extended (~30min) on main
- Add testing.Short() guards to slow tests (upgrade, stability, protocolpool)
- Fix LoadTestBroadcaster gRPC connection leak (Close method)
- Fix TPS division-by-zero when single block has txs
- Add TestNodePauseResume short guard

Made-with: Cursor
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 17, 2026

Greptile Summary

This PR adds a programmatic in-process load testing framework (LoadTestBroadcaster) and three new load test variants (Mini/Light/Heavy), then splits CI into a short suite (~5–7 min on PRs) and an extended suite (~30 min on merges to main) by adding testing.Short() guards to slow tests. The core abstractions are well-designed and the account-number caching with double-checked locking is correct, but there are a few issues worth addressing before merging:

  • Goroutine safety bug in runProgrammaticLoadTest: The progress-reporter goroutine is not awaited after close(done). If require.Greater fails (via runtime.Goexit) while the goroutine is executing t.Logf, Go's testing runtime will panic with t.Logf called after test finished. A sync.WaitGroup around the progress goroutine is needed.
  • Unbounded process spawning in TestHeavyLoad: Each batch launches 200 × 100 = 20,000 goroutines, each forking a simd subprocess, with no concurrency cap. This risks hitting OS PID limits and silently inflating skipped-tx counts. The runProgrammaticLoadTest path correctly uses a bounded worker pool — the same pattern should be applied here.
  • Quadratic log scanning in TestHeavyLoad: Each batch re-reads every node log file from offset 0 and scans the full content for "CONSENSUS FAILURE". As logs grow, this becomes O(batches × file_size) in I/O and will also re-fire the fatalf on every subsequent batch after a failure is first detected.
  • CI concurrency misconfiguration: The top-level cancel-in-progress: true concurrency group is shared between test-sdk-system (short, PR) and test-sdk-system-extended (long, main). Rapid pushes to main can cancel the 35-minute extended run before it completes, silently losing extended test signal.

Confidence Score: 2/5

  • Not safe to merge as-is — the goroutine leak in runProgrammaticLoadTest can cause test panics, and the CI concurrency config can silently suppress the extended suite on main.
  • Two issues can cause concrete failures in CI: (1) the progress goroutine can panic the test runner when an assertion fails after close(done), affecting both TestHeavyLoadMini and TestHeavyLoadLight; (2) the shared cancel-in-progress concurrency group means the extended suite can be killed before completing on every push to main. The unbounded-goroutine issue in TestHeavyLoad is gated behind an env var but is still a latent reliability problem.
  • tests/systemtests/load_test.go (goroutine safety, unbounded processes, log scanning) and .github/workflows/systemtests.yml (concurrency group configuration)

Important Files Changed

Filename Overview
tests/systemtests/load_test.go New load test file with three test variants (Mini/Light/Heavy). Contains a goroutine leak where the progress reporter can call t.Logf after the test finishes, unbounded goroutine/process spawning in TestHeavyLoad (20k concurrent subprocesses per batch), and a quadratic log file re-scan across batches.
.github/workflows/systemtests.yml Splits CI into short (PR) and extended (main) suites. The top-level cancel-in-progress: true shares one concurrency group across both jobs, meaning rapid pushes to main can cancel the 30-minute extended suite before it completes, silently losing test signal.
tools/systemtests/broadcast.go New in-process tx broadcaster for load tests. Correctly implements account number caching with double-checked locking, gRPC connection lifecycle (Close), and millisecond-scale unique timeout offsets for unordered txs. No critical issues found.
tools/systemtests/rpc_client.go Adds Block and TxByHash helpers to RPCClient, both delegating directly to the CometBFT client. Clean and straightforward additions with proper 0x-prefix hex stripping.
tools/systemtests/system.go Adds OutputDir and ChainID accessor methods to SystemUnderTest. Minimal, correct change.

Last reviewed commit: "fix(systemtests): co..."

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.86%. Comparing base (d6b77e2) to head (521b98a).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #26108      +/-   ##
==========================================
- Coverage   64.97%   64.86%   -0.11%     
==========================================
  Files         873      874       +1     
  Lines       57419    57559     +140     
==========================================
+ Hits        37306    37335      +29     
- Misses      20113    20224     +111     

see 4 files with indirect coverage changes

Impacted file tree graph

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@aljo242 aljo242 enabled auto-merge March 19, 2026 17:11
defer wg.Done()
for j := range jobs {
bc := broadcasters[j.nodeAddr]
txHash, code, err := bc.BroadcastBankSendUnordered(j.senderName, j.receiverAddr, "10stake", "1stake", j.idx)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm it would be nice to figure out how we can do regular txs, there is some overhead with unordered that has a non-negligible effect on block timing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah shiii maybe mix of both? if they all do not have nonce conflicts we could

aljo242 and others added 10 commits March 19, 2026 17:10
Co-authored-by: Tyler <48813565+technicallyty@users.noreply.github.com>
Co-authored-by: Tyler <48813565+technicallyty@users.noreply.github.com>
Co-authored-by: Tyler <48813565+technicallyty@users.noreply.github.com>
Co-authored-by: Tyler <48813565+technicallyty@users.noreply.github.com>
Co-authored-by: Tyler <48813565+technicallyty@users.noreply.github.com>
Co-authored-by: Tyler <48813565+technicallyty@users.noreply.github.com>
Co-authored-by: Tyler <48813565+technicallyty@users.noreply.github.com>
Co-authored-by: Tyler <48813565+technicallyty@users.noreply.github.com>
@aljo242 aljo242 added the backport/v0.54.x Backport PR's to release/v0.54.x branch label Mar 20, 2026
@aljo242
Copy link
Copy Markdown
Contributor Author

aljo242 commented Mar 20, 2026

@greptile re-review and give me a new score

Set account number and sequence for ordered broadcasts so BroadcastBankSend does not sign with default zero values, and cache unordered account numbers per sender to avoid redundant account lookups during load tests. Also align the mini load test comment with its actual tx count.

Made-with: Cursor
@aljo242
Copy link
Copy Markdown
Contributor Author

aljo242 commented Mar 20, 2026

@greptile i updated - can you re-review and give new score

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport/v0.54.x Backport PR's to release/v0.54.x branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants