Skip to content

fix(bench): improve benchmark stability and output formatting#2597

Merged
iduartgomez merged 2 commits intomainfrom
claude/fix-benchmark-stability-faYpJ
Jan 5, 2026
Merged

fix(bench): improve benchmark stability and output formatting#2597
iduartgomez merged 2 commits intomainfrom
claude/fix-benchmark-stability-faYpJ

Conversation

@iduartgomez
Copy link
Copy Markdown
Collaborator

Fixes three issues with benchmark reports:

  1. ANSI escape codes in output - Cargo warnings with color codes were
    polluting markdown reports. Fixed by:

    • Adding strip_ansi_codes() to parse_bench_output.py
    • Using --color=never in both workflow files
    • Adding sed fallback stripping in workflow error handling
  2. Benchmark panic during warmup - ConnectionClosed errors in warmup
    phase caused benchmark crashes. Fixed by:

    • Replacing .unwrap() with graceful error handling in slow_start.rs
    • Making ConnectedPeerPair::warmup() return completion count instead
      of panicking on errors
  3. Extended benchmarks not running - transport_extended was missing from
    Cargo.toml [[bench]] configuration, causing cargo to run test harness
    instead of criterion. Fixed by adding proper bench configuration with
    harness = false.

@iduartgomez iduartgomez marked this pull request as draft January 4, 2026 23:41
This commit consolidates multiple improvements to the benchmark infrastructure:

Benchmark Stability Fixes:
- Use small warmup messages (1000 bytes) to avoid cwnd exhaustion
- Add proper error handling in all benchmark functions
- Increase measurement times to collect sufficient samples
- Fix doc comment warnings (/// -> //) on criterion macros

Parser Improvements (parse_bench_output.py):
- Fix time extraction to properly get middle value with unit
- Add benchmark name validation (must contain '/')
- Filter out criterion output lines from benchmark name detection
- Add severe regression detection (>25% on throughput benchmarks)
- Exit code 2 for severe regressions, 1 for regular regressions

CI Workflow Updates:
- Add "Fail on Severe Regressions" step for nightly/main builds
- Parse exit codes to distinguish severity levels
- PRs get informational comments only, not failures

New Benchmarks:
- Add RTT scenario benchmarks (0ms, 5ms, 10ms, 25ms, 50ms delays)

Removed:
- transport_direct.rs (redundant after criterion fixes)
@iduartgomez iduartgomez force-pushed the claude/fix-benchmark-stability-faYpJ branch from 403e0ce to 337a18a Compare January 5, 2026 16:32
Document known gaps in transport benchmark coverage that could allow
regressions to go undetected:

Critical gaps:
- Memory usage tracking (none)
- Bidirectional simultaneous transfer (none)
- Connection churn / many short connections (none)
- High RTT paths 100ms+ (partial - only up to 50ms)

Medium priority gaps:
- Timeout/retransmission behavior (indirect only)
- Multiple concurrent connections from different peers
- Backpressure / flow control behavior

Also documents potential for virtual time benchmarks using Criterion's
custom Measurement trait, which could make high-RTT and timeout testing
10-100x faster.
@iduartgomez iduartgomez marked this pull request as ready for review January 5, 2026 16:54
@iduartgomez iduartgomez merged commit 7a4bb93 into main Jan 5, 2026
8 of 9 checks passed
@iduartgomez iduartgomez deleted the claude/fix-benchmark-stability-faYpJ branch January 5, 2026 16:55
@sanity sanity mentioned this pull request Jan 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants