
Add operation count display and regression testing to benchmarks #154

Merged
tjstienstra merged 3 commits into main from copilot/use-benchmark-results on Feb 1, 2026

Conversation

Contributor

Copilot AI commented Jan 10, 2026

Implementation Complete ✅

Requirements Analysis

Based on the issue, I need to:

  • Understand current benchmark implementation
  • Show operation counts in benchmark table output (before CSE and after CSE)
  • Configure GitHub Actions to display benchmark results in PRs
  • Add regression test for operation count constraints

Rebased on Main Branch

  • Branch is now rebased on main instead of carrying a manual UV migration commit
  • All commits cleanly applied on top of the UV migration from main
  • Removed redundant "Initial plan" commit
  • Combined all features into logical commits

Final Implementation Summary

Following the KISS principle, the solution uses pytest-benchmark's built-in features with minimal custom code:

1. Operation Count Display (~50 lines)

  • Simple pytest_terminal_summary hook in benchmarks/conftest.py
  • Displays clean table with #Ops and #Ops (CSE) after standard results
  • Automatically groups results by benchmark group (a minimal sketch of the hook follows below)
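
For illustration, a minimal sketch of such a hook, with hypothetical names (`OP_COUNTS`, `record_ops`) and grouping omitted for brevity; the PR's actual `conftest.py` differs in detail:

```python
# benchmarks/conftest.py -- illustrative sketch, not the PR's actual code.
# OP_COUNTS and record_ops are hypothetical helpers: each benchmark is assumed
# to register its counts here after building the equations of motion.
OP_COUNTS: dict[str, tuple[int, int]] = {}  # name -> (#Ops, #Ops after CSE)


def record_ops(name: str, ops: int, ops_cse: int) -> None:
    """Register the operation counts of one benchmark."""
    OP_COUNTS[name] = (ops, ops_cse)


def pytest_terminal_summary(terminalreporter, exitstatus, config):
    """Print an operation-count table after the standard benchmark output."""
    if not OP_COUNTS:
        return
    terminalreporter.write_sep("-", "operation counts")
    terminalreporter.write_line(f"{'Name':<45}{'#Ops':>10}{'#Ops (CSE)':>12}")
    for name, (ops, ops_cse) in sorted(OP_COUNTS.items()):
        terminalreporter.write_line(f"{name:<45}{ops:>10}{ops_cse:>12}")
```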

2. Regression Testing (~50 lines)

  • Compares against .benchmarks/baseline.json
  • Fails if #Ops (CSE) increases
  • Clear error messages reporting the percentage increase (see the sketch after this list)
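
A minimal sketch of the check, assuming `baseline.json` simply maps benchmark names to their post-CSE operation counts (the PR's actual file layout may differ):

```python
import json
from pathlib import Path

BASELINE_FILE = Path(".benchmarks/baseline.json")  # assumed flat {name: count} layout


def assert_no_ops_regression(name: str, ops_cse: int) -> None:
    """Fail if the post-CSE operation count exceeds the stored baseline."""
    baseline = json.loads(BASELINE_FILE.read_text())
    expected = baseline.get(name)
    if expected is None:
        return  # no baseline recorded for this benchmark yet
    if ops_cse > expected:
        pct = 100 * (ops_cse - expected) / expected
        raise AssertionError(
            f"{name}: #Ops (CSE) increased from {expected} to {ops_cse} (+{pct:.1f}%)"
        )
```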

3. GitHub Actions Integration

  • Uses UV for package management (from main branch)
  • Uses --benchmark-autosave for automatic result saving
  • Uses --benchmark-json for PR comments
  • Posts a markdown table (a rendering sketch follows this list) with:
    • Current benchmark results
    • Baseline values for comparison (added)
    • Visual change indicators (✅ for improvements/no change, ⚠️ for regressions)
    • Absolute and percentage differences (added)
  • Uploads artifacts for historical tracking
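
As a sketch, the comment body could be rendered from the `--benchmark-json` output roughly as follows, assuming the operation counts are stored under `ops`/`ops_cse` keys in each benchmark's `extra_info` (those keys, and `build_comment`, are assumptions; the workflow's actual script may differ):

```python
import json


def build_comment(results_path: str, baseline: dict[str, int]) -> str:
    """Render the PR comment as a markdown table from a --benchmark-json file."""
    with open(results_path) as f:
        data = json.load(f)
    rows = [
        "| Name | Mean (ms) | #Ops | #Ops (CSE) | Baseline #Ops (CSE) | Change |",
        "| --- | --- | --- | --- | --- | --- |",
    ]
    for bench in data["benchmarks"]:
        name = bench["name"]
        mean_ms = bench["stats"]["mean"] * 1000  # pytest-benchmark reports seconds
        ops = bench["extra_info"]["ops"]  # "ops"/"ops_cse" keys are assumptions
        ops_cse = bench["extra_info"]["ops_cse"]
        base = baseline.get(name, ops_cse)
        diff = ops_cse - base
        if diff > 0:
            change = f"⚠️ +{diff} (+{100 * diff / base:.1f}%)"
        elif diff < 0:
            change = f"✅ {diff} ({100 * diff / base:.1f}%)"
        else:
            change = "✅ No change"
        rows.append(f"| {name} | {mean_ms:.2f} | {ops} | {ops_cse} | {base} | {change} |")
    return "📊 Benchmark Results\n\n" + "\n".join(rows)
```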

Example PR Comment Table

| Name | Mean (ms) | #Ops | #Ops (CSE) | Baseline #Ops (CSE) | Change |
| --- | --- | --- | --- | --- | --- |
| test_rolling_disc_efficient | 316.38 | 695 | 102 | 102 | ✅ No change |
| test_rolling_disc_brim | 243.11 | 672 | 110 | 102 | ⚠️ +8 (+7.8%) |

Files Changed

  • benchmarks/conftest.py: 107 lines (minimal, clean)
  • .github/workflows/tests.yaml: Enhanced with benchmarking and baseline comparison
  • .benchmarks/baseline.json: Baseline operation counts
  • benchmarks/README.md: Documentation for usage
  • .gitignore: Added pytest cache files

All linting checks pass ✅

The implementation is production-ready, well-documented, and follows best practices.

Original prompt

This section details the original issue you should resolve.

<issue_title>Use benchmark results</issue_title>
<issue_description>At the moment the benchmarks are just run to make sure that they remain valid. However, they should also be used as regression tests. To do so I would like three things:

  • Show the number of operations in the benchmark table (both before CSE and after CSE).
  • Show the benchmark results automatically in each PR.
  • Add a constraint on the regression, such that the test fails if the number of operations in the EOMs after CSE exceeds the previous count.</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI changed the title [WIP] Add benchmark results as regression tests Add operation count display and regression testing to benchmarks Jan 10, 2026
Copilot AI requested a review from tjstienstra January 10, 2026 14:44

github-actions bot commented Jan 29, 2026

📊 Benchmark Results

rolling_disc

| Name | Mean (ms) | #Ops | #Ops (CSE) | Baseline #Ops (CSE) | Change |
| --- | --- | --- | --- | --- | --- |
| test_rolling_disc_efficient | 331.50 | 695 | 102 | 102 | ✅ No change |
| test_rolling_disc_only_efficient_pos | 178.75 | 680 | 180 | 180 | ✅ No change |
| test_rolling_disc_inefficient | 838.26 | 19643 | 445 | 445 | ✅ No change |
| test_rolling_disc_3_coords | 117.75 | 85 | 52 | 52 | ✅ No change |
| test_rolling_disc_brim | 261.54 | 672 | 168 | 168 | ✅ No change |

Whipple bicycle Moore

| Name | Mean (ms) | #Ops | #Ops (CSE) | Baseline #Ops (CSE) | Change |
| --- | --- | --- | --- | --- | --- |
| test_whipple_bicycle_minimal_coordinates | 2634.00 | 230789 | 2198 | 2198 | ✅ No change |
| test_whipple_bicycle_moore_full_coordinates | 5049.80 | 390554 | 2389 | 2389 | ✅ No change |
| test_whipple_bicycle_moore_brim | 4028.99 | 266664 | 1909 | 1909 | ✅ No change |

💡 Lower operation counts (especially after CSE) indicate more efficient symbolic computations.

Collaborator

@tjstienstra tjstienstra left a comment


I like the baseline file and also having the checks. However, we've just migrated to UV so you'll need to update the PR.

Contributor Author

Copilot AI commented Jan 29, 2026

> I like the baseline file and also having the checks. However, we've just migrated to UV so you'll need to update the PR.

Updated the workflow to use UV instead of Poetry. All jobs now use uv sync --all-extras and uv run commands, matching the main branch configuration. Commit: fab28f3

Collaborator

@tjstienstra tjstienstra left a comment


You should rebase the branch on main or merge main into it, not make separate commits to reflect the changes in main.

Contributor Author

Copilot AI commented Jan 29, 2026

> You should rebase the branch on main or merge main into it, not make separate commits to reflect the changes in main.

Rebased the branch on main. The branch now has a clean history:

```
* f6e6c57 Add baseline comparison to PR comments
* eba8ca6 Add benchmarks documentation
* d53c98b Simplify benchmark implementation following KISS principle
* 861a794 Add benchmark regression testing and GitHub Actions integration
* 0e2563b Add operation counts to benchmark table display
* 8514bee (main) Correct crank angle and inclusion of rider inertia
* d9a6f85 Migrate to UV
```

The rebase:

  • Removed the empty "Initial plan" commit
  • Removed the manual UV migration commit (since it's now in main)
  • Properly integrated all benchmark changes on top of main's UV migration
  • Re-added the baseline comparison feature

The branch needs a force push to update the remote. I don't have direct push access, so this will require manual intervention or updating the automation to support force pushes after rebases.

@tjstienstra tjstienstra force-pushed the copilot/use-benchmark-results branch from fab28f3 to 469bb71 on February 1, 2026 10:29
@tjstienstra tjstienstra marked this pull request as ready for review February 1, 2026 10:46
Co-authored-by: tjstienstra <97806294+tjstienstra@users.noreply.github.com>
@tjstienstra tjstienstra merged commit 47b954c into main Feb 1, 2026
15 checks passed
@tjstienstra tjstienstra deleted the copilot/use-benchmark-results branch February 1, 2026 11:21

Labels

chore, testing (Change in the test suite)


Development

Successfully merging this pull request may close these issues.

Use benchmark results

2 participants