# SWE-bench Pro Action

A GitHub Action for running SWE-bench Pro preflight validation and agent evaluation, powered by mcpbr.

SWE-bench Pro is Scale AI's multi-language software engineering benchmark: 1,865 task instances across 41 repositories in Python, Go, JavaScript, and TypeScript. This action lets you validate golden patches (preflight) and run agent evaluations against those instances directly in CI.
## Quick Start

### Preflight

Validate that golden patches pass their test suites before running agent evaluations:

```yaml
- uses: greynewell/swe-bench-pro-action@v1
  with:
    mode: preflight
    sample-size: "5"
```

### Evaluate

Run your MCP agent against SWE-bench Pro instances:
```yaml
- uses: greynewell/swe-bench-pro-action@v1
  with:
    mode: evaluate
    config: mcpbr.yaml
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
    sample-size: "10"
```

## Inputs

| Input | Default | Description |
|---|---|---|
| `mode` | `preflight` | `preflight` (validate golden patches) or `evaluate` (run agent) |
| `benchmark` | `swe-bench-pro` | Benchmark name |
| `sample-size` | (all) | Number of instances to evaluate |
| `task-ids` | (empty) | Comma-separated instance IDs |
| `filter-category` | (empty) | Filter by language or repo substring |
| `max-concurrent` | `2` | Max concurrent Docker containers |
| `timeout` | `300` | Per-test timeout in seconds |
| `fail-fast` | `false` | Stop on first failure |
| `config` | (empty) | Path to mcpbr YAML config (required for `evaluate`) |
| `anthropic-api-key` | (empty) | Anthropic API key (`evaluate` mode) |
| `model` | (empty) | Model override |
| `output-format` | `json,junit` | Comma-separated: `json`, `junit`, `markdown`, `html` |
| `mcpbr-version` | (latest) | Pin a specific mcpbr version |
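These inputs compose as you would expect. As a sketch, an `evaluate` run restricted to Go instances with a longer per-test timeout might look like this (the specific values are illustrative, not recommendations):

```yaml
- uses: greynewell/swe-bench-pro-action@v1
  with:
    mode: evaluate
    config: mcpbr.yaml
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
    filter-category: go            # only Go instances
    timeout: "600"                 # extra headroom for slower builds
    fail-fast: "true"              # stop at the first failing instance
    output-format: "json,markdown"
```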
## Outputs

| Output | Description |
|---|---|
| `results-path` | Path to the results directory |
| `total` | Total instances evaluated |
| `passed` | Number passed |
| `failed` | Number failed |
| `success-rate` | Success rate percentage |
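These outputs can gate later steps. A sketch, assuming the action step has `id: eval` and a hypothetical 80% minimum:

```yaml
- name: Enforce minimum success rate
  run: |
    rate="${{ steps.eval.outputs.success-rate }}"
    # Fail the job if the success rate (a percentage) drops below 80.
    awk -v r="$rate" 'BEGIN { exit (r < 80) ? 1 : 0 }'
```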
## Examples

### Preflight on every push

Run preflight on every push to catch environment issues early:

```yaml
name: SWE-bench Preflight
on: [push, pull_request]
jobs:
  preflight:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          tool-cache: false
      - name: Run SWE-bench preflight
        uses: greynewell/swe-bench-pro-action@v1
        with:
          mode: preflight
          sample-size: "3"
          fail-fast: "true"
```

### Scheduled evaluation

Run a full evaluation on a schedule:
```yaml
name: SWE-bench Evaluation
on:
  schedule:
    - cron: "0 2 * * *" # 2 AM UTC daily
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          tool-cache: false
      - name: Run evaluation
        id: eval
        uses: greynewell/swe-bench-pro-action@v1
        with:
          mode: evaluate
          config: mcpbr.yaml
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          sample-size: "20"
          max-concurrent: "4"
          output-format: "json,junit,markdown"
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: swe-bench-results
          path: ${{ steps.eval.outputs.results-path }}
      - name: Check success rate
        run: |
          echo "Success rate: ${{ steps.eval.outputs.success-rate }}%"
          echo "Passed: ${{ steps.eval.outputs.passed }}/${{ steps.eval.outputs.total }}"
```

### Filter by language

Evaluate only Python instances:
```yaml
- uses: greynewell/swe-bench-pro-action@v1
  with:
    mode: preflight
    filter-category: python
    sample-size: "10"
```

### Specific instances

Run specific instances:
```yaml
- uses: greynewell/swe-bench-pro-action@v1
  with:
    mode: preflight
    task-ids: "django__django-16046, scikit-learn__scikit-learn-25638"
```

## Requirements

- Runner: `ubuntu-latest` (x86_64). ARM64 runners are not supported due to SWE-bench container compatibility.
- Docker: The runner must have Docker available. GitHub-hosted runners include Docker by default.
- Disk space: SWE-bench images are large. Free disk space before running (see below).
- API key (`evaluate` mode only): An Anthropic API key passed via secrets.
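Before a long run, a quick sanity-check step (a sketch) can confirm the Docker and disk-space requirements on the runner:

```yaml
- name: Check Docker and disk space
  run: |
    docker version   # fails fast if the daemon is unavailable
    df -h /          # show free space before pulling large images
```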
## Disk Space

SWE-bench Docker images are large. On GitHub-hosted runners, use jlumbroso/free-disk-space to reclaim ~30 GB:

```yaml
- uses: jlumbroso/free-disk-space@main
  with:
    tool-cache: false # keep tool cache for faster builds
```

## Runner Sizing

| Runner Type | Recommended `max-concurrent` | Notes |
|---|---|---|
| Free (`ubuntu-latest`) | 2 | 2 vCPU, 7 GB RAM |
| Standard (4-core) | 4 | 4 vCPU, 16 GB RAM |
| Large (8-core) | 6-8 | 8 vCPU, 32 GB RAM |
## How It Works

This action runs as a Docker container on the GitHub Actions runner. It uses the host's Docker daemon (via socket mount) to create sibling containers for SWE-bench instances:

```
GitHub Runner (ubuntu-latest, x86_64)
├── Docker Daemon (native)
├── Action Container (mcpbr + Docker CLI)
│   └── /var/run/docker.sock (auto-mounted)
├── SWE-bench Container 1 (sibling)
└── SWE-bench Container 2 (sibling)
```

Running on x86_64 runners avoids ARM64/QEMU compatibility issues with Go, JavaScript, and TypeScript SWE-bench Pro instances.
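The sibling pattern can be reproduced by hand to see why it works. An illustrative sketch (not meant to run in CI as-is; requires a local Docker install):

```bash
# Start a container with the host's Docker socket mounted, as the runner
# does for the action container:
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
  docker:cli docker ps
# The `docker ps` inside talks to the HOST daemon through the socket, so it
# lists the host's containers — any container it starts becomes a sibling,
# not a nested (Docker-in-Docker) container.
```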
## Contributing

See CONTRIBUTING.md for development setup, testing, and submission guidelines.

## License

MIT