---
name: add-benchmark
description: >
  Guide for adding a new benchmark or training environment to NeMo-Gym.
  Use when the user asks to add, create, or integrate a benchmark, evaluation,
  training environment, or resource server into NeMo-Gym. Also use when wrapping
  an existing 3rd-party benchmark library. Covers the full workflow: data preparation,
  resource server implementation, agent wiring, YAML config, testing, and reward
  profiling (baselining). Triggered by: "add benchmark", "new resource server",
  "integrate benchmark", "wrap benchmark", "add training environment", "add eval".
---

# Add Benchmark to NeMo-Gym

## Determine Integration Type

Before starting, determine which type of benchmark you're adding:

**Native benchmark** — verification logic implemented directly in a Gym resource server:
- Resource server implements `verify()` with reward logic
- Agent server orchestrates model calls (use `simple_agent` for single-turn, or a custom agent for multi-turn)
- Examples: `code_gen`, `instruction_following`, `math_with_judge`

**External benchmark** — wrapping a 3rd-party library that has its own orchestration:
- Integrate at the agent server level (not the resource server)
- The agent's `/run` endpoint wraps the external library
- Pre-process from the Gym schema to the library's input format, then post-process back to `BaseVerifyResponse`
- Reproduce publicly reported numbers with the original repo first, then reproduce them again after Gym integration
- Add the dependency to `requirements.txt`

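The pre-/post-processing seam for an external benchmark can be sketched as follows. This is a minimal illustration, not the real Gym API: the row shape follows the JSONL schema used throughout this guide, but `to_library_input`, `run_prompt` as the library's input field, and the exact `BaseVerifyResponse` fields are hypothetical stand-ins.

```python
from typing import Any


def to_library_input(row: dict[str, Any]) -> dict[str, Any]:
    """Pre-process a Gym JSONL row into the external library's input format."""
    messages = row["responses_create_params"]["input"]
    return {
        # hypothetical library fields: adapt to what the real library expects
        "prompt": "\n".join(m["content"] for m in messages),
        "task_id": row["verifier_metadata"]["task_id"],
    }


def to_verify_response(result: dict[str, Any]) -> dict[str, Any]:
    """Post-process the library's result back toward a BaseVerifyResponse-shaped dict.

    Field names here are illustrative; match the real BaseVerifyResponse schema.
    """
    return {"reward": 1.0 if result.get("passed") else 0.0}
```

The agent's `/run` endpoint would call these two helpers around the external library's own orchestration loop.
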
## Workflow

### Step 1: Scaffold the server

Run `ng_init_resources_server` to generate the directory structure:

```bash
ng_init_resources_server +entrypoint=resources_servers/my_benchmark
```

This creates:
```
resources_servers/my_benchmark/
├── app.py                      # Server template
├── configs/my_benchmark.yaml
├── data/.gitignore
├── tests/test_app.py
├── requirements.txt
└── README.md
```

For external benchmarks, create the agent server manually under `responses_api_agents/my_agent/` with the same structure.

### Step 2: Prepare data

Convert your source dataset to Gym JSONL format. Each line must have `responses_create_params.input` (OpenAI message format). Task-specific verification data goes in `verifier_metadata`.

```json
{
  "responses_create_params": {
    "input": [
      {"role": "system", "content": "System prompt"},
      {"role": "user", "content": "Problem statement"}
    ]
  },
  "verifier_metadata": {
    "test_cases": [{"input": "...", "expected_output": "..."}],
    "task_id": "unique_id"
  }
}
```

**Data conversion**: Write conversion scripts in the **source repo** (e.g. your dataset repository), not in NeMo-Gym. Prompt files also belong in the source repo. Exception: when there is no external source repo. See `references/patterns.md` § "Data Conversion Script Pattern".

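A conversion script of this shape (kept in the source repo, per the note above) might look like the following sketch. The source-row field names (`question`, `tests`, `id`) are assumptions about a hypothetical dataset; map your real fields accordingly.

```python
import json


def convert_row(src: dict) -> dict:
    """Map one hypothetical source-dataset row to the Gym JSONL schema."""
    return {
        "responses_create_params": {
            "input": [
                {"role": "system", "content": "System prompt"},
                {"role": "user", "content": src["question"]},
            ]
        },
        "verifier_metadata": {"test_cases": src["tests"], "task_id": src["id"]},
    }


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert a source JSONL file line-by-line into Gym JSONL."""
    with open(src_path) as fin, open(dst_path, "w") as fout:
        for line in fin:
            fout.write(json.dumps(convert_row(json.loads(line))) + "\n")
```
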
**`example.jsonl`**: Generate 5 entries for smoke testing. This file is committed directly to git at `data/example.jsonl`.

**`train`/`validation` datasets**: Upload to the GitLab dataset registry — these must NOT be committed to git.

```bash
ng_upload_dataset_to_gitlab \
  +dataset_name=my_benchmark \
  +version=0.0.1 \
  +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
```

Requires MLflow credentials in `env.yaml` (or passed via CLI):
```yaml
mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
mlflow_tracking_token: <your-gitlab-api-token>
```

**`data/.gitignore`**: The scaffold generates default patterns (`*train.jsonl`, `*validation.jsonl`, etc.). If your filename doesn't match (e.g. `my_eval.jsonl`), add a custom pattern (e.g. `*eval.jsonl`). If data was previously tracked, run `git rm --cached <file>`.

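Putting that together, a `data/.gitignore` extended for a non-matching filename might look like this (the exact defaults the scaffold emits may differ from what's shown; `*eval.jsonl` is the custom addition):

```gitignore
*train.jsonl
*validation.jsonl
# custom pattern: my_eval.jsonl doesn't match the defaults above
*eval.jsonl
```
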
**Validate** your data:
```bash
# Validate example data (for PR submission)
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
  +output_dirpath=/tmp/prepare +mode=example_validation

# Download and prepare train/validation from GitLab
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
  +output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
```

### Step 3: Implement verify()

Edit `app.py`. The `verify()` method receives the model output plus `verifier_metadata` and returns a reward.

For code execution benchmarks, see `references/patterns.md` § "Subprocess Execution with Ray" and "Resource Server Pattern".

Critical rules:
- Return `reward` as 0.0 or 1.0 (binary)
- Handle empty/missing model output gracefully — return 0.0, don't crash
- The server must handle 4k-65k concurrent requests without crashing
- Use `asyncio.Semaphore` for subprocess concurrency control
- For Ray remote tasks: `result = await future` (Ray futures are directly awaitable). Never call `ray.get()` in an async context.
- Decode subprocess output with `errors="replace"`
- Strip `<think>`/`<thinking>` blocks before parsing model output (thinking models emit these)
- Tests should `pytest.mark.skipif` when external tools aren't installed
- If the benchmark auto-installs its tool (see Step 3b), add a `pytest_configure` hook in `conftest.py` to run the install before test collection — `skipif` evaluates at import time, before fixtures run

### Step 3b: Auto-install external tools (if applicable)

If the benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup. See `references/patterns.md` § "External Tool Auto-Install Pattern".

Key points:
- Create `setup_<tool>.py` with `ensure_<tool>()` — checks PATH, forks on `sys.platform` (brew on macOS, build from source on Linux)
- Call it in `model_post_init()` before semaphore init
- Build scripts should be idempotent and install into a local gitignored prefix
- Add a `pytest_configure` hook in `tests/conftest.py` that calls `ensure_<tool>()` before collection

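The `ensure_<tool>()` shape can be sketched like this. The build-script path and install prefix are hypothetical; the real pattern lives in `references/patterns.md`.

```python
import shutil
import subprocess
import sys


def ensure_tool(name: str) -> str:
    """Return the path to `name`, installing it first if it's not on PATH.

    Idempotent: a second call finds the tool immediately and does nothing.
    """
    found = shutil.which(name)
    if found:
        return found
    if sys.platform == "darwin":
        subprocess.run(["brew", "install", name], check=True)  # macOS: Homebrew
    else:
        # Linux: build from source into a local, gitignored prefix (hypothetical script)
        subprocess.run(["bash", f"scripts/build_{name}.sh", "--prefix", ".tools"], check=True)
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name} install failed")
    return path
```

Call it from `model_post_init()` and from the `pytest_configure` hook in `tests/conftest.py`, so both the server and the test run get the tool before first use.
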
### Step 4: Wire YAML config

Edit `configs/my_benchmark.yaml`. Define the resource server instance and agent pairing(s). See `references/patterns.md` § "YAML Config Pattern".

Key points:
- `verified: false` is auto-added by the pre-commit hook (set it to `true` after baselining)
- `license` is required for `train` and `validation` datasets
- The agent references the resource server and model server by instance name

For multi-turn benchmarks, either use `proof_refinement_agent` or create a custom agent. See `references/patterns.md` § "Agent Patterns".

For `train`/`validation` datasets, add `gitlab_identifier` alongside `jsonl_fpath`:
```yaml
datasets:
- name: my_dataset
  type: train
  jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
  gitlab_identifier:
    dataset_name: my_benchmark
    version: 0.0.1
    artifact_fpath: my_dataset.jsonl
  license: MIT
- name: example
  type: example
  jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl
```

Both fields must coexist: `jsonl_fpath` is the local download destination, while `gitlab_identifier` tells the system where to fetch from. `example` datasets don't need `gitlab_identifier` — they're committed to git directly.

### Step 5: Test

```bash
# Run server tests (creates an isolated .venv; slow on first run)
ng_test +entrypoint=resources_servers/my_benchmark

# Run core library tests to check nothing broke
pytest tests/unit_tests/ -x
```

Test coverage must be >= 95%. Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error, if applicable), and verify timeout.

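Concretely, the first three cases might look like this sketch, written against a stub `verify` so the shape is visible. Everything here is illustrative: swap the stub for the real import from your `app.py`, and add the compilation-error and timeout cases.

```python
import asyncio


async def verify(output, metadata):
    """Stub standing in for the real app.verify(); replace with the real import."""
    if not output or "```" not in output:
        return 0.0  # no code extracted
    code = output.split("```")[1]
    return 1.0 if code.strip() == metadata.get("expected") else 0.0


def test_verify_pass():
    assert asyncio.run(verify("```\nx\n```", {"expected": "x"})) == 1.0


def test_verify_fail_wrong_output():
    assert asyncio.run(verify("```\ny\n```", {"expected": "x"})) == 0.0


def test_verify_fail_no_code():
    assert asyncio.run(verify("no fence here", {"expected": "x"})) == 0.0
```
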
### Step 6: Smoke test end-to-end

```bash
# Start servers
ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"

# Quick test with example data
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
  +input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl \
  +output_jsonl_fpath=results/example_rollouts.jsonl \
  +num_repeats=1 \
  "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

# Inspect results
```

### Step 7: Baseline (reward profiling)

Run against multiple models to validate correctness. Recommended suite:
- Your policy model of interest
- At least one open-source instruct model (e.g. Qwen 3 30B A3B Instruct)
- At least one open-source thinking model (e.g. Qwen 3 30B A3B Thinking)
- At least one closed-source model (e.g. GPT-5 Nano or GPT-5)

```bash
# Collect rollouts
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
  +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
  +output_jsonl_fpath=results/rollouts.jsonl \
  +num_repeats=5 \
  "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

# Compute per-task pass rates
ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
  +rollouts_jsonl_fpath=results/rollouts.jsonl \
  +output_jsonl_fpath=results/profiled.jsonl \
  +pass_threshold=1.0

# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)
python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl
```

Increase `num_repeats` until variance < 1% across runs on the same model.

Closed-source models should score at or above open-source models. If not, investigate for bugs. Inspect actual failure cases in the rollout JSONL, not just aggregate numbers.

For external benchmarks: reproduce the original repo's published numbers first, then reproduce them again after Gym integration. The scores should match.

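The pass@1/pass@k aggregation in the last command reduces to this arithmetic. This is a sketch of the computation, not the actual `print_aggregate_results.py`:

```python
def aggregate(per_task_rewards: dict[str, list[float]]) -> dict[str, float]:
    """per_task_rewards maps task_id -> binary rewards across num_repeats rollouts."""
    n = len(per_task_rewards)
    # pass@1: mean reward per task, averaged across tasks (avg_reward)
    pass_at_1 = sum(sum(r) / len(r) for r in per_task_rewards.values()) / n
    # pass@k, k = num_repeats: a task counts as solved if any rollout passed (max_reward)
    pass_at_k = sum(max(r) for r in per_task_rewards.values()) / n
    return {"pass@1": pass_at_1, "pass@k": pass_at_k}
```
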
### Step 8: Pre-commit and PR

```bash
pre-commit run --all-files
```

The first run may fail as hooks auto-modify files (`verified: false` flag, README table). Stage the changes and run again.

Set `verified: true` in the YAML config after successful baselining. Include W&B links and screenshots of results in the PR description.

To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:
```bash
pre-commit run --files resources_servers/my_benchmark/**/*
```
If hooks modify files in other directories, discard those changes:
```bash
git checkout -- resources_servers/other_server/
```

## Constraints

- Use NeMo Gym's OpenAI client (`nemo_gym/openai_utils.py`), not LiteLLM, Anthropic, or other clients
- Pass configuration through Gym config (YAML), not environment variables
- Code must run on Linux
- The `/run` endpoint must be async
- Errors from tool execution or bad model output must return error responses, not crash the server
- All commits require DCO sign-off (`-s`) and cryptographic signature (`-S`)

## Reference

For detailed code patterns, schemas, and examples, see [references/patterns.md](references/patterns.md).