Skip to content

Commit 14c6c99

Browse files
committed
Merge branch 'main' of https://github.com/NVIDIA-NeMo/Gym into cmunley1/on-policy-doc
Signed-off-by: Brian Yu <bxyu@nvidia.com>
2 parents f292677 + 67034a7 commit 14c6c99

File tree

268 files changed

+21935
-3456
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

268 files changed

+21935
-3456
lines changed
Lines changed: 251 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,251 @@
1+
---
2+
name: add-benchmark
3+
description: >
4+
Guide for adding a new benchmark or training environment to NeMo-Gym.
5+
Use when the user asks to add, create, or integrate a benchmark, evaluation,
6+
training environment, or resource server into NeMo-Gym. Also use when wrapping
7+
an existing 3rd-party benchmark library. Covers the full workflow: data preparation,
8+
resource server implementation, agent wiring, YAML config, testing, and reward
9+
profiling (baselining). Triggered by: "add benchmark", "new resource server",
10+
"integrate benchmark", "wrap benchmark", "add training environment", "add eval".
11+
---
12+
13+
# Add Benchmark to NeMo-Gym
14+
15+
## Determine Integration Type
16+
17+
Before starting, determine which type of benchmark you're adding:
18+
19+
**Native benchmark** — verification logic implemented directly in a Gym resource server:
20+
- Resource server implements `verify()` with reward logic
21+
- Agent server orchestrates model calls (use `simple_agent` for single-turn, or custom agent for multi-turn)
22+
- Example: `code_gen`, `instruction_following`, `math_with_judge`
23+
24+
**External benchmark** — wrapping a 3rd-party library that has its own orchestration:
25+
- Integrate at the agent server level (not resource server)
26+
- Agent's `/run` endpoint wraps the external library
27+
- Pre-process from Gym schema to library input, post-process back to `BaseVerifyResponse`
28+
- Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration
29+
- Add the dependency in `requirements.txt`
30+
31+
## Workflow
32+
33+
### Step 1: Scaffold the server
34+
35+
Run `ng_init_resources_server` to generate the directory structure:
36+
37+
```bash
38+
ng_init_resources_server +entrypoint=resources_servers/my_benchmark
39+
```
40+
41+
This creates:
42+
```
43+
resources_servers/my_benchmark/
44+
├── app.py # Server template
45+
├── configs/my_benchmark.yaml
46+
├── data/.gitignore
47+
├── tests/test_app.py
48+
├── requirements.txt
49+
└── README.md
50+
```
51+
52+
For external benchmarks, create the agent server manually under `responses_api_agents/my_agent/` with the same structure.
53+
54+
### Step 2: Prepare data
55+
56+
Convert your source dataset to Gym JSONL format. Each line must have `responses_create_params.input` (OpenAI message format). Task-specific verification data goes in `verifier_metadata`.
57+
58+
```json
59+
{
60+
"responses_create_params": {
61+
"input": [
62+
{"role": "system", "content": "System prompt"},
63+
{"role": "user", "content": "Problem statement"}
64+
]
65+
},
66+
"verifier_metadata": {
67+
"test_cases": [{"input": "...", "expected_output": "..."}],
68+
"task_id": "unique_id"
69+
}
70+
}
71+
```
72+
73+
**Data conversion**: Write conversion scripts in the **source repo** (e.g. your dataset repository), not in NeMo-Gym. Prompt files also belong in the source repo. Exception: when there is no external source repo. See `references/patterns.md` § "Data Conversion Script Pattern".
74+
75+
**`example.jsonl`**: Generate 5 entries for smoke testing. This file is committed directly to git in `data/example.jsonl`.
76+
77+
**`train`/`validation` datasets**: Upload to the GitLab dataset registry — these must NOT be committed to git.
78+
79+
```bash
80+
ng_upload_dataset_to_gitlab \
81+
+dataset_name=my_benchmark \
82+
+version=0.0.1 \
83+
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
84+
```
85+
86+
Requires MLflow credentials in `env.yaml` (or passed via CLI):
87+
```yaml
88+
mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
89+
mlflow_tracking_token: <your-gitlab-api-token>
90+
```
91+
92+
**`data/.gitignore`**: The scaffold generates default patterns (`*train.jsonl`, `*validation.jsonl`, etc.). If your filename doesn't match (e.g. `my_eval.jsonl`), add a custom pattern (e.g. `*eval.jsonl`). If data was previously tracked, run `git rm --cached <file>`.
93+
94+
**Validate** your data:
95+
```bash
96+
# Validate example data (for PR submission)
97+
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
98+
+output_dirpath=/tmp/prepare +mode=example_validation
99+
100+
# Download and prepare train/validation from GitLab
101+
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
102+
+output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
103+
```
104+
105+
### Step 3: Implement verify()
106+
107+
Edit `app.py`. The `verify()` method receives model output + `verifier_metadata`, returns reward.
108+
109+
For code execution benchmarks, see `references/patterns.md` § "Subprocess Execution with Ray" and "Resource Server Pattern".
110+
111+
Critical rules:
112+
- Return `reward` as 0.0 or 1.0 (binary)
113+
- Handle empty/missing model output gracefully — return 0.0, don't crash
114+
- Must handle 4k-65k concurrent requests without crashing
115+
- Use `asyncio.Semaphore` for subprocess concurrency control
116+
- For Ray remote tasks: `result = await future` (Ray futures are directly awaitable). Never call `ray.get()` in async context.
117+
- Decode subprocess output with `errors="replace"`
118+
- Strip `<think>`/`<thinking>` blocks before parsing model output (thinking models emit these)
119+
- Tests should `pytest.mark.skipif` when external tools aren't installed
120+
- If the benchmark auto-installs its tool (see Step 3b), add a `pytest_configure` hook in `conftest.py` to run the install before test collection — `skipif` evaluates at import time, before fixtures run
121+
122+
### Step 3b: Auto-install external tools (if applicable)
123+
124+
If the benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup. See `references/patterns.md` § "External Tool Auto-Install Pattern".
125+
126+
Key points:
127+
- Create `setup_<tool>.py` with `ensure_<tool>()` — checks PATH, forks on `sys.platform` (brew on macOS, build from source on Linux)
128+
- Call it in `model_post_init()` before semaphore init
129+
- Build scripts should be idempotent and install into a local gitignored prefix
130+
- Add a `pytest_configure` hook in `tests/conftest.py` that calls `ensure_<tool>()` before collection
131+
132+
### Step 4: Wire YAML config
133+
134+
Edit `configs/my_benchmark.yaml`. Define the resource server instance and agent pairing(s). See `references/patterns.md` § "YAML Config Pattern".
135+
136+
Key points:
137+
- `verified: false` is auto-added by pre-commit hook (set to `true` after baselining)
138+
- `license` is required for `train` and `validation` datasets
139+
- Agent references resource server and model server by instance name
140+
141+
For multi-turn benchmarks, either use `proof_refinement_agent` or create a custom agent. See `references/patterns.md` § "Agent Patterns".
142+
143+
For `train`/`validation` datasets, add `gitlab_identifier` alongside `jsonl_fpath`:
144+
```yaml
145+
datasets:
146+
- name: my_dataset
147+
type: train
148+
jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
149+
gitlab_identifier:
150+
dataset_name: my_benchmark
151+
version: 0.0.1
152+
artifact_fpath: my_dataset.jsonl
153+
license: MIT
154+
- name: example
155+
type: example
156+
jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl
157+
```
158+
159+
Both fields must coexist: `jsonl_fpath` is the local download destination, `gitlab_identifier` tells the system where to fetch from. `example` datasets don't need `gitlab_identifier` — they're committed to git directly.
160+
161+
### Step 5: Test
162+
163+
```bash
164+
# Run server tests (creates isolated .venv, slow on first run)
165+
ng_test +entrypoint=resources_servers/my_benchmark
166+
167+
# Run core library tests to check nothing broke
168+
pytest tests/unit_tests/ -x
169+
```
170+
171+
Test coverage must be >= 95%. Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error if applicable), verify timeout.
172+
173+
### Step 6: Smoke test end-to-end
174+
175+
```bash
176+
# Start servers
177+
ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"
178+
179+
# Quick test with example data
180+
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
181+
+input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl \
182+
+output_jsonl_fpath=results/example_rollouts.jsonl \
183+
+num_repeats=1 \
184+
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
185+
186+
# Inspect results
187+
```
188+
189+
### Step 7: Baseline (reward profiling)
190+
191+
Run against multiple models to validate correctness. Recommended suite:
192+
- Your policy model of interest
193+
- At least one open-source instruct model (e.g. Qwen 3 30B A3B Instruct)
194+
- At least one open-source thinking model (e.g. Qwen 3 30B A3B Thinking)
195+
- At least one closed-source model (e.g. GPT-5 Nano or GPT-5)
196+
197+
```bash
198+
# Collect rollouts
199+
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
200+
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
201+
+output_jsonl_fpath=results/rollouts.jsonl \
202+
+num_repeats=5 \
203+
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
204+
205+
# Compute per-task pass rates
206+
ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
207+
+rollouts_jsonl_fpath=results/rollouts.jsonl \
208+
+output_jsonl_fpath=results/profiled.jsonl \
209+
+pass_threshold=1.0
210+
211+
# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)
212+
python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl
213+
```
214+
215+
Increase `num_repeats` until variance < 1% across runs on the same model.
216+
217+
Closed-source models should score at or above open-source models. If not, investigate for bugs. Inspect actual failure cases in the rollout JSONL, not just aggregate numbers.
218+
219+
For external benchmarks: reproduce the original repo's published numbers first. Then reproduce after Gym integration. Scores should match.
220+
221+
### Step 8: Pre-commit and PR
222+
223+
```bash
224+
pre-commit run --all-files
225+
```
226+
227+
First run may fail as hooks auto-modify files (`verified: false` flag, README table). Stage changes and run again.
228+
229+
Set `verified: true` in YAML config after successful baselining. Include W&B links and screenshots of results in the PR description.
230+
231+
To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:
232+
```bash
233+
pre-commit run --files resources_servers/my_benchmark/**/*
234+
```
235+
If hooks modify files in other directories, discard those changes:
236+
```bash
237+
git checkout -- resources_servers/other_server/
238+
```
239+
240+
## Constraints
241+
242+
- Use NeMo Gym's OpenAI client (`nemo_gym/openai_utils.py`), not LiteLLM/Anthropic/other
243+
- Pass configuration through Gym config (YAML), not environment variables
244+
- Code must run on Linux
245+
- `/run` endpoint must be async
246+
- Errors from tool execution or bad model output must return error responses, not crash
247+
- All commits require DCO sign-off (`-s`) and cryptographic signature (`-S`)
248+
249+
## Reference
250+
251+
For detailed code patterns, schemas, and examples: see [references/patterns.md](references/patterns.md).

0 commit comments

Comments
 (0)