feat: support multiple datasets for response dataset #1691

Merged
terrykong merged 8 commits into main from yukih/multiple-dataset
Feb 3, 2026

@yuki-97 yuki-97 commented Dec 23, 2025

Related issue: #1049

Usage

data:
  _override_: true # override the data config instead of merging with it
  # other data settings, see `examples/configs/sft.yaml` for more details
  ...
  # dataset settings
  train:
    # train dataset 1
    - dataset_name: OpenMathInstruct-2
      split_validation_size: 0.05 # use 5% of the training data as validation data
      seed: 42  # seed for train/validation split when split_validation_size > 0
    # train dataset 2
    - dataset_name: DeepScaler
  validation:
    # validation dataset 1
    - dataset_name: AIME2024
      repeat: 16
    # validation dataset 2
    - dataset_name: DAPOMathAIME2024
  # default settings for all datasets
  default:
    ...
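
To illustrate how the `default` block could combine with the per-dataset entries above, here is a minimal Python sketch. It assumes that keys set on an individual dataset take precedence over `default`; the helper name `resolve_dataset_configs` is hypothetical, and the values placed in `default` are illustrative only, since the PR description elides the real contents.

```python
from typing import Any


def resolve_dataset_configs(
    entries: list[dict[str, Any]], default: dict[str, Any]
) -> list[dict[str, Any]]:
    """Return one config per dataset, with per-dataset keys overriding defaults."""
    return [{**default, **entry} for entry in entries]


# Illustrative values only; the real contents of `default` are not shown in the PR.
default = {"seed": 0, "split_validation_size": 0.0}
train = [
    {"dataset_name": "OpenMathInstruct-2", "split_validation_size": 0.05, "seed": 42},
    {"dataset_name": "DeepScaler"},
]

resolved = resolve_dataset_configs(train, default)
for cfg in resolved:
    print(cfg)
```

Under this assumption the first dataset keeps its own `seed` and `split_validation_size`, while the second picks up both values from `default`.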

Summary by CodeRabbit

  • New Features

    • Added multi-dataset training support, enabling models to train on multiple datasets simultaneously.
    • Introduced configuration structure for multi-dataset GRPO experimentation with customizable dataset-specific parameters.
  • Tests

    • Added functional test coverage for multi-dataset training workflows.

@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Dec 23, 2025
@yuki-97 yuki-97 force-pushed the yukih/multiple-dataset branch from ff87f2c to 5835ce7 Compare December 23, 2025 07:59
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Dec 23, 2025
@RayenTian RayenTian force-pushed the yukih/multiple-dataset branch from 94e40a6 to c0b8cde Compare January 2, 2026 03:25
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 2, 2026
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch 2 times, most recently from 2a4cedd to 20f3a62 Compare January 8, 2026 15:47
@yuki-97 yuki-97 force-pushed the yukih/multiple-dataset branch from d9836a6 to 8577efb Compare January 12, 2026 03:08
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 12, 2026
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from f9def0d to ec862a3 Compare January 20, 2026 10:40
Base automatically changed from yukih/split-train-val-dataset to main January 22, 2026 00:17
@yuki-97 yuki-97 force-pushed the yukih/multiple-dataset branch 2 times, most recently from 74a26c0 to a990378 Compare January 26, 2026 06:08
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jan 26, 2026
@yuki-97 yuki-97 changed the title from "[don't merge] support multiple datasets for response dataset" to "feat: support multiple datasets for response dataset" Jan 26, 2026
@yuki-97 yuki-97 marked this pull request as ready for review January 26, 2026 06:08
@yuki-97 yuki-97 requested review from a team as code owners January 26, 2026 06:08
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 26, 2026

coderabbitai bot commented Jan 26, 2026

Walkthrough

Introduces multi-dataset support for the GRPO framework through a new configuration file, refactored data handling across two modules, an override-aware config merge utility, and corresponding functional tests.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Configuration | examples/configs/grpo_multiple_datasets.yaml | New YAML configuration defining multi-dataset GRPO setup with defaults reference, data overrides, dataset-wide settings (max_input_seq_length, shuffle, num_workers), and per-dataset train/validation lists with optional split/repeat parameters. |
| Config Utilities | nemo_rl/utils/config.py | Added merge_with_override() utility function that honors _override_: true markers in child configs to fully override parent config subtrees, improving inheritance control in load_config_with_inheritance(). |
| Data Handling | examples/run_sft.py, nemo_rl/data/utils.py | Modified setup_data() and related functions to support multiple training/validation datasets instead of single dataset; collects datasets into lists, optionally applies per-dataset defaults, merges via concatenate_datasets(), and maintains per-task processor/environment bindings across merged datasets. |
| Testing | tests/functional/grpo_multiple_datasets.sh, tests/functional/L1_Functional_Tests_GPU.sh | New functional test script orchestrating GRPO multi-dataset experiment with metrics validation; test invocation added to L1 GPU test suite. |
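
The `_override_` behavior summarized for `merge_with_override()` can be sketched with plain dictionaries. This is an approximation under stated assumptions, not the PR's implementation: the real function operates on OmegaConf `DictConfig` objects, while `deep_merge` and `merge_with_override_sketch` below are hypothetical stand-ins, and the `policy` key is made up for the example.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge `override` into a copy of `base`; `override` wins."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def merge_with_override_sketch(
    base: dict[str, Any], override: dict[str, Any]
) -> dict[str, Any]:
    """Merge configs; a top-level subtree marked `_override_: true` replaces the parent's."""
    base, override = deepcopy(base), deepcopy(override)
    for key, value in override.items():
        if isinstance(value, dict) and value.get("_override_", False):
            value.pop("_override_")  # the marker itself must not reach the result
            base.pop(key, None)      # drop the parent subtree so nothing is inherited
    return deep_merge(base, override)


parent = {
    "data": {"dataset_name": "OpenMathInstruct-2", "shuffle": True},
    "policy": {"name": "example"},  # hypothetical key, for illustration only
}
child = {"data": {"_override_": True, "train": [{"dataset_name": "DeepScaler"}]}}
child_no_marker = {"data": {"train": [{"dataset_name": "DeepScaler"}]}}

replaced = merge_with_override_sketch(parent, child)
merged = merge_with_override_sketch(parent, child_no_marker)
print(replaced["data"])
print(merged["data"])
```

With the marker, the child's `data` section replaces the parent's entirely; without it, the parent's `dataset_name` and `shuffle` survive alongside the new `train` list. Sections the child does not mention, such as `policy` here, are inherited in both cases.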

Sequence Diagram(s)

sequenceDiagram
    participant Config as Config Loader
    participant DataList as Multiple Datasets
    participant Merge as Dataset Merger
    participant Processor as Task Processors
    participant Train as Training

    Config->>DataList: Load each dataset config
    DataList->>DataList: Apply per-dataset defaults
    DataList->>Merge: Collect all datasets
    Merge->>Merge: Concatenate datasets
    Merge->>Processor: Build per-task processors<br/>(per dataset)
    Processor->>Processor: Merge processor mappings<br/>(task_name → processor)
    Processor->>Train: Create AllTaskProcessedDataset<br/>with merged data + processors
    Train->>Train: Execute training
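
The flow in the diagram (collect datasets, concatenate them, keep a `task_name → processor` mapping) might look roughly like the sketch below. Plain Python lists stand in for the real dataset objects, and every name here (`merge_datasets`, `task_a`, `task_b`, the `prompt` field) is hypothetical; the PR's code uses `concatenate_datasets()` and `AllTaskProcessedDataset` instead.

```python
from typing import Any, Callable

Record = dict[str, Any]
Processor = Callable[[Record], Record]


def merge_datasets(
    datasets: list[tuple[str, list[Record], Processor]],
) -> tuple[list[Record], dict[str, Processor]]:
    """Concatenate records and collect one processor per task name."""
    merged: list[Record] = []
    processors: dict[str, Processor] = {}
    for task_name, records, processor in datasets:
        # Tag each record so it can be routed to its own processor after merging.
        merged.extend({**record, "task_name": task_name} for record in records)
        processors[task_name] = processor
    return merged, processors


datasets = [
    ("task_a", [{"problem": "1+1"}], lambda r: {**r, "prompt": "A: " + r["problem"]}),
    (
        "task_b",
        [{"problem": "2+2"}, {"problem": "3+3"}],
        lambda r: {**r, "prompt": "B: " + r["problem"]},
    ),
]

merged, processors = merge_datasets(datasets)
# Each record is processed by the processor registered for its task.
processed = [processors[record["task_name"]](record) for record in merged]
print([record["prompt"] for record in processed])
```

One design consequence worth noting: in this sketch, two datasets that share a task name also share a processor, because the later entry overwrites the earlier one in the mapping.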

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

Run CICD

Suggested reviewers

  • terrykong

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | PR description lacks actual test results, metric values, and validation confirmation for the major multi-dataset feature despite functional tests being added. | Add test execution results showing actual metric values, convergence validation, and confirmation that all code review issues are resolved. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely describes the main objective: adding support for multiple datasets in response/training data handling. It directly reflects the core changes across the codebase. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
nemo_rl/utils/config.py (1)

1-1: Update the copyright header year to include 2026.

Line 1 still shows 2025; please update to include the current year. As per coding guidelines, headers must include the current year.

📝 Suggested header update
-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.
examples/run_sft.py (1)

1-1: Update the copyright header year to include 2026.

Line 1 still shows 2025; please update to include the current year. As per coding guidelines, headers must include the current year.

📝 Suggested header update
-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.
nemo_rl/data/utils.py (1)

1-1: Update the copyright header year to include 2026.

Line 1 still shows 2025; please update to include the current year. As per coding guidelines, headers must include the current year.

📝 Suggested header update
-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.
🤖 Fix all issues with AI agents
In `@nemo_rl/data/utils.py`:
- Around line 75-77: The normalization logic in utils.py currently only wraps
single-dataset entries when data_config["train"] or data_config["validation"] is
a plain dict, but misses omegaconf.DictConfig; import DictConfig from omegaconf
and update the two isinstance checks (the one that normalizes
data_config["train"] and the one for data_config["validation"]) to check
isinstance(..., (dict, DictConfig)) so single-dataset DictConfig objects are
wrapped into a list before iterating (this fixes errors in
load_response_dataset).

In `@tests/functional/grpo_multiple_datasets.sh`:
- Around line 20-40: The shell command at the end of
tests/functional/grpo_multiple_datasets.sh uses an unquoted $@ which can break
arguments with spaces; update the invocation to use "$@" (i.e., replace $@ with
"$@") so all passed overrides/arguments preserve their boundaries when appended
to the uv run command before the redirection and tee pipeline.
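
The normalization issue raised in the first item above can be illustrated without OmegaConf. In this sketch `as_dataset_list` is a hypothetical helper, and `StandInDictConfig` is a made-up class that only mimics one relevant property of a non-`dict` mapping: iterating it yields its keys. The real code would pass `(dict, DictConfig)` as the review suggests.

```python
from typing import Any


def as_dataset_list(entry: Any, mapping_types: tuple[type, ...] = (dict,)) -> list[Any]:
    """Wrap a single dataset config in a list; copy list-like input as-is."""
    if isinstance(entry, mapping_types):
        return [entry]
    return list(entry)


class StandInDictConfig:
    """Hypothetical stand-in for a mapping type that is not a `dict` subclass."""

    def __init__(self, **values: Any) -> None:
        self.values = values

    def __iter__(self):
        # Like a mapping, iteration yields keys rather than dataset entries.
        return iter(self.values)


single = StandInDictConfig(dataset_name="AIME2024")

# Checking only for `dict` misses the stand-in, so it is iterated key by key.
print(as_dataset_list(single))
# Including the extra type wraps the single config as intended.
print(as_dataset_list(single, (dict, StandInDictConfig)))
```

The first call returns the config's keys instead of a one-element list of configs, which is the kind of silent mis-normalization the wider `isinstance` check is meant to prevent.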
🧹 Nitpick comments (1)
nemo_rl/utils/config.py (1)

30-44: Consider supporting nested _override_ markers for consistency and future-proofing.

Currently, the function only processes top-level _override_ markers. While no nested markers are used today, a recursive approach would ensure consistent behavior if nested overrides are ever needed.

Suggested recursive handling
+def _apply_override_markers(
+    base_cfg: DictConfig, override_cfg: DictConfig
+) -> None:
+    for key, value in list(override_cfg.items()):
+        if isinstance(value, DictConfig):
+            if value.get("_override_", False):
+                value.pop("_override_")
+                base_cfg.pop(key, None)
+            else:
+                child_base = base_cfg.get(key)
+                if isinstance(child_base, DictConfig):
+                    _apply_override_markers(child_base, value)
+                else:
+                    _apply_override_markers(OmegaConf.create({}), value)
+
 def merge_with_override(
     base_config: DictConfig, override_config: DictConfig
 ) -> DictConfig:
     """Merge configs with support for _override_ marker to completely override sections."""
-    for key in list(override_config.keys()):
-        if isinstance(override_config[key], DictConfig):
-            if override_config[key].get("_override_", False):
-                # remove the _override_ marker
-                override_config[key].pop("_override_")
-                # remove the key from base_config so it won't be merged
-                if key in base_config:
-                    base_config.pop(key)
+    _apply_override_markers(base_config, override_config)
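
For readers who want to try the recursive idea from the suggestion above without OmegaConf, here is a plain-dictionary approximation. It is a sketch under that substitution, not the reviewer's or the PR's code: `apply_override_markers` is a hypothetical name, and the example keys are illustrative.

```python
from typing import Any


def apply_override_markers(base: dict[str, Any], override: dict[str, Any]) -> None:
    """Strip `_override_: true` markers in place and drop the marked subtrees from `base`."""
    for key, value in override.items():
        if not isinstance(value, dict):
            continue
        if value.get("_override_", False):
            value.pop("_override_")
            base.pop(key, None)
        else:
            # Descend in parallel; use a throwaway dict when base has no matching subtree.
            child_base = base.get(key)
            apply_override_markers(
                child_base if isinstance(child_base, dict) else {}, value
            )


base = {"data": {"train": {"dataset_name": "A", "seed": 1}, "shuffle": True}}
override = {"data": {"train": {"_override_": True, "dataset_name": "B"}}}

apply_override_markers(base, override)
print(base)
print(override)
```

After this pass the nested `train` subtree is gone from `base` and the marker is gone from `override`, so an ordinary merge afterwards would take `train` from the child alone while still inheriting `shuffle`.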

@yuki-97 yuki-97 requested a review from a team as a code owner February 2, 2026 05:20
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 2, 2026
@yuki-97 yuki-97 force-pushed the yukih/multiple-dataset branch from d1d8e05 to c570a05 Compare February 2, 2026 05:23
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 2, 2026
Signed-off-by: Yuki Huang <[email protected]>
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 2, 2026
Signed-off-by: Yuki Huang <[email protected]>
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 2, 2026
@terrykong terrykong enabled auto-merge (squash) February 3, 2026 01:45
@terrykong terrykong merged commit 27ba6a0 into main Feb 3, 2026
41 of 42 checks passed
@terrykong terrykong deleted the yukih/multiple-dataset branch February 3, 2026 01:47
Labels

CI:L1: Run doctests, unit tests, and functional tests
documentation: Improvements or additions to documentation

4 participants