Refactor dataset module for train #909

@yuki-97

Description

Steps:

  1. Dataset Refactor: refactor: refactor dataset module #977
  2. Decouple Train and Validation Dataset:
    1. SFT/RL: refactor: split train and val dataset in response dataset #1649
    2. RM/DPO: refactor: split train and val dataset in preference dataset #1763
  3. Multiple Datasets Support:
    1. SFT/RL: feat: support multiple datasets for response dataset #1691
    2. RM/DPO:
  4. Clean up
    1. Clean up GRPO: environments: refactor: unify entrypoint for different envs #1841
    2. Refactor data processor.
    3. Refactor prompt management.
    4. Unify dataset_name and dataset_cls.

Step 1: Dataset Refactor

  1. Add a general dataset class for each mode: sft_dataset, preference_dataset (for RM and DPO), and rl_dataset. Keys such as prompt_key, chosen_key, and rejected_key specify how to read a local or HuggingFace dataset, so we no longer need to write a new dataset class for each one.
  2. We'll keep the built-in datasets (e.g. open_assistant, HelpSteer3, etc.) so that others can accurately reproduce our results.

After the refactor, the usage will become:

  1. For the built-in datasets, the usage is the same as before.
  2. For general datasets (local/hf), an example for DPO is below.
data:
    train_data_path: /path/to/local/train_dataset.jsonl
    val_data_path: /path/to/local/val_dataset.jsonl
    dataset_name: BinaryPreferenceDataset
    prompt_key: prompt
    chosen_key: chosen
    rejected_key: rejected
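
To make the key-based approach concrete, here is a minimal sketch of how a general preference dataset could read a local JSONL file using the configured keys. The class name `KeyedPreferenceDataset` and its `load` method are hypothetical and only for illustration; they are not the actual implementation from the linked PRs.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class KeyedPreferenceDataset:
    """Hypothetical key-driven preference dataset (illustration only)."""

    data_path: str
    prompt_key: str = "prompt"
    chosen_key: str = "chosen"
    rejected_key: str = "rejected"

    def load(self) -> list:
        """Read a JSONL file and map the configured keys to a fixed schema."""
        records = []
        with Path(self.data_path).open(encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                row = json.loads(line)
                records.append(
                    {
                        "prompt": row[self.prompt_key],
                        "chosen": row[self.chosen_key],
                        "rejected": row[self.rejected_key],
                    }
                )
        return records
```

With this shape, a dataset whose columns are named differently (say `context` instead of `prompt`) only needs a different `prompt_key` in the config rather than a new class.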

Step 2: Decouple Train and Validation Dataset
The train and validation datasets are currently coupled, which means we need to write the same logic twice (once for train and once for eval) whenever we add support for a new dataset, so it's better to decouple them.
After this, the usage will become:

data:
    train:
        data_path: /path/to/local/train_dataset.jsonl
        dataset_name: BinaryPreferenceDataset
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
    validation:
        data_path: /path/to/local/val_dataset.jsonl
        dataset_name: BinaryPreferenceDataset
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
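
As a rough sketch of what the decoupling buys us, the same builder can be applied to each split independently, so the loading logic is written once. Everything here (`DATASET_REGISTRY`, `register`, `build_split`, `build_datasets`, and the stand-in loader) is a hypothetical illustration, not the module's real API; the config is shown as a plain dict mirroring the YAML above.

```python
# Hypothetical registry mapping dataset_name to a builder (illustration only).
DATASET_REGISTRY = {}


def register(name):
    """Decorator that records a builder under the given dataset_name."""
    def decorator(fn):
        DATASET_REGISTRY[name] = fn
        return fn
    return decorator


@register("BinaryPreferenceDataset")
def build_binary_preference(data_path, prompt_key="prompt",
                            chosen_key="chosen", rejected_key="rejected"):
    # Stand-in for real loading: just return the resolved settings.
    return {
        "data_path": data_path,
        "prompt_key": prompt_key,
        "chosen_key": chosen_key,
        "rejected_key": rejected_key,
    }


def build_split(split_cfg):
    """Build one split from its own config block, without touching the input."""
    cfg = dict(split_cfg)
    name = cfg.pop("dataset_name")
    return DATASET_REGISTRY[name](**cfg)


def build_datasets(data_cfg):
    """Build train and (optionally) validation with the same code path."""
    train = build_split(data_cfg["train"])
    val = build_split(data_cfg["validation"]) if "validation" in data_cfg else None
    return train, val
```

Because each split carries its own `dataset_name` and keys, the validation set could in principle come from a different dataset type than the train set.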

Step 3: Multiple Datasets Support
After this, the usage will become:

data:
    train:
        # this dataset will override prompt_key and use the default values for other vars
        - data_path: /path/to/local/train_dataset_1.jsonl
          prompt_key: context
        # this dataset will use all the default values
        - data_path: /path/to/local/train_dataset_2.jsonl
    validation:
        - data_path: /path/to/local/val_dataset.jsonl
    default:
        # the values below are used as defaults when a dataset doesn't specify them
        dataset_name: BinaryPreferenceDataset
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
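
The override behavior described in the comments above could be resolved roughly as follows. This is a sketch under assumptions: `resolve_datasets` is a hypothetical helper, the config is a plain dict mirroring the YAML, and accepting a single mapping in place of a list is our own convenience, not something the proposal specifies.

```python
def resolve_datasets(data_cfg):
    """Merge the `default` block into every train/validation dataset entry.

    Values set on an entry win over the defaults.
    """
    defaults = data_cfg.get("default") or {}
    resolved = {}
    for split in ("train", "validation"):
        entries = data_cfg.get(split) or []
        if isinstance(entries, dict):
            # Assumption: treat a single mapping as a one-element list.
            entries = [entries]
        resolved[split] = [{**defaults, **entry} for entry in entries]
    return resolved


data_cfg = {
    "train": [
        {"data_path": "/path/to/local/train_dataset_1.jsonl", "prompt_key": "context"},
        {"data_path": "/path/to/local/train_dataset_2.jsonl"},
    ],
    "validation": [{"data_path": "/path/to/local/val_dataset.jsonl"}],
    "default": {
        "dataset_name": "BinaryPreferenceDataset",
        "prompt_key": "prompt",
        "chosen_key": "chosen",
        "rejected_key": "rejected",
    },
}
resolved = resolve_datasets(data_cfg)
```

Here the first train entry keeps `prompt_key: context`, while the second train entry and the validation entry pick up every value from `default`.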

Related issues / discussions
#688, #830

Labels: UX (related to user experience), data module