[recipe] feat: add Qwen3-0.6B 128K SFT recipe with YaRN RoPE scaling #3316
📝 Walkthrough
Adds a new long-context supervised fine-tuning recipe for Qwen3-0.6B with YaRN RoPE scaling at 128K sequence length. Includes a Bash training script, exported recipe configuration, and a recipe factory function that configures the model, dataset with HuggingFace integration, context parallelism, and long-context training hyperparameters.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
examples/long_context/qwen3_600m_sft_yarn_128k.sh (2)
Line 26: Unused variable `SEQ_LENGTH`.
`SEQ_LENGTH` is defined but never referenced in the script. The sequence length is embedded in the recipe itself (128*1024). Consider removing this variable or adding a comment explaining it's for documentation purposes.

Suggested fix:

```diff
-SEQ_LENGTH=131072
+# Sequence length is configured in the recipe (128K = 131072)
```

Or simply remove the line if it serves no purpose.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/long_context/qwen3_600m_sft_yarn_128k.sh` at line 26, The variable SEQ_LENGTH is defined but never used; either remove the SEQ_LENGTH line or make its intent explicit by converting it into a documented constant/comment (e.g., note that the recipe uses 128*1024) so it isn't unused — update the SEQ_LENGTH declaration accordingly or delete it to eliminate the dead variable.
Lines 37-50: Consider quoting variable expansions to prevent word splitting.
Static analysis flagged unquoted variables. While these controlled variables are unlikely to cause issues, quoting prevents unexpected behavior if paths contain spaces.

Suggested fix:

```diff
 uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
-    --recipe ${MODEL_NAME}_sft_yarn_128k_config \
-    checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
+    --recipe "${MODEL_NAME}_sft_yarn_128k_config" \
+    checkpoint.pretrained_checkpoint="$PRETRAINED_CHECKPOINT" \
     train.train_iters=$TRAIN_ITERS \
     train.global_batch_size=$GLOBAL_BATCH_SIZE \
     train.micro_batch_size=$MICRO_BATCH_SIZE \
     validation.eval_iters=$EVAL_ITERS \
     validation.eval_interval=$EVAL_INTERVAL \
     scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
-    checkpoint.save=${WORKSPACE}/results/${MODEL_NAME}_yarn_128k_sft \
-    checkpoint.load=${WORKSPACE}/results/${MODEL_NAME}_yarn_128k_sft \
+    checkpoint.save="${WORKSPACE}/results/${MODEL_NAME}_yarn_128k_sft" \
+    checkpoint.load="${WORKSPACE}/results/${MODEL_NAME}_yarn_128k_sft" \
     logger.log_interval=$LOG_INTERVAL \
-    logger.wandb_project=$WANDB_PROJECT \
-    logger.wandb_exp_name=${MODEL_NAME}_${DATASET_NAME}_yarn_128k_sft
+    logger.wandb_project="$WANDB_PROJECT" \
+    logger.wandb_exp_name="${MODEL_NAME}_${DATASET_NAME}_yarn_128k_sft"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/long_context/qwen3_600m_sft_yarn_128k.sh` around lines 37 - 50, The shell command uses unquoted variable expansions (e.g., ${MODEL_NAME}_sft_yarn_128k_config, $PRETRAINED_CHECKPOINT, $TRAIN_ITERS, $GLOBAL_BATCH_SIZE, $MICRO_BATCH_SIZE, $EVAL_ITERS, $EVAL_INTERVAL, $LR_WARMUP_ITERS, ${WORKSPACE}/results/${MODEL_NAME}_yarn_128k_sft, $LOG_INTERVAL, $WANDB_PROJECT, ${MODEL_NAME}_${DATASET_NAME}_yarn_128k_sft) which can cause word-splitting if any contain spaces; update the invocation in the uv run / python -m torch.distributed.run command to wrap each variable expansion in double quotes (e.g., "$PRETRAINED_CHECKPOINT", "${WORKSPACE}/results/${MODEL_NAME}_yarn_128k_sft", etc.) so all arguments are passed as single tokens.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/recipes/qwen/qwen3.py`:
- Around line 659-660: The block that sets context parallelism only assigns
cfg.model.context_parallel_size = 8 but omits the CP communication mode; mirror
the earlier recipe (qwen3_600m_sft_128k_config) by also setting
cfg.model.cp_comm_type = "a2a" to use all-to-all CP and avoid NaN gradients—add
cfg.model.cp_comm_type = "a2a" alongside cfg.model.context_parallel_size = 8 in
the same config function/block.
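As a hedged illustration of that fix: the attribute names below come from the review text, but the config object is a stand-in, not the real Megatron-Bridge `ConfigContainer` API.

```python
from types import SimpleNamespace

# Stand-in for the recipe's config object (assumption; in the real recipe this
# is the config built by the qwen3 factory function under review).
cfg = SimpleNamespace(model=SimpleNamespace())

# Split the 128K sequence across 8 context-parallel ranks...
cfg.model.context_parallel_size = 8
# ...and pair it with all-to-all CP communication, mirroring
# qwen3_600m_sft_128k_config, which the review says avoids NaN gradients.
cfg.model.cp_comm_type = "a2a"

print(cfg.model.context_parallel_size, cfg.model.cp_comm_type)  # → 8 a2a
```

The point of the comment is that the two settings belong together in the same config block, not that either value is wrong on its own.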
---
Nitpick comments:
In `@examples/long_context/qwen3_600m_sft_yarn_128k.sh`:
- Line 26: The variable SEQ_LENGTH is defined but never used; either remove the
SEQ_LENGTH line or make its intent explicit by converting it into a documented
constant/comment (e.g., note that the recipe uses 128*1024) so it isn't unused —
update the SEQ_LENGTH declaration accordingly or delete it to eliminate the dead
variable.
- Around line 37-50: The shell command uses unquoted variable expansions (e.g.,
${MODEL_NAME}_sft_yarn_128k_config, $PRETRAINED_CHECKPOINT, $TRAIN_ITERS,
$GLOBAL_BATCH_SIZE, $MICRO_BATCH_SIZE, $EVAL_ITERS, $EVAL_INTERVAL,
$LR_WARMUP_ITERS, ${WORKSPACE}/results/${MODEL_NAME}_yarn_128k_sft,
$LOG_INTERVAL, $WANDB_PROJECT, ${MODEL_NAME}_${DATASET_NAME}_yarn_128k_sft)
which can cause word-splitting if any contain spaces; update the invocation in
the uv run / python -m torch.distributed.run command to wrap each variable
expansion in double quotes (e.g., "$PRETRAINED_CHECKPOINT",
"${WORKSPACE}/results/${MODEL_NAME}_yarn_128k_sft", etc.) so all arguments are
passed as single tokens.
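A quick demonstration of the word splitting the reviewer is guarding against; the path here is made up purely for illustration:

```shell
WORKSPACE="/tmp/my workspace"   # hypothetical path containing a space

set -- ${WORKSPACE}/results     # unquoted expansion: splits on the space
echo "unquoted: $# args"        # prints: unquoted: 2 args

set -- "${WORKSPACE}/results"   # quoted expansion: stays one argument
echo "quoted: $# args"          # prints: quoted: 1 args
```

With the unquoted form, `checkpoint.save` would receive only `/tmp/my` and the launcher would see a stray extra argument, which is why quoting every expansion is the safer default.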
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: c1053130-9c1f-4032-83c2-5af417cafb7e
📒 Files selected for processing (3)
- examples/long_context/qwen3_600m_sft_yarn_128k.sh
- src/megatron/bridge/recipes/qwen/__init__.py
- src/megatron/bridge/recipes/qwen/qwen3.py
/ok to test 119631c
Signed-off-by: ruit <ruit@nvidia.com>
Force-pushed from 119631c to 9bbddfe
/ok to test 9bbddfe
What does this PR do?
Adds a new SFT training recipe for Qwen3-0.6B at 128K context length using YaRN RoPE scaling, together with a reference launch script.
YaRN scaling extends Qwen3-0.6B's native 40K context window to 128K.
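The exact scaling parameters aren't shown in this excerpt. As a back-of-the-envelope sketch, YaRN's scaling factor is the target length divided by the original RoPE window; the 32768 below is an assumed value matching Qwen3's published YaRN configurations, not a number taken from this recipe.

```python
# Assumed values (not from this recipe): Qwen3's published YaRN configs
# scale from an original RoPE window of 32768 positions.
original_max_position_embeddings = 32_768
target_seq_length = 128 * 1024          # 131072, the 128K training length

# YaRN stretches RoPE positions by this factor.
yarn_factor = target_seq_length / original_max_position_embeddings
print(yarn_factor)  # → 4.0
```

If the recipe instead scales from the checkpoint's 40960-position window, the factor would differ accordingly; the arithmetic is the same either way.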
Long-context SFT stability settings:
Dataset
A specific data file is used to avoid spending too much time on data download.
nvidia/Nemotron-Cascade-2-SFT-Data
Overview of dataset
Result
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes