## Feature Description & Motivation

The torchtitan test case (`3.test_cases/pytorch/torchtitan/`) was written around March 2025 against an early version of `pytorch/torchtitan`. Since then, torchtitan has had three releases (v0.1.0, v0.2.0, v0.2.1), restructured its directory layout, expanded from 1 model family to 6, and added numerous distributed training features. The test case is now broken out of the box because the config path it references no longer exists upstream.
### What's broken

- **Config path no longer exists** — the sbatch script references `torchtitan/models/llama/train_configs/llama3_8b.toml`, but upstream renamed `models/llama/` to `models/llama3/` (the old path returns a 404). The test case fails immediately on a fresh clone.
- **Float8 TOML keys are stale** — the README shows a `[float8]` section with `enable_float8_linear = true`, but upstream restructured this to `[quantize.linear.float8]` and removed the `enable_float8_linear` key. Following the README instructions produces a config error.
- **CUDA version mismatch** — `LD_PRELOAD` points to `/usr/local/cuda-12.1/lib/libnccl.so`, but pip installs from `cu124` (CUDA 12.4). The latest torchtitan release, v0.2.1, uses `cu126` (CUDA 12.6).
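The first and third breakages reduce to two lines in the sbatch script. A minimal sketch of the corrected values, assuming a CUDA 12.6 toolkit under `/usr/local` and the v0.2.1 directory layout (both paths are assumptions to be checked against the actual cluster):

```shell
# Hypothetical corrected lines for the sbatch script; paths are assumptions
# and should match the toolkit/checkout actually present on the cluster.

# Old path (now 404s upstream): torchtitan/models/llama/train_configs/llama3_8b.toml
CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml"

# Point LD_PRELOAD at the same CUDA version the torch wheels were built
# against (cu126), not the hard-coded 12.1 path:
CUDA_HOME="/usr/local/cuda-12.6"
export LD_PRELOAD="${CUDA_HOME}/lib/libnccl.so"

echo "config:  ${CONFIG_FILE}"
echo "preload: ${LD_PRELOAD}"
```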
### What's outdated

- **Unpinned versions everywhere** — `git clone` fetches torchtitan at HEAD with no tag or commit pinned, `pip3 install --pre torch` pulls whatever nightly is current, and `pip install --pre torchao` likewise. Torchtitan now has releases (v0.1.0, v0.2.0, v0.2.1) that pin compatible torch + torchao versions, which should be used instead.
- **Only Llama 3.1 8B** — upstream now provides training configs for 6 model families:

  | Model | Configs available | Notes |
  |---|---|---|
  | Llama 3.1 | 8B, 70B, 405B | Already partially covered |
  | Llama 4 (MoE) | 17Bx16E, 17Bx128E | Mixture-of-Experts with expert parallelism |
  | DeepSeek-V3 | 16B, 671B | MoE architecture |
  | Qwen 3 | 0.6B, 1.7B, 32B, MoE | Dense and MoE variants |
  | Flux | dev, schnell | Image generation (diffusion) |
  | GPT-OSS | debug | HF checkpoint loading |
- **Key-features list is incomplete** — the README lists 6 upstream features. Upstream now advertises 16+, including many that are directly relevant to HyperPod users:

  | Upstream feature | In ADT test case? |
  |---|---|
  | FSDP2 with per-param sharding | Yes (default config) |
  | FP8 via torchao | Partially (stale TOML keys) |
  | torch.compile | Mentioned but not demonstrated |
  | Async Tensor Parallelism | Mentioned but `tp_degree=1` in config |
  | Pipeline Parallelism (zero-bubble) | Mentioned but `pp_degree=1` in config |
  | Context Parallelism | Mentioned but `cp_degree=1` in config |
  | MXFP8 (Blackwell GPUs) | No |
  | DDP / HSDP | No |
  | TorchFT (fault-tolerant elastic training) | No |
  | Distributed Checkpointing (async DCP) | No (`enable = false` in config) |
  | Activation Checkpointing | No (not demonstrated) |
  | Gradient Accumulation | No |
  | WandB logging | No |
  | Debugging / profiling tools | No |
  | Distributed inference | No |
- **No Dockerfile / container option** — the test case uses a conda environment only. Other ADT test cases (FSDP, DeepSpeed, picotron, TRL, verl) provide Dockerfiles for reproducible container-based execution with Pyxis/Enroot.
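The version-pinning gap can be sketched concretely. This is a dry-run sketch of what a fixed `0.create_conda_env.sh` could run (it only prints the commands; drop the `echo`s to execute). The version strings come from the v0.2.1 release notes quoted under "Upstream reference"; the nightly wheel index URL is an assumption to verify:

```shell
# Dry-run sketch: prints a pinned setup instead of cloning HEAD and
# installing whatever nightly is current. Remove the `echo`s to execute.
TORCHTITAN_TAG="v0.2.1"
TORCH_SPEC="torch==2.11.0.dev20251226+cu126"
TORCHAO_SPEC="torchao==0.16.0.dev20251226+cu126"
NIGHTLY_INDEX="https://download.pytorch.org/whl/nightly/cu126"

# Pin the clone to a release tag rather than HEAD:
echo git clone --branch "${TORCHTITAN_TAG}" --depth 1 https://github.com/pytorch/torchtitan.git

# Install the exact torch/torchao builds the release was validated against:
echo pip3 install --pre "${TORCH_SPEC}" "${TORCHAO_SPEC}" --index-url "${NIGHTLY_INDEX}"
```

Pinning all three together keeps the CUDA suffix (`cu126`) consistent across torch, torchao, and the `LD_PRELOAD` path.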
## Category

Enhancement to existing test case

## Alternatives Considered

No alternatives — this is about updating the existing test case to reflect upstream's current state.

## Additional Context

### Affected files
| File | Issues |
|---|---|
| `3.test_cases/pytorch/torchtitan/README.md` | Stale key-features list; stale float8 TOML example |
| `3.test_cases/pytorch/torchtitan/slurm/README.md` | References old `models/llama/` path; stale float8 config instructions |
| `3.test_cases/pytorch/torchtitan/slurm/0.create_conda_env.sh` | Unpinned `git clone`; unpinned nightly torch/torchao; uses `cu124` but should match the release |
| `3.test_cases/pytorch/torchtitan/slurm/1.llama_3_8b_torchtitan.sh` | Broken config path (`models/llama/` → `models/llama3/`); `LD_PRELOAD` CUDA 12.1 mismatch |
### Suggested fixes (prioritized)

| Priority | Fix | Effort |
|---|---|---|
| P0 | Fix broken config path: `models/llama/` → `models/llama3/` | Trivial |
| P0 | Pin torchtitan to a release tag (e.g., `git checkout v0.2.1`) and install matching torch + torchao versions from the release notes | Small |
| P0 | Fix CUDA version in `LD_PRELOAD` to match the installed CUDA toolkit | Trivial |
| P1 | Update float8 config example to use `[quantize.linear.float8]` | Trivial |
| P1 | Add configs demonstrating TP, PP, and CP (e.g., a `llama3_70b` config with `pp_degree=4`, `tp_degree=2`) | Medium |
| P1 | Add activation checkpointing and async DCP examples | Small |
| P2 | Add a Dockerfile for container-based execution (Pyxis/Enroot compatible) | Medium |
| P2 | Add configs for additional model families (Llama 4 MoE, DeepSeek-V3, Qwen 3) | Medium |
| P2 | Document torch.compile and FP8 as first-class config options rather than afterthought "optimization tips" | Small |
| P3 | Add WandB logging configuration example | Small |
| P3 | Update the key-features list in the top-level README to match upstream | Trivial |
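To make the P1 items concrete, here is a sketch of the TOML fragment an updated `llama3_70b` example might carry. Section names follow the layout quoted under "Upstream reference"; the key names and degree values are assumptions and should be checked against the pinned release:

```toml
# Illustrative only -- verify key names against the pinned torchtitan release.

[parallelism]
tensor_parallel_degree = 2     # the shipped config leaves this at 1
pipeline_parallel_degree = 4   # the shipped config leaves this at 1
context_parallel_degree = 1

[compile]
enable = true                  # exercise torch.compile rather than only mentioning it

[activation_checkpoint]
mode = "selective"             # trade memory for recompute on large models

[quantize.linear.float8]
# Replaces the removed [float8] section / enable_float8_linear key.
enable_fsdp_float8_all_gather = true
```

One such config checked into the test case would turn three "Mentioned but degree=1" rows in the feature table into demonstrated features.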
### Upstream reference

- Repo: pytorch/torchtitan — 5,076 stars, actively developed (daily commits)
- Latest release: v0.2.1 (2025-12-26), requires `torch-2.11.0.dev20251226+cu126` + `torchao-0.16.0.dev20251226+cu126`
- Models directory: `torchtitan/models/{llama3, llama4, deepseek_v3, qwen3, flux, gpt_oss}`
- Config format: TOML with sections `[job]`, `[model]`, `[training]`, `[parallelism]`, `[compile]`, `[activation_checkpoint]`, `[quantize.linear.float8]`, `[checkpoint]`, `[validation]`
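Similarly, the async-DCP and logging gaps could be closed with a fragment like the following. This is a sketch: the `async_mode` value and the `[metrics]` section key are assumptions from memory of upstream, not verified against v0.2.1:

```toml
[checkpoint]
enable = true            # the shipped config has enable = false
folder = "checkpoint"
interval = 500           # steps between checkpoint saves
async_mode = "async"     # assumed value selecting asynchronous DCP saves

[metrics]
enable_wandb = true      # assumed key for WandB logging
```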