## Feature Description & Motivation

The torchtitan test case (`3.test_cases/pytorch/torchtitan/`) was written around March 2025 against an early version of `pytorch/torchtitan`. Since then, torchtitan has had three releases (v0.1.0, v0.2.0, v0.2.1), restructured its directory layout, expanded from 1 model family to 6, and added numerous distributed training features. The test case is now broken out of the box because the config path it references no longer exists upstream.
### What's broken

- **Config path no longer exists** — the sbatch script references `torchtitan/models/llama/train_configs/llama3_8b.toml`, but upstream renamed `models/llama/` to `models/llama3/` (the old path returns a 404). The test case fails immediately on a fresh clone.
- **Float8 TOML keys are stale** — the README shows a `[float8]` section with `enable_float8_linear = true`, but upstream restructured this to `[quantize.linear.float8]` and removed the `enable_float8_linear` key. Following the README instructions produces a config error.
- **CUDA version mismatch** — `LD_PRELOAD` points to `/usr/local/cuda-12.1/lib/libnccl.so`, but pip installs from `cu124` (CUDA 12.4). The latest torchtitan release, v0.2.1, uses `cu126` (CUDA 12.6).
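The first and third breakages reduce to two lines in the sbatch script. A minimal sketch of the corrected values, assuming a CUDA 12.6 toolkit under `/usr/local` and the v0.2.1 directory layout (both paths are assumptions to be checked against the actual cluster):

```shell
# Hypothetical corrected lines for the sbatch script; paths are assumptions
# and should match the toolkit/checkout actually present on the cluster.

# Old path (now 404s upstream): torchtitan/models/llama/train_configs/llama3_8b.toml
CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml"

# Point LD_PRELOAD at the same CUDA version the torch wheels were built
# against (cu126), not the hard-coded 12.1 path:
CUDA_HOME="/usr/local/cuda-12.6"
export LD_PRELOAD="${CUDA_HOME}/lib/libnccl.so"

echo "config:  ${CONFIG_FILE}"
echo "preload: ${LD_PRELOAD}"
```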
### What's outdated

- **Unpinned versions everywhere** — `git clone` fetches torchtitan at HEAD with no tag or commit pinned, `pip3 install --pre torch` pulls whatever nightly is current, and `pip install --pre torchao` likewise. Torchtitan now has releases (v0.1.0, v0.2.0, v0.2.1) that pin compatible torch + torchao versions, which should be used instead.
- **Only Llama 3.1 8B** — upstream now provides training configs for 6 model families:

  | Model | Configs available | Notes |
  |---|---|---|
  | Llama 3.1 | 8B, 70B, 405B | Already partially covered |
  | Llama 4 (MoE) | 17Bx16E, 17Bx128E | Mixture-of-Experts with expert parallelism |
  | DeepSeek-V3 | 16B, 671B | MoE architecture |
  | Qwen 3 | 0.6B, 1.7B, 32B, MoE | Dense and MoE variants |
  | Flux | dev, schnell | Image generation (diffusion) |
  | GPT-OSS | debug | HF checkpoint loading |
- **Key-features list is incomplete** — the README lists 6 upstream features. Upstream now advertises 16+, including many that are directly relevant to HyperPod users:

  | Upstream feature | In ADT test case? |
  |---|---|
  | FSDP2 with per-param sharding | Yes (default config) |
  | FP8 via torchao | Partially (stale TOML keys) |
  | torch.compile | Mentioned but not demonstrated |
  | Async Tensor Parallelism | Mentioned but `tp_degree=1` in config |
  | Pipeline Parallelism (zero-bubble) | Mentioned but `pp_degree=1` in config |
  | Context Parallelism | Mentioned but `cp_degree=1` in config |
  | MXFP8 (Blackwell GPUs) | No |
  | DDP / HSDP | No |
  | TorchFT (fault-tolerant elastic training) | No |
  | Distributed Checkpointing (async DCP) | No (`enable = false` in config) |
  | Activation Checkpointing | No (not demonstrated) |
  | Gradient Accumulation | No |
  | WandB logging | No |
  | Debugging / profiling tools | No |
  | Distributed inference | No |
- **No Dockerfile / container option** — the test case uses a conda environment only. Other ADT test cases (FSDP, DeepSpeed, picotron, TRL, verl) provide Dockerfiles for reproducible container-based execution with Pyxis/Enroot.
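The version-pinning gap can be sketched concretely. This is a dry-run sketch of what a fixed `0.create_conda_env.sh` could run (it only prints the commands; drop the `echo`s to execute). The version strings come from the v0.2.1 release notes quoted under "Upstream reference"; the nightly wheel index URL is an assumption to verify:

```shell
# Dry-run sketch: prints a pinned setup instead of cloning HEAD and
# installing whatever nightly is current. Remove the `echo`s to execute.
TORCHTITAN_TAG="v0.2.1"
TORCH_SPEC="torch==2.11.0.dev20251226+cu126"
TORCHAO_SPEC="torchao==0.16.0.dev20251226+cu126"
NIGHTLY_INDEX="https://download.pytorch.org/whl/nightly/cu126"

# Pin the clone to a release tag rather than HEAD:
echo git clone --branch "${TORCHTITAN_TAG}" --depth 1 https://github.com/pytorch/torchtitan.git

# Install the exact torch/torchao builds the release was validated against:
echo pip3 install --pre "${TORCH_SPEC}" "${TORCHAO_SPEC}" --index-url "${NIGHTLY_INDEX}"
```

Pinning all three together keeps the CUDA suffix (`cu126`) consistent across torch, torchao, and the `LD_PRELOAD` path.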
## Category

Enhancement to existing test case

## Alternatives Considered

No alternatives — this is about updating the existing test case to reflect upstream's current state.

## Additional Context

### Affected files
| File | Issues |
|---|---|
| `3.test_cases/pytorch/torchtitan/README.md` | Stale key-features list; stale float8 TOML example |
| `3.test_cases/pytorch/torchtitan/slurm/README.md` | References old `models/llama/` path; stale float8 config instructions |
| `3.test_cases/pytorch/torchtitan/slurm/0.create_conda_env.sh` | Unpinned `git clone`; unpinned nightly torch/torchao; uses `cu124` but should match the release |
| `3.test_cases/pytorch/torchtitan/slurm/1.llama_3_8b_torchtitan.sh` | Broken config path (`models/llama/` → `models/llama3/`); `LD_PRELOAD` CUDA 12.1 mismatch |
### Suggested fixes (prioritized)

| Priority | Fix | Effort |
|---|---|---|
| P0 | Fix broken config path: `models/llama/` → `models/llama3/` | Trivial |
| P0 | Pin torchtitan to a release tag (e.g., `git checkout v0.2.1`) and install matching torch + torchao versions from the release notes | Small |
| P0 | Fix CUDA version in `LD_PRELOAD` to match the installed CUDA toolkit | Trivial |
| P1 | Update float8 config example to use `[quantize.linear.float8]` | Trivial |
| P1 | Add configs demonstrating TP, PP, and CP (e.g., a `llama3_70b` config with `pp_degree=4`, `tp_degree=2`) | Medium |
| P1 | Add activation checkpointing and async DCP examples | Small |
| P2 | Add a Dockerfile for container-based execution (Pyxis/Enroot compatible) | Medium |
| P2 | Add configs for additional model families (Llama 4 MoE, DeepSeek-V3, Qwen 3) | Medium |
| P2 | Document torch.compile and FP8 as first-class config options rather than afterthought "optimization tips" | Small |
| P3 | Add WandB logging configuration example | Small |
| P3 | Update the key-features list in the top-level README to match upstream | Trivial |
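To make the P1 items concrete, here is a sketch of the TOML fragment an updated `llama3_70b` example might carry. Section names follow the layout quoted under "Upstream reference"; the key names and degree values are assumptions and should be checked against the pinned release:

```toml
# Illustrative only -- verify key names against the pinned torchtitan release.

[parallelism]
tensor_parallel_degree = 2     # the shipped config leaves this at 1
pipeline_parallel_degree = 4   # the shipped config leaves this at 1
context_parallel_degree = 1

[compile]
enable = true                  # exercise torch.compile rather than only mentioning it

[activation_checkpoint]
mode = "selective"             # trade memory for recompute on large models

[quantize.linear.float8]
# Replaces the removed [float8] section / enable_float8_linear key.
enable_fsdp_float8_all_gather = true
```

One such config checked into the test case would turn three "Mentioned but degree=1" rows in the feature table into demonstrated features.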
### Upstream reference

- Repo: pytorch/torchtitan — 5,076 stars, actively developed (daily commits)
- Latest release: v0.2.1 (2025-12-26), requires `torch-2.11.0.dev20251226+cu126` + `torchao-0.16.0.dev20251226+cu126`
- Models directory: `torchtitan/models/{llama3, llama4, deepseek_v3, qwen3, flux, gpt_oss}`
- Config format: TOML with sections `[job]`, `[model]`, `[training]`, `[parallelism]`, `[compile]`, `[activation_checkpoint]`, `[quantize.linear.float8]`, `[checkpoint]`, `[validation]`
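Similarly, the async-DCP and logging gaps could be closed with a fragment like the following. This is a sketch: the `async_mode` value and the `[metrics]` section key are assumptions from memory of upstream, not verified against v0.2.1:

```toml
[checkpoint]
enable = true            # the shipped config has enable = false
folder = "checkpoint"
interval = 500           # steps between checkpoint saves
async_mode = "async"     # assumed value selecting asynchronous DCP saves

[metrics]
enable_wandb = true      # assumed key for WandB logging
```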