# TransformerEngine-accelerated CodonFM training with native PyTorch training loop

This folder demonstrates how to train TE-accelerated
[CodonFM](https://research.nvidia.com/labs/dbr/assets/data/manuscripts/nv-codonfm-preprint.pdf) with a native PyTorch
training loop, including sequence packing, FP8/NVFP4 precision with layer-wise control, and fully sharded data
parallel (FSDP2) for distributed training.

CodonFM is a suite of foundation models trained directly on codon sequences to learn contextual codon representations and
enable downstream codon-aware tasks. This recipe uses the "non-exact" TransformerEngine implementation, which employs
TE's standard `TransformerLayer` rather than a custom reproduction of the original research architecture. Despite the
slight architectural difference, this variant converges on par with the original. For the original PyTorch
Lightning-based recipe (with both "exact" and "non-exact" TE modes), see
[codonfm_ptl_te](../codonfm_ptl_te/).

## How to use this recipe

This folder contains an independent, minimal training example. It does not depend on any other code in the top-level
bionemo-framework repository.

## Supported Training Features

| Feature                | Status        |
| ---------------------- | ------------- |
| BF16                   | Supported     |
| FP8 (DelayedScaling)   | Supported [1] |
| NVFP4 (BlockScaling)   | Supported [2] |
| Layer-wise precision   | Supported     |
| THD sequence packing   | Supported     |
| FSDP2                  | Supported     |
| Checkpoint save/resume | Supported     |

\[1\]: Requires [compute capability](https://developer.nvidia.com/cuda-gpus) 9.0 and above (Hopper+) <br/>
\[2\]: Requires [compute capability](https://developer.nvidia.com/cuda-gpus) 10.0 and above (Blackwell+) <br/>
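
These capability gates can be checked programmatically before enabling a format. A minimal sketch (the helper below is illustrative and not part of the recipe; it only encodes the thresholds from the table):

```python
# Illustrative helper, not part of this recipe: encodes the compute
# capability thresholds from the table above (FP8 needs 9.0+, NVFP4 10.0+).
try:
    import torch
except ImportError:  # torch is in requirements.txt, but guard anyway
    torch = None

_MIN_CAPABILITY = {"fp8": (9, 0), "nvfp4": (10, 0)}


def supports_format(fmt: str, capability: tuple) -> bool:
    """Return True if a GPU with (major, minor) `capability` supports `fmt`."""
    return capability >= _MIN_CAPABILITY[fmt]


if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability()
    print(f"FP8: {supports_format('fp8', cap)}, NVFP4: {supports_format('nvfp4', cap)}")
```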

## Pre-Trained Models

The HuggingFace-compatible model definition lives in
[`bionemo-recipes/models/codonfm/`](../../models/codonfm/). Pre-trained checkpoints for the "exact" TE architecture are
available on the Hugging Face Hub (see the [codonfm_ptl_te README](../codonfm_ptl_te/README.md#pre-trained-models) for
links). A `native_te` checkpoint trained with this recipe will be uploaded in the future.

## Performance Benchmarks

Under development. For TE acceleration benchmarks comparing against the original Xformers-based implementation, see the
[codonfm_ptl_te benchmarks](../codonfm_ptl_te/README.md#nvidia-transformerengine-optimization-benchmarks).

## Repository Structure

```
codonfm_native_te/
├── modeling_codonfm_te.py — HF-compatible CodonFM model with TE layers
├── train_fsdp2.py — FSDP2 training script
├── dataset.py — data loading and collation (BSHD + THD)
├── tokenizer.py — codon tokenizer
├── checkpoint.py — checkpoint save/load utilities
├── perf_logger.py — performance and metrics logging
├── quantization.py — FP8/FP4 quantization utilities
├── scheduler.py — learning rate scheduler
├── distributed_config.py — distributed training configuration
├── hydra_config/ — Hydra configuration files
│   ├── defaults.yaml — default training configuration
│   └── L0_sanity.yaml — quick sanity check configuration
├── train.parquet — sample training data
├── requirements.txt — Python dependencies
└── tests/ — unit and integration tests
```

## Installing Dependencies

The easiest way to get started is to use the devcontainer provided in the top-level repository. Alternatively, install
dependencies manually in an environment with CUDA support:

```bash
pip install -r requirements.txt
```

## Commands to Launch Training

To run single-process training on one GPU:

```bash
python train_fsdp2.py
```

To run multi-process training locally on 2+ GPUs:

```bash
torchrun --nproc_per_node=2 train_fsdp2.py
```

The default configuration (`L0_sanity.yaml`) runs a quick sanity check with 250 steps using the included sample data.
For real training, create a custom Hydra config or override parameters from the command line.

### Quantized Training (FP8 / NVFP4)

To run training with FP8:

```bash
python train_fsdp2.py fp8_config.enabled=true
```

To train with NVFP4 quantization:

```bash
python train_fsdp2.py fp4_config.enabled=true
```

Additional recipe parameters (e.g., switching to `MXFP8BlockScaling`) can be set via the Hydra configuration.
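
For orientation, a quantization block in a Hydra config might look like the hypothetical sketch below. Only `fp8_config.enabled` and `fp4_config.enabled` are confirmed by the commands above; the `recipe` field name is an illustrative assumption, not the recipe's verified schema (see `hydra_config/defaults.yaml` for the actual fields):

```yaml
# Hypothetical sketch only; consult hydra_config/defaults.yaml for the real schema.
fp8_config:
  enabled: false
  recipe: DelayedScaling  # assumption: e.g. switchable to MXFP8BlockScaling
fp4_config:
  enabled: false
```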

### Layer-Wise Precision

You can control which transformer layers use FP8 or FP4 by specifying 1-indexed layer numbers via `fp8_layers` and
`fp4_layers`. Layers not assigned to either format will run in BF16.

For example, to run layers 1-3 in FP8, layers 4-6 in FP4, and the rest in BF16:

```bash
python train_fsdp2.py \
  fp8_config.enabled=true \
  fp4_config.enabled=true \
  'fp8_layers=[1,2,3]' \
  'fp4_layers=[4,5,6]'
```

When both `fp8_config` and `fp4_config` are enabled but only one layer list is provided, the other format automatically
claims the remaining layers.
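
The assignment rule described above can be sketched in plain Python (the helper name and return shape are ours; the recipe's `quantization.py` may implement this differently):

```python
def assign_layer_formats(num_layers, fp8_layers=None, fp4_layers=None):
    """Map 1-indexed layer numbers to a precision format.

    Sketch of the rule described above, assuming both formats are enabled:
    explicitly listed layers get their format; if exactly one list is given,
    the other format claims the rest; anything still unassigned runs in BF16.
    """
    fp8 = set(fp8_layers or [])
    fp4 = set(fp4_layers or [])
    remaining = set(range(1, num_layers + 1)) - fp8 - fp4
    if fp8_layers is not None and fp4_layers is None:
        fp4 |= remaining  # FP4 claims the unlisted layers
    elif fp4_layers is not None and fp8_layers is None:
        fp8 |= remaining  # FP8 claims the unlisted layers
    return {
        n: "fp8" if n in fp8 else "fp4" if n in fp4 else "bf16"
        for n in range(1, num_layers + 1)
    }
```

With `num_layers=8`, `fp8_layers=[1, 2, 3]`, and `fp4_layers=[4, 5, 6]`, layers 7 and 8 map to `"bf16"`, matching the example above.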

### Sequence Packing (THD input format)

Enable sequence packing with:

```bash
python train_fsdp2.py use_sequence_packing=true
```
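
In the THD layout, variable-length sequences are concatenated along the token dimension and delimited by a cumulative-lengths (`cu_seqlens`) array rather than padded to a fixed shape. A minimal illustration in plain Python (the recipe's `dataset.py` collator is the authoritative implementation; this only shows the layout):

```python
from itertools import accumulate


def pack_sequences(seqs):
    """Concatenate token lists and build cumulative sequence offsets.

    Returns (tokens, cu_seqlens), where tokens[cu_seqlens[i]:cu_seqlens[i + 1]]
    recovers sequence i from the packed token stream, as attention kernels
    consuming the THD format expect.
    """
    tokens = [t for seq in seqs for t in seq]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in seqs))
    return tokens, cu_seqlens
```

For example, `pack_sequences([[1, 2, 3], [4, 5]])` returns `([1, 2, 3, 4, 5], [0, 3, 5])`.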

### FP8 and Sequence Packing

To combine FP8 training with sequence packing:

```bash
python train_fsdp2.py fp8_config.enabled=true use_sequence_packing=true
```

### Quantization Stats Debugging

To enable quantization statistics logging:

```bash
python train_fsdp2.py \
  quant_stats_config.enabled=true \
  quant_stats_config.quant_log_dir=./logs/quant_stats \
  quant_stats_config.quant_stats_file=./fp8_debugging_stats.yaml \
  fp8_config.enabled=true
```

The structure of the config file [fp8_debugging_stats.yaml](fp8_debugging_stats.yaml) is explained in the
[NVIDIA Transformer Engine config file documentation](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/debug/2_config_file_structure.html).

## Saving and Loading Checkpoints

To enable checkpoint saving, ensure that `checkpoint.ckpt_dir` is set to a writable directory:

```bash
python train_fsdp2.py \
  checkpoint.ckpt_dir=/path/to/ckpt_dir \
  checkpoint.save_every_n_steps=100
```

To resume from the latest checkpoint:

```bash
python train_fsdp2.py \
  checkpoint.ckpt_dir=/path/to/ckpt_dir \
  checkpoint.resume_from_checkpoint=true
```

A final model suitable for uploading to the Hugging Face Hub can be exported at the end of training by setting
`checkpoint.save_final_model=true`.

## Developer Guide

### Running Tests

To run tests locally inside the devcontainer:

```bash
cd bionemo-recipes/recipes/codonfm_native_te
pytest -v tests/
```

### Hydra Tips

[Hydra](https://hydra.cc/) is used for configuration management. Parameters can be overridden from the command line,
e.g., `python train_fsdp2.py fp8_config.enabled=true`. For verbose logging, use `hydra.verbose=true`.

## License

Refer to [LICENSE](../../LICENSE).