# TransformerEngine-accelerated CodonFM training with native PyTorch training loop

This folder demonstrates how to train TE-accelerated
[CodonFM](https://research.nvidia.com/labs/dbr/assets/data/manuscripts/nv-codonfm-preprint.pdf) with a native PyTorch
training loop, including sequence packing, FP8/NVFP4 precision with layer-wise control, and fully sharded data
parallel (FSDP2) for distributed training.

CodonFM is a suite of foundation models trained directly on codon sequences to learn contextual codon representations
and enable downstream codon-aware tasks. This recipe uses the "non-exact" TransformerEngine implementation, which
employs TE's standard `TransformerLayer` rather than a custom reproduction of the original research architecture.
Despite the slight architectural difference, this variant converges on par with the original. For the original PyTorch
Lightning-based recipe (with both "exact" and "non-exact" TE modes), see [codonfm_ptl_te](../codonfm_ptl_te/).

## How to use this recipe

This folder contains an independent, minimal training example. It does not depend on any other code in the top-level
bionemo-framework repository.

## Supported Training Features

| Feature                | Status        |
| ---------------------- | ------------- |
| BF16                   | Supported     |
| FP8 (DelayedScaling)   | Supported [1] |
| NVFP4 (BlockScaling)   | Supported [2] |
| Layer-wise precision   | Supported     |
| THD sequence packing   | Supported     |
| FSDP2                  | Supported     |
| Checkpoint save/resume | Supported     |

\[1\]: Requires [compute capability](https://developer.nvidia.com/cuda-gpus) 9.0 and above (Hopper+) <br/>
\[2\]: Requires [compute capability](https://developer.nvidia.com/cuda-gpus) 10.0 and above (Blackwell+) <br/>
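
As a minimal sketch of those capability gates (in practice the capability comes from
`torch.cuda.get_device_capability()`; this helper is illustrative and not part of the recipe):

```python
def precision_support(major, minor):
    """Return which quantized formats a GPU of the given compute capability supports."""
    cc = (major, minor)
    # FP8 needs Hopper+ (CC 9.0); NVFP4 needs Blackwell+ (CC 10.0).
    return {"fp8": cc >= (9, 0), "fp4": cc >= (10, 0)}

# H100 (CC 9.0) supports FP8 but not NVFP4; B200 (CC 10.0) supports both.
```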

## Pre-Trained Models

The HuggingFace-compatible model definition lives in
[`bionemo-recipes/models/codonfm/`](../../models/codonfm/). Pre-trained checkpoints for the "exact" TE architecture are
available on the Hugging Face Hub (see the [codonfm_ptl_te README](../codonfm_ptl_te/README.md#pre-trained-models) for
links). A `native_te` checkpoint trained with this recipe will be uploaded in the future.

## Performance Benchmarks

Under development. For TE acceleration benchmarks comparing against the original Xformers-based implementation, see the
[codonfm_ptl_te benchmarks](../codonfm_ptl_te/README.md#nvidia-transformerengine-optimization-benchmarks).

## Repository Structure

```
codonfm_native_te/
├── modeling_codonfm_te.py — HF-compatible CodonFM model with TE layers
├── train_fsdp2.py — FSDP2 training script
├── dataset.py — data loading and collation (BSHD + THD)
├── tokenizer.py — codon tokenizer
├── checkpoint.py — checkpoint save/load utilities
├── perf_logger.py — performance and metrics logging
├── quantization.py — FP8/FP4 quantization utilities
├── scheduler.py — learning rate scheduler
├── distributed_config.py — distributed training configuration
├── hydra_config/ — Hydra configuration files
│   ├── defaults.yaml — default training configuration
│   └── L0_sanity.yaml — quick sanity check configuration
├── train.parquet — sample training data
├── requirements.txt — Python dependencies
└── tests/ — unit and integration tests
```

## Installing Dependencies

The easiest way to get started is to use the devcontainer provided in the top-level repository. Alternatively, install
dependencies manually in an environment with CUDA support:

```bash
pip install -r requirements.txt
```

## Commands to Launch Training

To run single-process training on one GPU:

```bash
python train_fsdp2.py
```

To run multi-process training locally on 2+ GPUs:

```bash
torchrun --nproc_per_node=2 train_fsdp2.py
```

The default configuration (`L0_sanity.yaml`) runs a quick sanity check with 250 steps using the included sample data.
For real training, create a custom Hydra config or override parameters from the command line.

### Quantized Training (FP8 / NVFP4)

To run training with FP8:

```bash
python train_fsdp2.py fp8_config.enabled=true
```

To train with NVFP4 quantization:

```bash
python train_fsdp2.py fp4_config.enabled=true
```

Additional recipe parameters (e.g., switching to `MXFP8BlockScaling`) can be set via the Hydra configuration.
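
As a hypothetical sketch of such an override (the field names below are assumptions, not verified against this
recipe's schema — check `hydra_config/defaults.yaml` for the actual keys):

```yaml
# Hypothetical Hydra override — key names are assumptions; see
# hydra_config/defaults.yaml for the real schema.
fp8_config:
  enabled: true
  recipe: MXFP8BlockScaling  # assumed field for selecting the TE recipe class
```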

### Layer-Wise Precision

You can control which transformer layers use FP8 or FP4 by specifying 1-indexed layer numbers via `fp8_layers` and
`fp4_layers`. Layers not assigned to either format will run in BF16.

For example, to run layers 1-3 in FP8, layers 4-6 in FP4, and the rest in BF16:

```bash
python train_fsdp2.py \
  fp8_config.enabled=true \
  fp4_config.enabled=true \
  'fp8_layers=[1,2,3]' \
  'fp4_layers=[4,5,6]'
```

When both `fp8_config` and `fp4_config` are enabled but only one layer list is provided, the other format automatically
claims the remaining layers.

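A pure-Python sketch of that assignment rule (the actual logic lives in `quantization.py`; this helper and its
signature are illustrative, not the recipe's API):

```python
def assign_precisions(num_layers, fp8_enabled, fp4_enabled, fp8_layers=None, fp4_layers=None):
    """Map 1-indexed layer numbers to "fp8", "fp4", or "bf16" (illustrative only)."""
    fp8 = set(fp8_layers or [])
    fp4 = set(fp4_layers or [])
    all_layers = set(range(1, num_layers + 1))
    if fp8_enabled and fp4_enabled:
        # When only one list is given, the other format claims the rest.
        if fp8 and not fp4:
            fp4 = all_layers - fp8
        elif fp4 and not fp8:
            fp8 = all_layers - fp4
    return {
        i: "fp8" if i in fp8 else "fp4" if i in fp4 else "bf16"
        for i in sorted(all_layers)
    }

# Layers 1-3 in FP8, 4-6 in FP4, and the rest (7-8) in BF16:
layout = assign_precisions(8, True, True, fp8_layers=[1, 2, 3], fp4_layers=[4, 5, 6])
```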
### Sequence Packing (THD input format)

Enable sequence packing with:

```bash
python train_fsdp2.py use_sequence_packing=true
```
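
The idea behind THD packing: variable-length sequences are concatenated into one token stream, with cumulative
sequence lengths (`cu_seqlens`) marking the attention boundaries. A minimal sketch (function name is illustrative,
not this recipe's collator API in `dataset.py`):

```python
def pack_sequences(sequences):
    """Concatenate token sequences into one stream and record boundary offsets."""
    packed = []
    cu_seqlens = [0]
    for seq in sequences:
        packed.extend(seq)
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return packed, cu_seqlens

tokens, cu_seqlens = pack_sequences([[5, 6, 7], [8, 9], [10, 11, 12, 13]])
# tokens     -> one flat stream of 9 tokens, no padding
# cu_seqlens -> [0, 3, 5, 9], the boundaries of the three sequences
```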

### FP8 and Sequence Packing

To combine FP8 training with sequence packing:

```bash
python train_fsdp2.py fp8_config.enabled=true use_sequence_packing=true
```

### Quantization Stats Debugging

To enable quantization statistics logging:

```bash
python train_fsdp2.py \
  quant_stats_config.enabled=true \
  quant_stats_config.quant_log_dir=./logs/quant_stats \
  quant_stats_config.quant_stats_file=./fp8_debugging_stats.yaml \
  fp8_config.enabled=true
```

The structure of the config file [fp8_debugging_stats.yaml](fp8_debugging_stats.yaml) is explained in the
[NVIDIA Transformer Engine config file documentation](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/debug/2_config_file_structure.html).

## Saving and Loading Checkpoints

To enable checkpoint saving, ensure that `checkpoint.ckpt_dir` is set to a writable directory:

```bash
python train_fsdp2.py \
  checkpoint.ckpt_dir=/path/to/ckpt_dir \
  checkpoint.save_every_n_steps=100
```

To resume from the latest checkpoint:

```bash
python train_fsdp2.py \
  checkpoint.ckpt_dir=/path/to/ckpt_dir \
  checkpoint.resume_from_checkpoint=true
```

A final model suitable for uploading to the Hugging Face Hub can be exported at the end of training by setting
`checkpoint.save_final_model=true`.

## Developer Guide

### Running Tests

To run tests locally inside the devcontainer:

```bash
cd bionemo-recipes/recipes/codonfm_native_te
pytest -v tests/
```

### Hydra Tips

[Hydra](https://hydra.cc/) is used for configuration management. Parameters can be overridden from the command line,
e.g., `python train_fsdp2.py fp8_config.enabled=true`. For verbose logging, use `hydra.verbose=true`.

## License

Refer to [LICENSE](../../LICENSE).