NVIDIA BioNeMo Framework v2.6.2
Updates & Improvements
- Fixes numerous ESM2 model issues.
- Updated base Docker image to nvidia-pytorch 25.04-py3
Known Issues
- Evo2 generation is broken (i.e. bionemo-evo2/src/bionemo/evo2/run/infer.py). See issue #890. A workaround exists on branch #949, and we are working to fix this issue for the July release.
- There is an NCCL communication issue on certain A100 multi-node environments. In our internal testing, we were not able to reproduce the issue reliably across environments. If end users see the following error, please report it in issue #970:
[rank9]: torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3356, internal error - please report this issue to the NCCL developers, NCCL version 2.26.3
What's Changed
- Release notes v2.6 by @trvachov in #849
- Bump to version 2.6 by @trvachov in #852
- multi-gpu inference. Adds 'batch index' to the resulting prediction by @skothenhill-nv in #854
- pin ngcsdk by @pstjohn in #857
- fix Evo2 training crash - TE commit by @dorotat-nv in #796
- Update EVO2 tests according to Hyena arch changes by @farhadrgh in #798
- fixing the ESM2 checkpointing issue by @polinabinder1 in #842
- add wandb group and model size in Geneformer configs: benchmarks by @dorotat-nv in #859
- Skip d3pm notebook tests on B200 by @nvdreidenbach in #860
- Bump NeMo to use a trunk commit instead of a branch for Evo2 fixes and inference. by @cspades in #861
- remove unused dependencies from bionemo-core by @pstjohn in #862
- Adding a tflops callback to Geneformer by @polinabinder1 in #856
- Polinabinder/file extend by @polinabinder1 in #477
- Geneformer1B updates by @skothenhill-nv in #869
- upgrade pytorch to 25.04 by @balvisio in #866
- Set EXPERIMENTAL_1b_CHECKPOINT to True by default by @jwilber in #840
- Add Tyler to CODEOWNERS for docs by @jwilber in #880
- Fix missing context in ESM2 FT checkpoint by @farhadrgh in #878
- Fix CI issues on main branch. by @cspades in #868
- Update and separate cell type classification benchmark by @skothenhill-nv in #874
- [BIONEMO-1831] Fix the version of scikit-misc to resolve dependency issue by @balvisio in #883
- Jwilber/1413 unify nb locations by @jwilber in #879
- Fixes how num_layers relates to pipeline_model_parallel_size in ESM2 by @gagank1 in #829
- Fix broken links and add banner by @jwilber in #891
- jwilber/amplify automated benchmarks by @jwilber in #875
- Add small section mentioning context extension by @jwilber in #837
- Add prediction_interval in call to infer_model in infer_esm2.py by @gagank1 in #893
- add flag for loading a sanity-sized dataset for AMPLIFY by @pstjohn in #899
- Fix bionemo.llm.lightning.batch_collator in multi-GPU case by @gagank1 in #898
- fixing vulnerabilities: setuptools and tornado by @dorotat-nv in #902
- Add assertion to zeroshot notebook that the AUC is above a threshold by @jstjohn in #905
- Add 2.6.1 release notes by @jwilber in #912
- add chatbot ui to docs by @jwilber in #845
- Fix masked token loss reductions by @skothenhill-nv in #900
- Final documentation edits by @lvojtku in #894
- Turn on chatbot visibility by default by @jwilber in #915
- fix typo in pretrain.md by @pstjohn in #909
- Add esm2 checkpoint export by @pstjohn in #918
- Geneformer gene embeddings calculation. Now limits the changes to bionemo.geneformer only by @jyin-bst in #808
- Remove strict comparison of tensors against golden values in evo2 test by @balvisio in #901
- Remove cache-to and cache-from in devcontainer by @pstjohn in #913
- Disable moco notebook tests to fix CI by @trvachov in #924
- Fix broken image links in cellgene by @jwilber in #923
- Add cli interface for esm2 checkpoint conversion by @pstjohn in #922
- Reduce number of training steps for partial-conv: esm2 by @dorotat-nv in #929
- Add create_tensorboard_logger argument to train_geneformer entrypoint by @nvmvle in #911
- docs: adds an explanation for the trainer.global step oscillations by @jomitchellnv in #930
- Fixing the missing tfevents dir and catching the issue in testing by @jstjohn in #926
- Add option for number of constant steps of learning rate. by @Sohn123 in #907
- move notebook exclusion to pyproject.toml by @dorotat-nv in #936
- [BIONEMO-2042] Install 'bitsandbytes' with cuda backend by @balvisio in #932
- evo2 stop and go test by @yzhang123 in #903
- Update evo2 ModelCheckpoint args by @jwilber in #935
- test stage specific run_pytest* files to standardise how tests are run in CIs by @dorotat-nv in #889
- Cye/fix test pypi publish by @cspades in #947
- Fix Geneformer test_load_data_run_benchmark by @gagank1 in #942
- Fix esm2 token classification metric and loss, add flip benchmark by @yzhang123 in #946
- fix bug that didn't load the head by @yzhang123 in #950
- Polinabinder/scdl version fixes by @polinabinder1 in #948
- expose esm-2 weight decay parameter by @pstjohn in #956
- Update CODEOWNERS by @malcolmgreaves in #951
- Replace shell commands with subprocess.run by @balvisio in #941
- updating esm2 + geneformer to run benchmarks with data from node specific scratch by @dorotat-nv in #957
- Add guard against zero masked tokens in loss reduction class. by @skothenhill-nv in #958
- scdl neighbor update by @camirr-nv in #843
- Fix esm2 finetune loss by @yzhang123 in #959
New Contributors
- @lvojtku made their first contribution in #894
- @jyin-bst made their first contribution in #808
- @nvmvle made their first contribution in #911
- @Sohn123 made their first contribution in #907
Full Changelog: v2.6.1...v2.6.2