NVIDIA-Digital-Bio/JEPA-DNA

JEPA-DNA

JEPA-DNA implements a genomic foundation model that combines generative pretraining with a joint-embedding predictive objective, so the model learns both local nucleotide patterns and broader functional structure from DNA sequences. Benchmarking uses the GFMBench-API evaluation library (install separately; see GFMBench-API clone and PYTHONPATH below).

For reproducibility, the repository includes model-specific paper reproduction parameter files under jepa_dna/reproduce_params_files; see Reproduce paper runs (model-specific params.json).

Quick Start

Installation

  1. Create a virtual environment (choose one option):

    Option A: Using pip (venv)

    python -m venv jepa_dna_env
    source jepa_dna_env/bin/activate

    Option B: Using conda

    conda create -n jepa_dna_env python=3.10
    conda activate jepa_dna_env

  2. Install dependencies:

    pip install -r requirements.txt

  3. (Optional, GPU users) Check CUDA availability:

    python -c "import torch; print(torch.cuda.is_available())"

  4. (Optional, for DNABERT-2 with flash attention):

    pip install flash-attn --no-build-isolation

    The default requirements.txt targets workflows that use DNABERT-2 and related stacks. Other encoders may need extra packages, which you install separately.

  5. (Optional) If you use Hugging Face caches heavily, set cache locations, for example:

    export HF_HOME=/path/to/hf_home
    export HF_DATASETS_CACHE=/path/to/hf_datasets_cache
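The same cache locations can also be set from Python, as long as this happens before any Hugging Face library is imported, since those libraries read the variables at import time. A minimal sketch, assuming a hypothetical helper named configure_hf_cache and placeholder paths:

```python
import os
from pathlib import Path


def configure_hf_cache(hf_home, datasets_cache):
    """Set Hugging Face cache variables for this process (hypothetical helper).

    Call this before importing transformers or datasets; both read the
    variables when they are first imported.
    """
    settings = {
        "HF_HOME": str(Path(hf_home).expanduser()),
        "HF_DATASETS_CACHE": str(Path(datasets_cache).expanduser()),
    }
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    # Placeholder paths; replace them with real locations on your machine.
    print(configure_hf_cache("/path/to/hf_home", "/path/to/hf_datasets_cache"))
```

Exporting the variables in the shell, as shown above, has the same effect and also covers child processes started outside Python.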

GFMBench-API clone and PYTHONPATH

Benchmarks import both this repository and a local GFMBench-API checkout. This README does not install the evaluation library as a pip package, so clone it:

git clone https://github.com/NVIDIA/GFMBench-api.git

Repository: https://github.com/NVIDIA/GFMBench-api

From the JEPA-DNA repository root, point PYTHONPATH at both repo roots:

cd /path/to/JEPA-DNA
export PYTHONPATH=/path/to/JEPA-DNA:/path/to/GFMBench-api

Then verify imports:

python -c "import jepa_dna; import gfmbench_api"
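If the one-line check fails, it helps to see which module is missing and what PYTHONPATH currently contains. The sketch below is only an illustration; the helper names missing_modules and pythonpath_entries are not part of either repository.

```python
import importlib.util
import os


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pythonpath_entries():
    """Split PYTHONPATH into its entries, dropping empty strings."""
    raw = os.environ.get("PYTHONPATH", "")
    return [entry for entry in raw.split(os.pathsep) if entry]


if __name__ == "__main__":
    missing = missing_modules(["jepa_dna", "gfmbench_api"])
    if missing:
        print("Not importable:", ", ".join(missing))
        print("PYTHONPATH entries:", pythonpath_entries() or "(none set)")
    else:
        print("jepa_dna and gfmbench_api are both importable")
```

Because find_spec only locates a module without running it, this check separates path problems from errors raised inside the packages during import.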

Pretraining

The pretraining script, jepa_dna/run_jepa_pretrain.py, takes a single argument, training_dir: a directory you choose for a given run. It plays three roles:

  1. Input: Before starting, the directory must already contain params.json with the training configuration (use jepa_dna/example_params.json as a template).
  2. Output — logs: While training, the pretrainer writes a timestamped log file in this directory, e.g. training_log_<timestamp>.txt.
  3. Output — checkpoints: Final (and periodic, per params.json) weights are saved here, including context_encoder_final.pt, target_encoder_final.pt, and predictor_final.pt (exact filenames depend on objectives). The script may also create auxiliary folders such as git_reproduce/ for reproducibility metadata.

So training_dir is both the config location and the run’s log/checkpoint root; use a dedicated path per experiment so outputs do not overwrite each other.

python ./jepa_dna/run_jepa_pretrain.py /path/to/training_dir
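Since one directory holds the configuration, the logs, and the checkpoints, a quick inventory shows what a run has produced so far. The following is a sketch, not part of the repository: summarize_training_dir is a hypothetical helper, and the glob patterns are assumptions based on the file names mentioned above.

```python
from pathlib import Path


def summarize_training_dir(training_dir):
    """Report which expected inputs and outputs exist in a training directory.

    The patterns follow the names given in this README; actual checkpoint
    names may differ depending on the objectives enabled in params.json.
    """
    root = Path(training_dir)
    if not root.is_dir():
        return {"has_params": False, "logs": [], "final_checkpoints": []}
    return {
        "has_params": (root / "params.json").is_file(),
        "logs": sorted(p.name for p in root.glob("training_log_*.txt")),
        "final_checkpoints": sorted(p.name for p in root.glob("*_final.pt")),
    }


if __name__ == "__main__":
    # Placeholder path; point this at a real run directory.
    print(summarize_training_dir("/path/to/training_dir"))
```

Before launching, has_params should be true and both lists empty; non-empty lists suggest the directory already belongs to an earlier run.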

Required path update (Pretraining stage):

  • Open jepa_dna/run_jepa_pretrain.py.
  • Locate root_data_dir = "/data/sense/common/data/".
  • Replace it with the absolute path to your dataset root.

This root is used for the pretraining datasets (for example, pretrain/ under that directory) and for the evaluation tasks that run at checkpoint time during training.
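A wrong root is easier to catch before a long run starts. The sketch below checks a candidate path; check_data_root is a hypothetical helper, and treating pretrain/ as required is an assumption, since the text above names it only as an example.

```python
from pathlib import Path


def check_data_root(root_data_dir, required_subdirs=("pretrain",)):
    """Return a list of problems found with the dataset root (empty if none).

    `required_subdirs` is an assumption for illustration; adjust it to the
    layout your datasets actually use.
    """
    root = Path(root_data_dir)
    if not root.is_absolute():
        return [f"{root} is not an absolute path"]
    if not root.is_dir():
        return [f"{root} does not exist or is not a directory"]
    return [
        f"missing subdirectory: {root / name}"
        for name in required_subdirs
        if not (root / name).is_dir()
    ]


if __name__ == "__main__":
    # Placeholder path; use the value you set for root_data_dir.
    for problem in check_data_root("/path/to/dataset_root"):
        print(problem)
```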

To restrict GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3 python ./jepa_dna/run_jepa_pretrain.py /path/to/training_dir
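When runs are started from a Python wrapper rather than the shell, the same restriction can be passed through the child process environment. This is a sketch under that assumption; gpu_env and launch_pretraining are hypothetical names, and the script path assumes the working directory is the repository root.

```python
import os
import subprocess
import sys


def gpu_env(gpu_ids, base_env=None):
    """Copy an environment and restrict the visible GPUs to the given ids."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def launch_pretraining(training_dir, gpu_ids):
    """Run the pretraining script on the selected GPUs and wait for it."""
    cmd = [sys.executable, "./jepa_dna/run_jepa_pretrain.py", str(training_dir)]
    return subprocess.run(cmd, env=gpu_env(gpu_ids), check=True)
```

For example, launch_pretraining("/path/to/training_dir", [0, 1, 2, 3]) mirrors the shell command above. Setting the variable in the child environment keeps it in place before CUDA is initialized, which is when it takes effect.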

Reproduce paper runs (model-specific params.json)

Pre-configured parameter files for paper reproduction are provided in:

  • jepa_dna/reproduce_params_files/params_dnabert2.json
  • jepa_dna/reproduce_params_files/params_ntv3.json
  • jepa_dna/reproduce_params_files/params_hyenadna.json

For each run, create a fresh experiment directory, copy the relevant file, rename it to params.json, then launch pretraining.

Example (DNABERT-2):

EXP_DIR=/path/to/experiments/dnabert2_repro
mkdir -p "$EXP_DIR"
cp ./jepa_dna/reproduce_params_files/params_dnabert2.json "$EXP_DIR/params.json"
python ./jepa_dna/run_jepa_pretrain.py "$EXP_DIR"

Example (NTv3):

EXP_DIR=/path/to/experiments/ntv3_repro
mkdir -p "$EXP_DIR"
cp ./jepa_dna/reproduce_params_files/params_ntv3.json "$EXP_DIR/params.json"
python ./jepa_dna/run_jepa_pretrain.py "$EXP_DIR"

Example (HyenaDNA):

EXP_DIR=/path/to/experiments/hyenadna_repro
mkdir -p "$EXP_DIR"
cp ./jepa_dna/reproduce_params_files/params_hyenadna.json "$EXP_DIR/params.json"
python ./jepa_dna/run_jepa_pretrain.py "$EXP_DIR"

Notes:

  • Keep one params.json per experiment directory to avoid accidental overwrites.
  • The run writes logs/checkpoints into the same EXP_DIR.
  • If needed, set CUDA_VISIBLE_DEVICES as shown above before the python command.
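The three setups above differ only in the params file and the directory name, so they can be prepared in one pass. The sketch below assumes the same directory names used in the examples; prepare_repro_dirs is a hypothetical helper, not something the repository provides.

```python
import shutil
from pathlib import Path

# Params file names come from the list above; run-directory names are placeholders.
REPRO_PARAMS = {
    "dnabert2_repro": "params_dnabert2.json",
    "ntv3_repro": "params_ntv3.json",
    "hyenadna_repro": "params_hyenadna.json",
}


def prepare_repro_dirs(params_dir, experiments_root):
    """Create one run directory per model and copy its params file in as params.json.

    A directory that already has params.json is skipped, so an existing
    run is never silently reconfigured.
    """
    created = []
    for run_name, params_file in REPRO_PARAMS.items():
        run_dir = Path(experiments_root) / run_name
        target = run_dir / "params.json"
        if target.exists():
            continue
        run_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(Path(params_dir) / params_file, target)
        created.append(run_dir)
    return created
```

Calling prepare_repro_dirs("./jepa_dna/reproduce_params_files", "/path/to/experiments") from the repository root returns the directories it created; each can then be passed to run_jepa_pretrain.py.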

GFMBench-API evaluation

Run GFMBench tasks through the JEPA-DNA benchmark driver. Example:

python ./jepa_dna/run_benchmark.py \
  --csv_path ./logs/benchmark_report.csv \
  --report_algo_name my_run \
  --model DNABERT2 \
  --checkpoint_path /path/to/optional_checkpoint.pt \
  --epochs 3 \
  --linear_prob \
  --seed 0

Required path update (Benchmark stage):

  • Open jepa_dna/run_benchmark.py.
  • Locate root_data_dir_path = "/data/sense/common/data".
  • Replace it with the absolute path to your GFMBench data root.

Benchmark tasks read datasets relative to this root, so using the wrong path will cause task loading failures.

Flags (see parse_args in jepa_dna/run_benchmark.py):

Flag                        Role
--csv_path                  CSV path for the benchmark report (loads if present, saves here)
--report_algo_name          Label for this model in the report
--model                     One of: DNABERT2, DNABERT, NTv3_8M, NTv3_100M, Caduceus, HyenaDNA
--checkpoint_path           Optional checkpoint; omit for default Hugging Face weights
--epochs                    Fine-tuning epochs for supervised tasks
--linear_prob               Train only the head (linear probing) instead of full fine-tuning
--seed                      Random seed for reproducibility
--disable_safe_model_call   Let model errors propagate (debugging)
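To repeat the benchmark over several seeds, the command line can be assembled programmatically from the flags listed above. This is a sketch: benchmark_command is a hypothetical helper, and giving each seed its own report label is an assumption about how you may want rows distinguished in the CSV.

```python
import shlex
import sys


def benchmark_command(model, seed, csv_path, report_name,
                      epochs=3, linear_probe=True, checkpoint_path=None):
    """Build the argument list for one run_benchmark.py invocation."""
    cmd = [
        sys.executable, "./jepa_dna/run_benchmark.py",
        "--csv_path", str(csv_path),
        "--report_algo_name", report_name,
        "--model", model,
        "--epochs", str(epochs),
        "--seed", str(seed),
    ]
    if linear_probe:
        cmd.append("--linear_prob")
    if checkpoint_path is not None:
        cmd += ["--checkpoint_path", str(checkpoint_path)]
    return cmd


if __name__ == "__main__":
    for seed in (0, 1, 2):
        cmd = benchmark_command("DNABERT2", seed,
                                "./logs/benchmark_report.csv", f"my_run_seed{seed}")
        # Printed only; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Building the arguments as a list avoids shell quoting issues when paths contain spaces.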

Repository layout

JEPA-DNA/
├── jepa_dna/
│   ├── run_jepa_pretrain.py    # JEPA pretraining entrypoint
│   ├── run_benchmark.py        # Benchmark driver (GFMBench + JEPA models)
│   ├── example_params.json     # Example training configuration
│   ├── reproduce_params_files/ # Model-specific params files for paper reproduction
│   ├── train/                  # Pretraining and fine-tuning
│   ├── models/                 # Encoders and baselines
│   └── jepa_data/              # Data utilities
├── requirements.txt
├── LICENSE
├── SECURITY.md
└── THIRD_PARTY_NOTICES.md

NOTICE

This project will download and install additional third-party open source software projects, including datasets licensed with non-commercial terms. Review the license terms of these open source projects before use.

Third-party components and attributions: see THIRD_PARTY_NOTICES.md (Python dependencies, Hugging Face assets, reference genomes, and other data URLs). Example: LicenseRef-UCSC-Genome-Browser — https://genome.ucsc.edu/license/

License

NVIDIA-authored code in this repository is licensed under the Apache License, Version 2.0. The full text is in LICENSE.

Each contributed .py file carries an NVIDIA SPDX-FileCopyrightText line, SPDX-License-Identifier: Apache-2.0, and the short Apache-2.0 notice (through “limitations under the License”), followed by third-party URL notices scoped to that file (or a line stating that the module does not embed third-party data download URLs). Python package attributions remain in THIRD_PARTY_NOTICES.md.

Citation

@article{larey2026jepa,
  title={{JEPA-DNA}: Grounding genomic foundation models through joint-embedding predictive architectures},
  author={Ariel Larey and Elay Dahan and Amit Bleiweiss and Raizy Kellerman and Guy Leib and Omri Nayshool and Dan Ofer and Tal Zinger and Dan Dominissini and Gideon Rechavi and Nicole Bussola and Simon Lee and Shane O’Connell and Dung Hoang and Marissa Wirth and Alexander W. Charney and Yoli Shavit and Nati Daniel},
  journal={arXiv preprint arXiv:2602.17162},
  year={2026}
}

For an open PDF version of the work, see https://arxiv.org/pdf/2602.17162.

Security

For reporting security vulnerabilities, see SECURITY.md.
