JEPA-DNA implements a genomic foundation model that combines generative pretraining with a joint-embedding predictive objective, so the model learns both local nucleotide patterns and broader functional structure from DNA sequences. Benchmarking uses the GFMBench-API evaluation library (install separately; see GFMBench-API clone and PYTHONPATH below).
For reproducibility, the repository includes model-specific paper reproduction parameter files under jepa_dna/reproduce_params_files; see Reproduce paper runs (model-specific params.json).
- Create a virtual environment (choose one option):
Option A: Using pip (venv)
python -m venv jepa_dna_env
source jepa_dna_env/bin/activate
Option B: Using conda
conda create -n jepa_dna_env python=3.10
conda activate jepa_dna_env
- Install dependencies:
pip install -r requirements.txt
- (Optional, GPU users) Check CUDA availability:
python -c "import torch; print(torch.cuda.is_available())"
- (Optional, for DNABERT-2 with flash attention):
pip install flash-attn --no-build-isolation
The default requirements.txt targets workflows that use DNABERT-2 and related stacks. Other encoders may need extra packages; install those separately.
- (Optional) If you use Hugging Face caches heavily, set cache locations, for example:
export HF_HOME=/path/to/hf_home
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache
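The optional checks above can be bundled into one preflight script. The sketch below is an illustration, not part of the repository: environment_report is a hypothetical helper that reports whether torch and flash_attn are importable and what the two Hugging Face cache variables are set to. It uses only the standard library, so it runs even before the dependencies are installed.

```python
import importlib.util
import os


def environment_report(packages=("torch", "flash_attn"),
                       env_vars=("HF_HOME", "HF_DATASETS_CACHE")):
    """Return a dict describing importable packages and cache variables.

    Hypothetical helper for illustration; not part of JEPA-DNA.
    """
    report = {}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    for var in env_vars:
        # None means the variable is unset, so the library default applies.
        report[var] = os.environ.get(var)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

A False entry for flash_attn is expected unless the optional flash attention step was run.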
Benchmarks import both this repository and a local GFMBench-API checkout. Clone the evaluation library (not installed as a pip package from this README):
git clone https://github.com/NVIDIA/GFMBench-api.git
Repository: https://github.com/NVIDIA/GFMBench-api
From the JEPA-DNA repository root, point PYTHONPATH at both repo roots:
cd /path/to/JEPA-DNA
export PYTHONPATH=/path/to/JEPA-DNA:/path/to/GFMBench-api
Then verify imports:
python -c "import jepa_dna; import gfmbench_api"
The first (and only) argument to jepa_dna/run_jepa_pretrain.py is training_dir, a directory you choose for a given run. It plays three roles:
- Input: Before starting, the directory must already contain params.json with the training configuration (use jepa_dna/example_params.json as a template).
- Output (logs): While training, the pretrainer writes a timestamped log file in this directory, e.g. training_log_<timestamp>.txt.
- Output (checkpoints): Final (and periodic, per params.json) weights are saved here, including context_encoder_final.pt, target_encoder_final.pt, and predictor_final.pt (exact filenames depend on objectives). The script may also create auxiliary folders such as git_reproduce/ for reproducibility metadata.
So training_dir is both the config location and the run’s log/checkpoint root; use a dedicated path per experiment so outputs do not overwrite each other.
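Because training_dir must hold params.json before launch, a quick preflight check can catch a missing or malformed config early. The sketch below is illustrative only: check_training_dir is a hypothetical helper, not part of the repository, and the *_final.pt pattern is taken from the checkpoint names listed above.

```python
import json
from pathlib import Path


def check_training_dir(training_dir):
    """Validate a training_dir before launching pretraining.

    Hypothetical helper; the real script may perform its own checks.
    Returns the parsed contents of params.json.
    """
    run_dir = Path(training_dir)
    params_path = run_dir / "params.json"
    if not params_path.is_file():
        raise FileNotFoundError(f"missing {params_path}")
    with params_path.open() as handle:
        params = json.load(handle)  # raises json.JSONDecodeError if malformed
    # Flag leftovers from an earlier run that a new run could overwrite.
    leftovers = sorted(p.name for p in run_dir.glob("*_final.pt"))
    if leftovers:
        print(f"warning: existing checkpoints in {run_dir}: {leftovers}")
    return params
```

Calling check_training_dir("/path/to/training_dir") before the launch command keeps a failed run from starting with an empty or reused directory.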
python ./jepa_dna/run_jepa_pretrain.py /path/to/training_dir
Required path update (Pretraining stage):
- Open jepa_dna/run_jepa_pretrain.py.
- Locate root_data_dir = "/data/sense/common/data/".
- Replace it with the absolute path to your dataset root.
This root is used for the pretraining datasets (for example, pretrain/ under that directory) and checkpoint-time evaluation tasks invoked during training.
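Since a wrong dataset root only surfaces once data loading starts, it can help to verify the path before editing it into the script. This is a minimal sketch under stated assumptions: check_data_root is a hypothetical helper, and pretrain is the only subdirectory name this README mentions, so any other names must come from your own layout.

```python
from pathlib import Path


def check_data_root(root_data_dir, expected_subdirs=("pretrain",)):
    """Return the expected subdirectories that are missing under the root.

    Hypothetical helper. Only "pretrain" is named in this README; other
    subdirectory names depend on your datasets and are not assumed here.
    """
    root = Path(root_data_dir)
    if not root.is_absolute():
        raise ValueError(f"expected an absolute path, got {root_data_dir!r}")
    if not root.is_dir():
        raise NotADirectoryError(f"data root does not exist: {root}")
    return [name for name in expected_subdirs if not (root / name).is_dir()]
```

An empty list means every expected subdirectory was found.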
To restrict GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./jepa_dna/run_jepa_pretrain.py /path/to/training_dir
Pre-configured parameter files for paper reproduction are provided in:
- jepa_dna/reproduce_params_files/params_dnabert2.json
- jepa_dna/reproduce_params_files/params_ntv3.json
- jepa_dna/reproduce_params_files/params_hyenadna.json
For each run, create a fresh experiment directory, copy the relevant file, rename it to params.json, then launch pretraining.
Example (DNABERT-2):
EXP_DIR=/path/to/experiments/dnabert2_repro
mkdir -p "$EXP_DIR"
cp ./jepa_dna/reproduce_params_files/params_dnabert2.json "$EXP_DIR/params.json"
python ./jepa_dna/run_jepa_pretrain.py "$EXP_DIR"
Example (NTv3):
EXP_DIR=/path/to/experiments/ntv3_repro
mkdir -p "$EXP_DIR"
cp ./jepa_dna/reproduce_params_files/params_ntv3.json "$EXP_DIR/params.json"
python ./jepa_dna/run_jepa_pretrain.py "$EXP_DIR"
Example (HyenaDNA):
EXP_DIR=/path/to/experiments/hyenadna_repro
mkdir -p "$EXP_DIR"
cp ./jepa_dna/reproduce_params_files/params_hyenadna.json "$EXP_DIR/params.json"
python ./jepa_dna/run_jepa_pretrain.py "$EXP_DIR"
Notes:
- Keep one params.json per experiment directory to avoid accidental overwrites.
- The run writes logs/checkpoints into the same EXP_DIR.
- If needed, set CUDA_VISIBLE_DEVICES as shown above before the python command.
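The three reproduction examples follow the same pattern, so the setup can be scripted. The sketch below is one possible way to do it, not a tool shipped with the repository: prepare_repro_run and the REPRO_PARAMS keys are hypothetical names, while the file names and script path are the ones listed above. It refuses to reuse a directory that already has a params.json and returns the launch command instead of running it.

```python
import shutil
import sys
from pathlib import Path

# File names as listed in this README; the dictionary keys are labels
# chosen here for illustration.
REPRO_PARAMS = {
    "dnabert2": "params_dnabert2.json",
    "ntv3": "params_ntv3.json",
    "hyenadna": "params_hyenadna.json",
}


def prepare_repro_run(repo_root, model_key, exp_dir):
    """Copy a reproduction params file into a fresh exp_dir.

    Hypothetical helper; it does not start training itself.
    Returns the command (as an argument list) that launches pretraining.
    """
    repo_root = Path(repo_root)
    exp_dir = Path(exp_dir)
    source = (repo_root / "jepa_dna" / "reproduce_params_files"
              / REPRO_PARAMS[model_key])
    if not source.is_file():
        raise FileNotFoundError(f"missing params file: {source}")
    target = exp_dir / "params.json"
    if target.exists():
        # Keep one params.json per experiment directory.
        raise FileExistsError(f"{target} already exists; use a fresh directory")
    exp_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    script = repo_root / "jepa_dna" / "run_jepa_pretrain.py"
    return [sys.executable, str(script), str(exp_dir)]
```

The returned list can be passed to subprocess.run(cmd, check=True) once the dataset root in the script has been updated.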
Run GFMBench tasks through the JEPA-DNA benchmark driver. Example:
python ./jepa_dna/run_benchmark.py \
--csv_path ./logs/benchmark_report.csv \
--report_algo_name my_run \
--model DNABERT2 \
--checkpoint_path /path/to/optional_checkpoint.pt \
--epochs 3 \
--linear_prob \
--seed 0
Required path update (Benchmark stage):
- Open jepa_dna/run_benchmark.py.
- Locate root_data_dir_path = "/data/sense/common/data".
- Replace it with the absolute path to your GFMBench data root.
Benchmark tasks read datasets relative to this root, so using the wrong path will cause task loading failures.
Flags (see parse_args in jepa_dna/run_benchmark.py):
| Flag | Role |
|---|---|
| --csv_path | CSV path for the benchmark report (loads if present, saves here) |
| --report_algo_name | Label for this model in the report |
| --model | One of: DNABERT2, DNABERT, NTv3_8M, NTv3_100M, Caduceus, HyenaDNA |
| --checkpoint_path | Optional checkpoint; omit for default Hugging Face weights |
| --epochs | Fine-tuning epochs for supervised tasks |
| --linear_prob | Train only the head (linear probing) instead of full fine-tuning |
| --seed | Random seed for reproducibility |
| --disable_safe_model_call | Let model errors propagate (debugging) |
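When running the benchmark for several seeds or models, building the command programmatically avoids typos in flag names. The sketch below is a hedged illustration that uses only the flags documented above: build_benchmark_cmd and MODELS are hypothetical names, and giving each seed its own report label is an assumption made here, not a requirement stated by the driver.

```python
import sys

# Model names copied from the --model row of the flags table.
MODELS = ("DNABERT2", "DNABERT", "NTv3_8M", "NTv3_100M", "Caduceus", "HyenaDNA")


def build_benchmark_cmd(csv_path, report_algo_name, model, epochs, seed,
                        checkpoint_path=None, linear_prob=False):
    """Build the argument list for jepa_dna/run_benchmark.py.

    Hypothetical helper; it only assembles flags listed in this README.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    cmd = [
        sys.executable, "./jepa_dna/run_benchmark.py",
        "--csv_path", str(csv_path),
        "--report_algo_name", report_algo_name,
        "--model", model,
        "--epochs", str(epochs),
        "--seed", str(seed),
    ]
    if checkpoint_path is not None:
        # Omitted by default so the default Hugging Face weights are used.
        cmd += ["--checkpoint_path", str(checkpoint_path)]
    if linear_prob:
        cmd.append("--linear_prob")
    return cmd


if __name__ == "__main__":
    for seed in (0, 1, 2):
        cmd = build_benchmark_cmd("./logs/benchmark_report.csv",
                                  f"my_run_seed{seed}", "DNABERT2",
                                  epochs=3, seed=seed, linear_prob=True)
        print(" ".join(cmd))
```

Run from the JEPA-DNA repository root, each printed line matches the shape of the example command and can be passed to subprocess.run as a list.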
JEPA-DNA/
├── jepa_dna/
│ ├── run_jepa_pretrain.py # JEPA pretraining entrypoint
│ ├── run_benchmark.py # Benchmark driver (GFMBench + JEPA models)
│ ├── example_params.json # Example training configuration
│ ├── reproduce_params_files/ # Model-specific params files for paper reproduction
│ ├── train/ # Pretraining and fine-tuning
│ ├── models/ # Encoders and baselines
│ └── jepa_data/ # Data utilities
├── requirements.txt
├── LICENSE
├── SECURITY.md
└── THIRD_PARTY_NOTICES.md
This project will download and install additional third-party open source software projects, including datasets licensed with non-commercial terms. Review the license terms of these open source projects before use.
Third-party components and attributions: see THIRD_PARTY_NOTICES.md (Python dependencies, Hugging Face assets, reference genomes, and other data URLs). Example: LicenseRef-UCSC-Genome-Browser — https://genome.ucsc.edu/license/
NVIDIA-authored code in this repository is licensed under the Apache License, Version 2.0. The full text is in LICENSE.
Each contributed .py file includes NVIDIA SPDX-FileCopyrightText, SPDX-License-Identifier: Apache-2.0, and the short Apache-2.0 notice (through “limitations under the License”), then third-party URL notices scoped to that file (or the line stating that the module does not embed third-party data download URLs). Python package attributions remain in THIRD_PARTY_NOTICES.md. The full Apache-2.0 text is in LICENSE.
@article{larey2026jepa,
title={Jepa-dna: Grounding genomic foundation models through joint-embedding predictive architectures},
author={Ariel Larey and Elay Dahan and Amit Bleiweiss and Raizy Kellerman and Guy Leib and Omri Nayshool and Dan Ofer and Tal Zinger and Dan Dominissini and Gideon Rechavi and Nicole Bussola and Simon Lee and Shane O’Connell and Dung Hoang and Marissa Wirth and Alexander W. Charney and Yoli Shavit and Nati Daniel},
journal={arXiv preprint arXiv:2602.17162},
year={2026}
}
For an open PDF version of the work, see https://arxiv.org/pdf/2602.17162.
For reporting security vulnerabilities, see SECURITY.md.