JEPA-DNA

JEPA-DNA implements a genomic foundation model that combines generative pretraining with a joint-embedding predictive objective, so the model learns both local nucleotide patterns and broader functional structure from DNA sequences. Benchmarking uses the GFMBench-API evaluation library (install separately; see GFMBench-API clone and PYTHONPATH below).

For reproducibility, the repository includes model-specific paper reproduction parameter files under jepa_dna/reproduce_params_files; see Reproduce paper runs (model-specific params.json).

Quick Start

Installation

Create a virtual environment (choose one option):

Option A: Using pip (venv)

```
python -m venv jepa_dna_env
source jepa_dna_env/bin/activate
```

Option B: Using conda

```
conda create -n jepa_dna_env python=3.10
conda activate jepa_dna_env
```

Install dependencies:
```
pip install -r requirements.txt
```

(Optional, GPU users) Check CUDA availability:

```
python -c "import torch; print(torch.cuda.is_available())"
```

(Optional, for DNABERT-2 with flash attention):
```
pip install flash-attn --no-build-isolation
```
The default requirements.txt targets workflows that use DNABERT-2 and related stacks. Other encoders may need extra packages; install those separately.

(Optional) If you use Hugging Face caches heavily, set cache locations, for example:

```
export HF_HOME=/path/to/hf_home
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache
```
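
The same two variables can also be set from inside a Python process, as long as this happens before any Hugging Face library is imported. The sketch below is illustrative only: `configure_hf_cache` is a hypothetical helper that is not part of this repository, and the sub-directory names simply mirror the example paths above.

```python
import os
from pathlib import Path


def configure_hf_cache(base_dir):
    """Point the Hugging Face caches at sub-directories of base_dir.

    Hypothetical helper: creates the directories, sets HF_HOME and
    HF_DATASETS_CACHE in this process, and returns the values it set.
    """
    base = Path(base_dir).expanduser()
    settings = {
        "HF_HOME": str(base / "hf_home"),
        "HF_DATASETS_CACHE": str(base / "hf_datasets_cache"),
    }
    for name, value in settings.items():
        Path(value).mkdir(parents=True, exist_ok=True)
        os.environ[name] = value
    return settings
```

Call it at the top of a script, before importing `transformers` or `datasets`, so the libraries pick up the locations when they load.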

GFMBench-API clone and PYTHONPATH

Benchmarks import both this repository and a local GFMBench-API checkout. The evaluation library is not installed as a pip package by the steps in this README, so clone it:

```
git clone https://github.com/NVIDIA/GFMBench-api.git
```

Repository: https://github.com/NVIDIA/GFMBench-api

From the JEPA-DNA repository root, point PYTHONPATH at both repo roots:

```
cd /path/to/JEPA-DNA
export PYTHONPATH=/path/to/JEPA-DNA:/path/to/GFMBench-api
```

Then verify imports:

```
python -c "import jepa_dna; import gfmbench_api"
```
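
If the one-liner fails, it can help to see which of the two packages is not visible on PYTHONPATH. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, and the module names are the two from the command above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["jepa_dna", "gfmbench_api"])
    if missing:
        print("Not importable, check PYTHONPATH:", ", ".join(missing))
    else:
        print("Both packages can be located.")
```

This only checks that the packages can be found; it does not import them, so errors raised inside the packages themselves would still surface only with the real import.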

Pretraining

run_jepa_pretrain.py takes a single argument, training_dir, a directory you choose for a given run. It plays three roles:

- Input: Before starting, the directory must already contain params.json with the training configuration (use jepa_dna/example_params.json as a template).
- Output (logs): While training, the pretrainer writes a timestamped log file in this directory, e.g. training_log_<timestamp>.txt.
- Output (checkpoints): Final (and periodic, per params.json) weights are saved here, including context_encoder_final.pt, target_encoder_final.pt, and predictor_final.pt (exact filenames depend on objectives). The script may also create auxiliary folders such as git_reproduce/ for reproducibility metadata.

So training_dir is both the config location and the run’s log/checkpoint root; use a dedicated path per experiment so outputs do not overwrite each other.

```
python ./jepa_dna/run_jepa_pretrain.py /path/to/training_dir
```
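
Because training_dir is both the config location and the output root, a short pre-flight check can catch a missing or malformed params.json, or a reused directory, before a long run starts. This is a sketch under stated assumptions: `check_training_dir` is a hypothetical helper that is not part of the repository, and it assumes checkpoints use the `.pt` suffix seen in the filenames listed above.

```python
import json
from pathlib import Path


def check_training_dir(training_dir):
    """Validate a training directory before launching pretraining.

    Returns (config, existing_checkpoints). Raises FileNotFoundError if
    params.json is absent and json.JSONDecodeError if it is malformed.
    """
    run_dir = Path(training_dir)
    params_path = run_dir / "params.json"
    if not params_path.is_file():
        raise FileNotFoundError(f"Missing {params_path}; copy a template there first.")
    with params_path.open() as handle:
        config = json.load(handle)
    # Assumption: checkpoints end in .pt. Any that already exist suggest the
    # directory was used by an earlier run and may be overwritten.
    existing_checkpoints = sorted(p.name for p in run_dir.glob("*.pt"))
    return config, existing_checkpoints
```

A non-empty second return value is a hint to pick a fresh directory rather than an error in itself.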

Required path update (Pretraining stage):

1. Open jepa_dna/run_jepa_pretrain.py.
2. Locate root_data_dir = "/data/sense/common/data/".
3. Replace it with the absolute path to your dataset root.

This root is used for the pretraining datasets (for example, pretrain/ under that directory) and checkpoint-time evaluation tasks invoked during training.

To restrict GPUs:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./jepa_dna/run_jepa_pretrain.py /path/to/training_dir
```
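
When runs are launched from a Python script rather than a shell, the same restriction can be applied by passing a modified environment to the child process. The helper names below (`gpu_env`, `launch_pretraining`) are hypothetical and only illustrate the idea; the script path and its single argument are the ones shown above.

```python
import os
import subprocess
import sys


def gpu_env(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def launch_pretraining(training_dir, gpu_ids):
    """Run the pretraining script on the given GPUs and wait for it to exit."""
    command = [sys.executable, "./jepa_dna/run_jepa_pretrain.py", str(training_dir)]
    # check=True raises CalledProcessError if the script exits with an error.
    return subprocess.run(command, env=gpu_env(gpu_ids), check=True)
```

Copying the environment instead of editing `os.environ` keeps the restriction local to the launched run.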

Reproduce paper runs (model-specific `params.json`)

Pre-configured parameter files for paper reproduction are provided in:

jepa_dna/reproduce_params_files/params_dnabert2.json
jepa_dna/reproduce_params_files/params_ntv3.json
jepa_dna/reproduce_params_files/params_hyenadna.json

For each run, create a fresh experiment directory, copy the relevant file, rename it to params.json, then launch pretraining.

Example (DNABERT-2):

```
EXP_DIR=/path/to/experiments/dnabert2_repro
mkdir -p "$EXP_DIR"
cp ./jepa_dna/reproduce_params_files/params_dnabert2.json "$EXP_DIR/params.json"
python ./jepa_dna/run_jepa_pretrain.py "$EXP_DIR"
```

Example (NTv3):

```
EXP_DIR=/path/to/experiments/ntv3_repro
mkdir -p "$EXP_DIR"
cp ./jepa_dna/reproduce_params_files/params_ntv3.json "$EXP_DIR/params.json"
python ./jepa_dna/run_jepa_pretrain.py "$EXP_DIR"
```

Example (HyenaDNA):

```
EXP_DIR=/path/to/experiments/hyenadna_repro
mkdir -p "$EXP_DIR"
cp ./jepa_dna/reproduce_params_files/params_hyenadna.json "$EXP_DIR/params.json"
python ./jepa_dna/run_jepa_pretrain.py "$EXP_DIR"
```

Notes:

- Keep one params.json per experiment directory to avoid accidental overwrites.
- The run writes logs/checkpoints into the same EXP_DIR.
- If needed, set CUDA_VISIBLE_DEVICES as shown above before the python command.
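
The copy-and-rename step is the same for all three models, so it can be scripted. The sketch below stages one experiment directory per model and refuses to overwrite an existing params.json. `stage_experiment` and the `<model>_repro` directory naming are illustrative choices that follow the examples above, not something the repository provides.

```python
import shutil
from pathlib import Path

# Paths as listed in this README, relative to the repository root.
REPRO_DIR = Path("jepa_dna/reproduce_params_files")
PARAMS_FILES = {
    "dnabert2": "params_dnabert2.json",
    "ntv3": "params_ntv3.json",
    "hyenadna": "params_hyenadna.json",
}


def stage_experiment(model_key, experiments_root, repro_dir=REPRO_DIR):
    """Create <experiments_root>/<model_key>_repro containing params.json."""
    exp_dir = Path(experiments_root) / f"{model_key}_repro"
    target = exp_dir / "params.json"
    if target.exists():
        # One params.json per experiment directory: never overwrite.
        raise FileExistsError(f"{target} already exists; use a fresh directory.")
    exp_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(Path(repro_dir) / PARAMS_FILES[model_key], target)
    return exp_dir
```

The returned directory is what gets passed to run_jepa_pretrain.py as its single argument.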

GFMBench-API evaluation

Run GFMBench tasks through the JEPA-DNA benchmark driver. Example:

```
python ./jepa_dna/run_benchmark.py \
  --csv_path ./logs/benchmark_report.csv \
  --report_algo_name my_run \
  --model DNABERT2 \
  --checkpoint_path /path/to/optional_checkpoint.pt \
  --epochs 3 \
  --linear_prob \
  --seed 0
```

Required path update (Benchmark stage):

1. Open jepa_dna/run_benchmark.py.
2. Locate root_data_dir_path = "/data/sense/common/data".
3. Replace it with the absolute path to your GFMBench data root.

Benchmark tasks read datasets relative to this root, so using the wrong path will cause task loading failures.

Flags (see parse_args in jepa_dna/run_benchmark.py):

| Flag | Role |
| --- | --- |
| `--csv_path` | CSV path for the benchmark report (loads if present, saves here) |
| `--report_algo_name` | Label for this model in the report |
| `--model` | One of: `DNABERT2`, `DNABERT`, `NTv3_8M`, `NTv3_100M`, `Caduceus`, `HyenaDNA` |
| `--checkpoint_path` | Optional checkpoint; omit for default Hugging Face weights |
| `--epochs` | Fine-tuning epochs for supervised tasks |
| `--linear_prob` | Train only the head (linear probing) instead of full fine-tuning |
| `--seed` | Random seed for reproducibility |
| `--disable_safe_model_call` | Let model errors propagate (debugging) |
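
To repeat a benchmark over several seeds or models, the command line can be assembled programmatically from the flags in the table. This is a minimal sketch: `benchmark_command` is a hypothetical helper, it uses only the flags documented above, and it assumes `--linear_prob` is a valueless switch as in the example invocation.

```python
import shlex
import sys


def benchmark_command(model, seed, csv_path, report_algo_name,
                      checkpoint_path=None, epochs=3, linear_prob=False):
    """Build an argv list for run_benchmark.py from the documented flags."""
    command = [
        sys.executable, "./jepa_dna/run_benchmark.py",
        "--csv_path", str(csv_path),
        "--report_algo_name", report_algo_name,
        "--model", model,
        "--epochs", str(epochs),
        "--seed", str(seed),
    ]
    if checkpoint_path is not None:
        # Omitting the flag falls back to the default Hugging Face weights.
        command += ["--checkpoint_path", str(checkpoint_path)]
    if linear_prob:
        command.append("--linear_prob")
    return command


if __name__ == "__main__":
    # Print one command per seed; pass each list to subprocess.run to execute.
    for seed in (0, 1, 2):
        argv = benchmark_command("DNABERT2", seed, "./logs/benchmark_report.csv",
                                 f"my_run_seed{seed}", linear_prob=True)
        print(shlex.join(argv))
```

Giving each seed its own `--report_algo_name` keeps the runs distinguishable in a shared CSV report; whether that is the intended way to aggregate seeds is an assumption here.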

Repository layout

```
JEPA-DNA/
├── jepa_dna/
│   ├── run_jepa_pretrain.py    # JEPA pretraining entrypoint
│   ├── run_benchmark.py        # Benchmark driver (GFMBench + JEPA models)
│   ├── example_params.json     # Example training configuration
│   ├── reproduce_params_files/ # Model-specific params files for paper reproduction
│   ├── train/                  # Pretraining and fine-tuning
│   ├── models/                 # Encoders and baselines
│   └── jepa_data/              # Data utilities
├── requirements.txt
├── LICENSE
├── SECURITY.md
└── THIRD_PARTY_NOTICES.md
```

NOTICE

This project will download and install additional third-party open source software projects, including datasets licensed under non-commercial terms. Review the license terms of these open source projects before use.

Third-party components and attributions: see THIRD_PARTY_NOTICES.md (Python dependencies, Hugging Face assets, reference genomes, and other data URLs). Example: LicenseRef-UCSC-Genome-Browser — https://genome.ucsc.edu/license/

License

NVIDIA-authored code in this repository is licensed under the Apache License, Version 2.0. The full text is in LICENSE.

Each contributed .py file carries an NVIDIA SPDX-FileCopyrightText line, SPDX-License-Identifier: Apache-2.0, and the short Apache-2.0 notice (through “limitations under the License”), followed by third-party URL notices scoped to that file or, where none apply, a line stating that the module does not embed third-party data download URLs. Python package attributions remain in THIRD_PARTY_NOTICES.md.

Citation

```
@article{larey2026jepa,
  title={{JEPA-DNA}: Grounding genomic foundation models through joint-embedding predictive architectures},
  author={Ariel Larey and Elay Dahan and Amit Bleiweiss and Raizy Kellerman and Guy Leib and Omri Nayshool and Dan Ofer and Tal Zinger and Dan Dominissini and Gideon Rechavi and Nicole Bussola and Simon Lee and Shane O’Connell and Dung Hoang and Marissa Wirth and Alexander W. Charney and Yoli Shavit and Nati Daniel},
  journal={arXiv preprint arXiv:2602.17162},
  year={2026}
}
```

For an open PDF version of the work, see https://arxiv.org/pdf/2602.17162.

Security

For reporting security vulnerabilities, see SECURITY.md.
