NovaMod

Project background

NovaMod is a research codebase for unsupervised detection of DNA/RNA modifications from Oxford Nanopore sequencing data.

Nanopore signals (ionic current traces) are sensitive to chemical modifications such as DNA 5mC and RNA m6A, but many existing methods are supervised and depend on curated labels. In practice, label generation is expensive and model performance can shift across sequencing chemistry, basecaller versions, and experimental domains.

This project treats modification discovery as an anomaly detection task: learn a generative model of canonical (unmodified) signal, then score deviations as candidate modified events.
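
The anomaly-detection framing can be illustrated with a minimal sketch. It assumes per-event anomaly scores are already available for a canonical control sample and for a test sample. The helper names, the quantile cutoff, and the toy scores are illustrative assumptions, not code or values from this repository.

```python
import math


def calibrate_threshold(canonical_scores, quantile=0.99):
    """Return the score at the given quantile of canonical (unmodified) scores.

    Hypothetical helper: nearest-rank quantile on a sorted copy of the scores.
    """
    if not canonical_scores:
        raise ValueError("canonical_scores must not be empty")
    if not 0.0 < quantile <= 1.0:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(canonical_scores)
    rank = math.ceil(quantile * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def flag_candidates(test_scores, threshold):
    """Return indices of test events whose score exceeds the threshold."""
    return [i for i, score in enumerate(test_scores) if score > threshold]


# Toy scores only; real scores would come from the trained model.
canonical = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10, 0.12, 0.11, 0.14, 0.10]
test = [0.11, 0.35, 0.12, 0.50]

threshold = calibrate_threshold(canonical, quantile=0.9)
candidates = flag_candidates(test, threshold)
print(threshold, candidates)
```

Calibrating the cutoff on canonical data alone keeps the procedure label-free: no modified examples are needed to decide what counts as a deviation.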

Methods summary

The implemented workflow uses a CNN–Transformer variational autoencoder (VAE) and a config-driven streaming training/validation pipeline:

- Train on unmodified proxy data (e.g., whole-genome amplified (WGA) DNA or in vitro transcribed (IVT) RNA) to learn a reference distribution of canonical signal patterns.
- Score per-instance anomalies using reconstruction-based error metrics from the trained VAE.
- Aggregate read-level evidence to site level for downstream ranking, enrichment analysis, and candidate prioritization.
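
The scoring and aggregation steps above can be sketched as follows. This is a simplified illustration, not the repository's implementation: it assumes mean squared error as the reconstruction metric and the median as the site-level aggregate, and the function names and site keys are hypothetical.

```python
from collections import defaultdict
from statistics import median


def reconstruction_error(signal, reconstruction):
    """Mean squared error between an observed signal window and its reconstruction."""
    if len(signal) != len(reconstruction) or not signal:
        raise ValueError("signal and reconstruction must be non-empty and equal length")
    return sum((x - y) ** 2 for x, y in zip(signal, reconstruction)) / len(signal)


def aggregate_to_sites(read_scores):
    """Collapse (site, score) pairs from individual reads into one median score per site."""
    by_site = defaultdict(list)
    for site, score in read_scores:
        by_site[site].append(score)
    return {site: median(scores) for site, scores in by_site.items()}


def rank_sites(site_scores):
    """Order sites from highest to lowest aggregated score."""
    return sorted(site_scores, key=site_scores.get, reverse=True)


# Toy windows: (hypothetical site id, observed signal, model reconstruction).
windows = [
    (("chr1", 100), [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    (("chr1", 100), [1.0, 2.0, 3.0], [1.0, 2.0, 4.0]),
    (("chr1", 200), [0.0, 0.0, 0.0], [3.0, 0.0, 0.0]),
]
read_scores = [(site, reconstruction_error(obs, rec)) for site, obs, rec in windows]
site_scores = aggregate_to_sites(read_scores)
ranking = rank_sites(site_scores)
print(ranking)
```

A median is used here because it is less sensitive to a few noisy reads than a mean; other aggregates could be substituted without changing the overall flow.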

This is intended as a label-light discovery framework for candidate nomination and regional pattern analysis, rather than a fully supervised end-to-end caller.

Data and code availability

This repository contains the code, configurations, and model artifacts used to reproduce the manuscript’s training and evaluation workflows.

Repository structure

training/
- train.py — main training entrypoint (config-driven).
- val.py — validation/evaluation and anomaly-scoring entrypoint (config-driven).
- train.sh, val.sh — batch wrappers for cluster execution.
- dataset_utils.py, feature_utils.py, bam_utils.py — data loading, feature extraction, and BAM/signal processing utilities.
- data_manifest.csv — dataset bookkeeping used by training/evaluation workflows.
- configs/ — JSON experiment configurations for training and validation runs.
- models/ — model definitions (including CNN–Transformer VAE implementation).
- state_dicts/ — saved model checkpoints and related exported artifacts.
- evaluation.ipynb — analysis notebook.
scripts/
- preprocessing and workflow scripts for basecalling, alignment, feature generation, and utility conversions.
paper/
- manuscript sources and figures.

Where to find models and scripts

Model code: training/models/
Trained checkpoints/artifacts: training/state_dicts/
Runnable training/evaluation scripts: training/train.py, training/val.py, training/train.sh, training/val.sh
Data-preparation and pipeline helpers: scripts/

Reproducibility quick start

From the repository root:

cd training
python train.py --config configs/train.example.json
python val.py --config configs/val.example.json

Cluster examples:

cd training
sbatch train.sh configs/train.example.json
sbatch val.sh configs/val.example.json

Notes

JSON files in training/configs/ define dataset paths, model settings, and run parameters for each experiment variant.
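
As a rough illustration of a config-driven entrypoint, the sketch below parses a `--config` argument and loads the JSON file it points to. Only the `--config` flag comes from the quick start commands. The section names (`data`, `model`, `run`) and every key inside them are hypothetical placeholders; the real schema is defined by the files in training/configs/.

```python
import argparse
import json
import os
import tempfile


def parse_args(argv=None):
    """Parse the command line; only --config is assumed here."""
    parser = argparse.ArgumentParser(description="Config-driven run (illustrative).")
    parser.add_argument("--config", required=True, help="Path to a JSON config file.")
    return parser.parse_args(argv)


def load_config(path):
    """Read a JSON config file and check that it holds a top-level object."""
    with open(path, "r", encoding="utf-8") as handle:
        config = json.load(handle)
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    return config


# Hypothetical config; section and key names are placeholders for illustration.
example = {
    "data": {"manifest": "data_manifest.csv"},
    "model": {"latent_dim": 16},
    "run": {"epochs": 1},
}

# Write the example to a temporary file, then load it the way an entrypoint might.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "example.json")
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump(example, handle)
    args = parse_args(["--config", config_path])
    loaded = load_config(args.config)

print(loaded["model"])
```

Keeping all run parameters in one file means an experiment variant can be reproduced by re-running the same command with the same JSON.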
