Multi-omics sequencing technologies can jointly measure the transcriptome and chromatin accessibility at single-cell resolution. This enables inference of gene regulatory networks (GRNs) at the cellular level, thereby elucidating high-resolution differential GRNs associated with diseases. However, existing methods lack interpretability and scalability. We present single-cell Gene Regulatory Inference with Prior knowledge (scGRIP). First, we treat transcription factors (TFs), target genes (TGs), and regulatory elements (REs) as nodes and their potential TF-RE and RE-TG interactions as edges, using a prior cis-regulatory knowledge graph. Second, we tokenize the single-cell chromatin accessibility and gene expression with a shared codebook to compute cell-specific node embeddings. Third, we incorporate a GraphSHAP technique to infer GRN edge attributions at the single-cell level. We benchmarked scGRIP against state-of-the-art methods, including LINGER and scGLUE, across multiple independent datasets. Our results demonstrate that scGRIP consistently outperforms existing approaches at three levels of inference: cell-specific, cell-type-specific, and condition-specific GRNs.
The code expects paired multiome data in AnnData/.h5ad format.
- RNA counts: RNA_count.h5ad
- ATAC counts: ATAC_count.h5ad
- Cell labels in .obs, typically cell_type
- Peak coordinates in atac.var, or peak names formatted like chr:start-end
- Gene genomic coordinates in rna.var
- cisTarget motif resources for TF-RE prior construction
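When peak coordinates are only available through peak names, they have to be recovered from the chr:start-end string. The following is a minimal sketch of such a parser, not the repository's own code: the `Peak` type and `parse_peak_name` function are hypothetical names, and the sketch assumes the name contains exactly one colon and one hyphen in that order.

```python
import re
from typing import NamedTuple


class Peak(NamedTuple):
    """Genomic interval recovered from a peak name (hypothetical helper type)."""
    chrom: str
    start: int
    end: int


# Assumed name format: "chr:start-end", e.g. "chr1:1000-1500".
_PEAK_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_peak_name(name: str) -> Peak:
    """Split a peak name of the form chr:start-end into its three fields."""
    match = _PEAK_RE.match(name.strip())
    if match is None:
        raise ValueError(f"peak name is not in chr:start-end form: {name!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"peak start exceeds end: {name!r}")
    return Peak(match["chrom"], start, end)
```

Failing loudly on names that do not match is a deliberate choice here: a silently mis-parsed coordinate would otherwise propagate into the gene-RE links built later.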
pip install -r requirements.txt
preprocess/process_re_tg.py creates a gene-to-regulatory-element matrix using genomic proximity and optional correlation/GBM logic.
Key options:
- --flag nearby|correlation|nearby+correlation|gbm|both
- --distance
- --distance_str
- --top_n_genes
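To make the proximity idea concrete, here is a small sketch of distance-based gene-RE linking. It is an illustration under assumptions, not the logic of process_re_tg.py: it assumes each gene is represented by a single TSS position, each peak by its midpoint, and that the distance cutoff is inclusive. The function name `link_nearby` and its arguments are hypothetical.

```python
from typing import Dict, List, Tuple


def link_nearby(
    gene_tss: Dict[str, Tuple[str, int]],
    peaks: List[Tuple[str, int, int]],
    distance: int,
) -> List[Tuple[str, int]]:
    """Return (gene, peak_index) pairs for peaks close to a gene.

    gene_tss maps gene name -> (chromosome, TSS position).
    peaks is a list of (chromosome, start, end) intervals.
    A peak is linked when its midpoint lies within `distance` bp of the
    TSS on the same chromosome (assumed rule, for illustration only).
    """
    links = []
    for gene, (gene_chrom, tss) in gene_tss.items():
        for idx, (peak_chrom, start, end) in enumerate(peaks):
            if peak_chrom != gene_chrom:
                continue
            midpoint = (start + end) // 2
            if abs(midpoint - tss) <= distance:
                links.append((gene, idx))
    return links


genes = {"G1": ("chr1", 1000)}
peaks = [("chr1", 900, 1100), ("chr1", 5000, 5200), ("chr2", 900, 1100)]
links = link_nearby(genes, peaks, distance=500)
```

The nested loop is quadratic and only suitable for small inputs; with thousands of genes and hundreds of thousands of peaks, a sorted-interval or binary-search approach per chromosome would be the more practical choice.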
preprocess/process_tf_re.py combines:
- gene-RE links,
- TF-RE links derived from cisTarget motif scores,
- optional gene-gene correlation edges,
into a final sparse GRN adjacency matrix.
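The assembly step can be pictured as merging edge lists into one graph over a shared node index. The sketch below assumes several things the text does not confirm: that edges are directed TF to RE and RE to gene, that the cutoff applies to the motif score with a greater-or-equal rule, and that node names are unique across TFs, genes, and REs. `assemble_grn` is a hypothetical name, and the result is a plain list of (source, target) index pairs rather than the pickled matrix the script writes.

```python
from typing import Dict, Iterable, List, Tuple


def assemble_grn(
    gene_re: Iterable[Tuple[str, str]],
    tf_re_scores: Dict[Tuple[str, str], float],
    threshold: float,
) -> Tuple[Dict[str, int], List[Tuple[int, int]]]:
    """Merge gene-RE and TF-RE links into one sparse edge list.

    gene_re holds (gene, regulatory_element) pairs.
    tf_re_scores maps (tf, regulatory_element) -> motif score.
    Only TF-RE pairs scoring at least `threshold` are kept (assumed rule).
    """
    node_index: Dict[str, int] = {}

    def idx(name: str) -> int:
        # Assign the next free integer id the first time a node is seen.
        return node_index.setdefault(name, len(node_index))

    edges = set()
    for gene, re_name in gene_re:
        edges.add((idx(re_name), idx(gene)))  # RE -> gene
    for (tf, re_name), score in tf_re_scores.items():
        if score >= threshold:
            edges.add((idx(tf), idx(re_name)))  # TF -> RE
    return node_index, sorted(edges)


node_index, edges = assemble_grn(
    gene_re=[("GeneA", "peak1")],
    tf_re_scores={("TF1", "peak1"): 4.2, ("TF2", "peak1"): 1.0},
    threshold=3,
)
```

In this toy input, TF2 falls below the cutoff and therefore never receives a node id, which keeps the index free of isolated nodes.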
preprocess/train_n2v_sparse.py learns structural node embeddings from the GRN adjacency.
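For readers unfamiliar with Node2Vec, the core of the method is a biased second-order random walk over the graph; the walks are then fed to a skip-gram model to obtain embeddings. The sketch below shows only the walk sampling, on an undirected toy graph, using the standard return parameter p and in-out parameter q. How train_n2v_sparse.py implements this is not stated here, so the function name, arguments, and graph representation are all assumptions.

```python
import random
from typing import Dict, List, Optional, Set


def node2vec_walk(
    adj: Dict[int, Set[int]],
    start: int,
    length: int,
    p: float = 1.0,
    q: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Sample one biased random walk of at most `length` nodes."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj.get(cur, ()))
        if not neighbors:
            break  # dead end: stop early
        if len(walk) == 1:
            # First step has no previous node, so sample uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj.get(prev, ()):
                weights.append(1.0)      # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)  # move outward, distance 2 from prev
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
walk = node2vec_walk(toy_adj, start=0, length=10, rng=random.Random(0))
```

A small p encourages the walk to revisit its previous node and stay local, while a small q pushes it outward, so the two parameters trade off between capturing local neighbourhood structure and broader graph roles.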
model/train_gnn_xtrimo.py is the main training script. It:
- loads paired RNA/ATAC data,
- loads the GRN adjacency,
- optionally loads Node2Vec embeddings,
- trains the GNN + topic model,
- evaluates clustering quality,
- saves checkpoints and UMAP/t-SNE plots.
The preprocessing scripts use relative paths internally, so the safest approach is to run them from inside their own directories after arranging the expected input files there.
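If the steps are driven from a Python wrapper rather than a shell, the same effect can be achieved by setting the working directory of each subprocess to the script's own folder. This is a sketch only: `build_step`, `run_step`, and the `repo_root` argument are hypothetical names, and it assumes the scripts are launched with the current Python interpreter.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Tuple


def build_step(repo_root: str, script: str, args: List[str]) -> Tuple[List[str], Path]:
    """Return the command and working directory for one pipeline script.

    `script` is given relative to the repository root, for example
    "preprocess/process_re_tg.py". The command refers to the script by
    file name only, because it runs from the script's own directory.
    """
    script_path = Path(repo_root) / script
    cmd = [sys.executable, script_path.name, *args]
    return cmd, script_path.parent


def run_step(repo_root: str, script: str, args: List[str]) -> None:
    """Run one script from inside its own directory, failing on errors."""
    cmd, cwd = build_step(repo_root, script, args)
    subprocess.run(cmd, cwd=cwd, check=True)


cmd, cwd = build_step(
    "/repo", "preprocess/process_re_tg.py", ["--flag", "nearby"]
)
```

Passing `cwd` per call avoids changing the wrapper's own working directory, so consecutive steps in preprocess/ and model/ do not interfere with each other.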
Build gene-RE links:
cd preprocess
python process_re_tg.py --flag nearby --distance 1000000 --distance_str 1m --top_n_genes 3000
Assemble the GRN:
cd preprocess
python process_tf_re.py --flag nearby --distance 1000000 --distance_str 1m --top_n_genes 3000 --threshold 3
Train Node2Vec:
cd preprocess
python train_n2v_sparse.py --flag nearby --matrix_path ./processed/hvg_only_nearby_with_gene_gene_1m_threshold3_GRN.pkl
Train the model:
cd model
python train_gnn_xtrimo.py \
--rna /path/to/RNA_count.h5ad \
--atac /path/to/ATAC_count.h5ad \
--adj_path /path/to/GRN.pkl \
--node2vec_path /path/to/node_embeddings.pt \
--emb_size 128 \
--gnn_hidden 256 \
--num_topics 100 \
--batch_size 32 \
--epochs 100 \
--cell_type_col cell_type
Depending on the stage, the pipeline writes:
- processed ATAC subsets and sparse graph matrices under preprocess/processed/
- Node2Vec embeddings under preprocess/processed/
- model checkpoints under model/weights/
- UMAP/t-SNE plots under model/plots/umap/
The main Python packages are listed in requirements.txt.
