MetagenomicKG

MetagenomicKG is a novel metagenomics knowledge graph that integrates commonly used taxonomic information (GTDB), functional annotations (KEGG), pathogenicity resources (BV-BRC), and other biomedical knowledge.

The associated preprint can be found at: https://www.biorxiv.org/content/10.1101/2024.03.14.585056v1. Please cite via:

Ma, C., Liu, S., & Koslicki, D. (2024). MetagenomicKG: A Knowledge Graph for Metagenomic Applications. bioRxiv, 2024-03.

About Graph

MetagenomicKG integrates knowledge from 7 data sources: GTDB taxonomy, NCBI taxonomy, KEGG, RTX-KG2, BV-BRC, MicroPhenoDB, and NCBI AMRFinderPlus Prediction. It consists of 14 node types and 33 edge types (see the Node Statistics and Edge Statistics tables below).

Download pre-built MKG

A pre-built version of MetagenomicKG is available: download the MetagenomicKG.zip file from Zenodo.

Neo4j instance

We also host a Neo4j instance of MetagenomicKG at http://mkg.cse.psu.edu:7474/ (Username: neo4j, Password: klabneo4j). If you would like to rebuild the graph or reproduce the use case results reported in our paper, follow the instructions below.

Node Statistics

Node Type	Node Count
Microbe	1,397,747
Disease	95,072
Phenotypic Feature	88,058
KO	26,596
Compound	19,379
Reaction	15,301
Drug	12,461
Glycan	11,222
Enzyme	8,158
AMR	5,438
Drug Group	2,461
Network	1,525
Pathway	569
Module	481
Total	1,707,968

Edge Statistics

Edge Type	Edge Count
genetically associated with	42,134,400
associated with	13,122,437
subclass of	1,601,473
superclass of	1,423,414
physically interacts with	67,805
has participant	192,823
participates in	189,781
related to	109,266
chemically similar to	62,542
biomarker for	47,675
has phenotype	30,825
treats	5,447
close match	13,919
has part	5,313
catalyzes	157
contraindicated for	2,797
same as	3,628
is sequence variant of	2,201
gene product of	2,201
correlated with	1,671
contributes to	260
temporally related to	273
affects	136
causes	209
has input	168
produces	27
has metabolite	25
actively involved in	21
gene associated with condition	–
disrupts	3
derives from	–
disease has basis in	1
located in	2
Total	56,206,647

Virtual Environment Installation

We recommend using a virtual environment to ensure a clean and isolated workspace for reproducibility. This can be accomplished using either Conda or Mamba (a faster alternative to Conda).

Using Conda

To create your Conda environment, follow these steps:

# Clone the MetagenomicKG repository
git clone https://github.com/KoslickiLab/MetagenomicKG.git
cd MetagenomicKG

# Note: the PyTorch version we used might not be compatible with your NVIDIA CUDA version. Check your CUDA version first and, if needed, change the PyTorch version in envs/metagenomickg_env.yml.
conda env create -f envs/metagenomickg_env.yml

# Download required files for the package 'pytaxonkit'
wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz

mkdir -p $HOME/.taxonkit
cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit

# Activate the newly created environment
conda activate metagenomickg_env
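
After running the commands above, it can be useful to confirm that the taxdump files actually landed where pytaxonkit expects them. The following is a minimal sketch, not part of the repository: the helper name missing_taxdump_files is hypothetical, and the file list is taken from the cp command above.

```python
from pathlib import Path

# File names copied by the `cp` command in the installation steps.
REQUIRED_TAXDUMP_FILES = ("names.dmp", "nodes.dmp", "delnodes.dmp", "merged.dmp")


def missing_taxdump_files(taxonkit_dir=None):
    """Return the required taxdump files that are absent from taxonkit_dir.

    Hypothetical helper; defaults to $HOME/.taxonkit as used above.
    """
    if taxonkit_dir is None:
        directory = Path.home() / ".taxonkit"
    else:
        directory = Path(taxonkit_dir)
    return [name for name in REQUIRED_TAXDUMP_FILES
            if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_taxdump_files()
    if missing:
        print("Missing taxdump files:", ", ".join(missing))
    else:
        print("All taxdump files are in place.")
```

An empty list means all four files are present; anything else names the files that still need to be copied.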

Using Mamba

If you prefer Mamba, simply replace conda with mamba in the commands above.

Configuration Setup

MetagenomicKG uses a centralized configuration system through the config.yml file. Before rebuilding MetagenomicKG and replicating use case results, you should configure the system for your environment.

Configuration File Structure

The config.yml file contains several configuration sections:

BUILD_KG_VARIABLES:
  # Required: API keys and data directories
  UMLS_API_KEY: 'your_umls_api_key_here'
  KEGG_FTP_DATA_DIR: '/path/to/your/kegg/data'
  
  # Database configuration
  NODE_SYNONYMIZER_DBNAME: 'node_synonymizer_v1.0_KG2.10.0.sqlite'
  NEO4J_DBNAME: 'MetagenomicsKG'
  
  # Neo4j connection (can be overridden by environment variables)
  NEO4J_BOLT: 'bolt://localhost:7687'
  NEO4J_USERNAME: 'neo4j'
  NEO4J_PASSWORD: 'your_neo4j_password'
  
  # Processing thresholds (can be customized based on your needs)
  ANI_THRESHOLD: 99.5
  AF_THRESHOLD: 0.0
  COVERAGE_THRESHOLD: 80
  IDENTITY_THRESHOLD: 90
  
  # KG file names (customizable for different versions)
  KG_FILES:
    NODES_V1: 'KG_nodes_v1.tsv'
    EDGES_V1: 'KG_edges_v1.tsv'
    # ... additional file versions

Required Configuration Changes

UMLS API Key

Since we utilize the Unified Medical Language System (UMLS) search function via UMLS APIs for identifier mapping, you must first obtain a UMLS API key:

Follow these instructions to obtain an API key
Set UMLS_API_KEY to your API key in config.yml

KEGG FTP Data

MetagenomicKG includes KEGG data downloaded from KEGG FTP. According to KEGG policy, we cannot provide this dataset:

Follow these instructions to obtain the dataset
Set KEGG_FTP_DATA_DIR to the path of your KEGG data directory in config.yml

Neo4j Connection (Optional)

Configure the Neo4j connection parameters in config.yml, or set them as environment variables:

Config file approach: Update NEO4J_BOLT, NEO4J_USERNAME, and NEO4J_PASSWORD in config.yml

Environment variables approach (takes precedence over config file):

export neo4j_bolt="bolt://your-neo4j-server:7687"
export neo4j_username="your_username"
export neo4j_password="your_password"
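
The precedence rule above (environment variables win over config.yml) can be illustrated with a short sketch. This is not the repository's actual loader: the function name resolve_neo4j_settings is hypothetical, the config section is passed in as a plain dict, and treating an empty environment variable as unset is an assumption.

```python
import os

# Environment variable names from this README, mapped to config.yml keys.
ENV_TO_CONFIG_KEY = {
    "neo4j_bolt": "NEO4J_BOLT",
    "neo4j_username": "NEO4J_USERNAME",
    "neo4j_password": "NEO4J_PASSWORD",
}


def resolve_neo4j_settings(build_kg_variables, environ=None):
    """Merge Neo4j settings, letting environment variables take precedence.

    build_kg_variables: dict holding the BUILD_KG_VARIABLES section of config.yml.
    environ: mapping of environment variables (defaults to os.environ).
    Assumption: an empty environment variable counts as unset.
    """
    environ = os.environ if environ is None else environ
    settings = {}
    for env_name, config_key in ENV_TO_CONFIG_KEY.items():
        settings[config_key] = environ.get(env_name) or build_kg_variables.get(config_key)
    return settings


if __name__ == "__main__":
    config_section = {
        "NEO4J_BOLT": "bolt://localhost:7687",
        "NEO4J_USERNAME": "neo4j",
        "NEO4J_PASSWORD": "your_neo4j_password",
    }
    print(resolve_neo4j_settings(config_section))
```

Passing environ explicitly makes the behaviour easy to check without touching the real environment.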

Optional Configuration Customization

Processing Thresholds

You can adjust processing thresholds based on your requirements:

ANI_THRESHOLD: Average Nucleotide Identity threshold for strain identification (default: 99.5)
AF_THRESHOLD: Alignment Fraction threshold for strain identification (default: 0.0)
COVERAGE_THRESHOLD: Coverage threshold for AMR gene selection (default: 80)
IDENTITY_THRESHOLD: Identity threshold for AMR gene selection (default: 90)
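
To make the role of COVERAGE_THRESHOLD and IDENTITY_THRESHOLD concrete, here is a minimal sketch of a threshold filter. It is an illustration only, not the pipeline's code: the AmrHit record, the select_amr_hits function, the example gene names, and the inclusive (>=) comparison on percent values are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class AmrHit:
    """Hypothetical record for one predicted AMR gene hit."""
    gene: str
    coverage: float  # percent of the reference sequence covered
    identity: float  # percent identity to the reference sequence


def select_amr_hits(hits, coverage_threshold=80, identity_threshold=90):
    """Keep hits that meet both thresholds (inclusive comparison is an assumption)."""
    return [
        hit for hit in hits
        if hit.coverage >= coverage_threshold and hit.identity >= identity_threshold
    ]


if __name__ == "__main__":
    example_hits = [
        AmrHit("geneA", coverage=95.0, identity=99.1),  # passes both thresholds
        AmrHit("geneB", coverage=75.0, identity=98.0),  # fails coverage
        AmrHit("geneC", coverage=88.0, identity=85.0),  # fails identity
    ]
    print([hit.gene for hit in select_amr_hits(example_hits)])
```

Raising either threshold makes the selection stricter; lowering it admits more distant or partial matches.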

Database and File Names

The configuration system allows you to customize:

Database names (NODE_SYNONYMIZER_DBNAME, NEO4J_DBNAME)
KG file names for different processing versions
Output file names for final Neo4j import

Build MetagenomicKG

We built an automated Snakemake pipeline to rebuild MetagenomicKG. Because MetagenomicKG uses RTX-KG2, which includes UMLS data, you need to contact the authors and demonstrate that you have accepted the license terms before you can download KG2. We will then provide you with a password so you can download the file here. Note: this file is needed only to rebuild MetagenomicKG, not to view or interact with it. Once you have access, download the kg2c-tsv.tar.gz file and place it in the ./data/RTX_KG2 folder.

After downloading the RTX-KG2 TSV files, run the pipeline via:

snakemake --cores 16 -s run_buildKG_pipeline.smk targets

Once it completes, you can find the merged node and edge TSV files (KG_nodes_v6.tsv and KG_edges_v6.tsv) in the ./data/merged_KG folder.
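
As a quick sanity check on the merged files, you could tally rows per category and compare the result with the statistics tables. The sketch below is an assumption-laden example: the function count_by_column is hypothetical, and the column name holding the node or edge type is not documented here, so it must be supplied by you after inspecting the file header.

```python
import csv
from collections import Counter


def count_by_column(tsv_path, column):
    """Count rows per distinct value of `column` in a tab-separated file with a header."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {tsv_path}")
        return Counter(row[column] for row in reader)


# Example call; "node_type" is a placeholder column name, not a documented one:
# counts = count_by_column("./data/merged_KG/KG_nodes_v6.tsv", "node_type")
# for value, count in counts.most_common():
#     print(value, count)
```

The function raises a KeyError when the column is missing, which makes a wrong guess about the header obvious right away.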

Replicate Use Case1 Hypothesis Generation and Exploration

Once you log in to the Neo4j instance of MetagenomicKG (see the login information in the Neo4j instance section), you can use the following Cypher query to replicate what we show in the paper. This query finds at most 10 paths of the form protein - pathogen with a name containing 'Staphylococcus aureus' - KO - pathway - disease - drug.

MATCH p=(n0:`biolink:Protein`)-[]-(n1:`biolink:OrganismTaxon`)-[]-(n2:`biolink:BiologicalEntity`)-[]-(n3:`biolink:Pathway`)-[]-(n4:`biolink:Disease`)-[]-(n5:`biolink:Drug`)
WHERE n1.is_pathogen = "True" and ANY (n1_names IN n1.all_names WHERE n1_names contains 'Staphylococcus aureus')
RETURN p LIMIT 10
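
The same query can also be run programmatically. The sketch below assumes the official neo4j Python driver is installed (pip install neo4j) and that a Bolt endpoint is reachable; the Bolt address of the hosted instance is not stated in this README, so the local address from config.yml is used as a placeholder. The function names and the parameterized form of the query ($name, $limit) are illustrative choices, not part of the repository.

```python
# Parameterized version of the use case 1 query shown above.
USE_CASE1_QUERY = """
MATCH p=(n0:`biolink:Protein`)-[]-(n1:`biolink:OrganismTaxon`)-[]-(n2:`biolink:BiologicalEntity`)-[]-(n3:`biolink:Pathway`)-[]-(n4:`biolink:Disease`)-[]-(n5:`biolink:Drug`)
WHERE n1.is_pathogen = "True" AND ANY (n1_names IN n1.all_names WHERE n1_names CONTAINS $name)
RETURN p LIMIT $limit
"""


def query_parameters(pathogen_name, limit=10):
    """Build the parameter map for USE_CASE1_QUERY."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return {"name": pathogen_name, "limit": limit}


def fetch_paths(uri, user, password, pathogen_name, limit=10):
    """Run the query and return the matched paths (requires the neo4j package)."""
    from neo4j import GraphDatabase  # third-party driver, imported lazily

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            result = session.run(USE_CASE1_QUERY, query_parameters(pathogen_name, limit))
            return [record["p"] for record in result]


# Example call; the URI and credentials are placeholders for your own instance:
# paths = fetch_paths("bolt://localhost:7687", "neo4j", "your_neo4j_password",
#                     "Staphylococcus aureus")
# print(len(paths))
```

Passing the pathogen name as a query parameter rather than formatting it into the string avoids quoting problems and lets you reuse the query for other organisms.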

Replicate Use Case2 Sample-specific Graph Embeddings

To replicate the results of use case 2, run the Snakemake pipeline via:

snakemake --cores 16 -s run_usecase2_pipeline.smk targets

Replicate Use Case3 Pathogen Predictions

To replicate the results of use case 3, run the Snakemake pipeline via:

snakemake --cores 16 -s run_usecase3_pipeline.smk targets

Contact

If you have any questions or need help, please contact @chunyuma or @ShaopengLiu1 or @stephwon or @dkoslicki.
