Tutorial

Environment installation

Install with Conda environment

Create a conda environment (tested with conda 25.1.1):

conda create -n meta_fr python=3.8 r-base=4.2 -c conda-forge
conda activate meta_fr

Install required Python packages

pip install networkx==2.8.7
pip install ipykernel==5.3.4
pip install ipython==8.12.3
pip install ipython-genutils==0.2.0
pip install matplotlib
pip install pandas==1.1.3
pip install statsmodels==0.14.0
pip install svglib
pip install scikit-learn==1.1.2
pip install scikit-learn-extra==0.2.0
pip install scikit-network==0.27.1
pip install scipy==1.10.1
pip install seaborn==0.12.0
pip install reportlab==3.6.12
pip install lifelines==0.27.8
pip install cliffs-delta
pip install pyseat
pip install numpy==1.22.4
pip install pandas==1.5.2
pip install matplotlib_venn
python -m ipykernel install --user --name meta_fr --display-name "Python (meta_fr)"

In Jupyter Notebook, select the kernel Python (meta_fr).

💡 Note: PySEAT conflicts with the pinned numpy version. Please keep numpy 1.22.4 and ignore the following warning shown during installation:

pyseat 0.0.1.4 requires numpy>=1.23.3, but you have numpy 1.22.4 which is incompatible.
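
To confirm that the pinned versions are the ones actually active in the meta_fr kernel, a small check along the following lines may help. This is only a sketch: PINS repeats a few pins from the list above, and installed_version and check_pins are hypothetical helpers that are not part of the repository.

```python
from importlib import metadata  # standard library since Python 3.8

# Illustrative subset of the pins listed above.
PINS = {"numpy": "1.22.4", "pandas": "1.5.2", "scipy": "1.10.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every package that deviates."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in check_pins(PINS).items():
        print(f"{package}: expected {expected}, found {found}")
```

An empty result means every listed package matches its pin.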

Install required R packages

conda install r-effsize r-ggplot2 r-ggpubr r-svglite r-reshape2 r-dplyr r-tidyr r-readxl r-randomForest r-pROC

Input files

git clone https://github.com/deepomicslab/FR_Hierarchy_Gut
cd FR_Hierarchy_Gut/

Then unzip data.zip. This creates the data/ directory:

data/
├── gcn2008.tsv                                  # GCN of 2008 species
├── sp_d.tsv                                     # Precomputed distance matrix for 2008 species in GCN
├── module_def0507.tsv                           # Definition of module in KEGG
├── cMD.select_2008.select_genome.list           # Genomes to create GCN2008
├── cMD.select_2008.tax.fullname.txt             # Full taxonomy of species
├── cMD.select_2008.species_phylum.tsv           # Species phylum matching
│
├── [ACVD, CRC, asthma, carcinoma_surgery_history, STH, migraine, BD, IBD, 
│   T2D, hypertension, CFS, IGT, adenoma, schizofrenia]/  # Disease categories
│   ├── [cohort_name1, cohort_name2, ...]/                # Multiple cohorts per disease
│   │   ├── metadata.tsv                                  # Metadata (disease in header)
│   │   └── abd.tsv                                       # Abundance profile (species × samples)
│
├── NAFLD/                                       # NAFLD dataset
│   ├── NASH_forward_63_map.txt                  # Metadata of phenotypes for NASH dataset
│   ├── abd.tsv                                  # 16S species level profile
│   ├── NASH_GCN.tsv                             # GCN of NASH for 16S species name
│   └── taxonomy.tsv                             # Class family species matching
│
├── Anti/                                        # Antibiotic treatment dataset
│   ├── metadata.tsv                             # Metadata
│   ├── abd.tsv                                  # Abundance profile
│   ├── Anti.compare.list                        # Comparison list of groups
│   ├── Anti.group.tsv                           # Group information of samples
│   └── Antibiotic.diversity.Frederic.tsv        # Taxonomic diversity (doi: 10.1038/ismej.2015.148)
│
├── FMT/                                         # Fecal microbiota transplantation dataset
│   ├── FMT1/
│   │   ├── LiSS_2016.tsv                        # Species profile (index: species, header: sample name)
│   │   └── Li.txt                               # Fraction of donor specific strains
│   └── FMT2/
│       ├── Eric_abd.tsv                         # Species level profile
│       └── Eric.txt                             # Fraction of donor specific strains
│
└── NSCLC/                                        # Immunotherapy dataset
    ├── merged_species.tsv                       # Species level abundance profile
    ├── sig.txt                                  # Classification of species in original work
    ├── metadata.txt                             # Metadata including cohort
    ├── DS1_oncology_clinical_data.csv           # Metadata including death, os, akk in original work
    └── DS5_longitudinal_clinical_data.csv       # Metadata including akk level in original work
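
After unzipping, the disease cohorts can be enumerated from the layout above. The following is a minimal sketch that assumes the data/{disease}/{cohort}/ structure shown in the tree; find_cohorts is a hypothetical helper, and it reports any two-level directory that holds both metadata.tsv and abd.tsv, so filter the result to the diseases of interest.

```python
from pathlib import Path


def find_cohorts(data_dir):
    """Yield (disease, cohort) for each <data_dir>/<disease>/<cohort>/ holding both files."""
    for metadata in sorted(Path(data_dir).glob("*/*/metadata.tsv")):
        cohort_dir = metadata.parent
        if (cohort_dir / "abd.tsv").is_file():
            yield cohort_dir.parent.name, cohort_dir.name


if __name__ == "__main__":
    for disease, cohort in find_cohorts("data"):
        print(disease, cohort)
```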

💡 Data sources:
data/NAFLD/* from doi: 10.1002/imt2.61
data/FMT/FMT1/Li.txt from doi: 10.1038/s41467-020-19940-1
data/FMT/FMT2/Eric.txt from doi: 10.1038/s41467-020-19940-1
data/NSCLC/DS* from doi: 10.1016/j.cell.2024.05.029
data/NSCLC/sig.txt from doi: 10.1016/j.cell.2024.05.029

Scripts

We highly recommend running the scripts in the directory sequentially in the following order.

1. Prior GCN structure (01.script_priori_tree/)

Scripts of manuscript section Constructing a priori functional redundancy hierarchical structure of species via structural entropy

a. Compute species distance from GCN [optional]

01.script_priori_tree/a.compute_distance.ipynb

To start the analysis from the GCN itself, run this script first to compute the distance matrix, which is written to sp_d.tsv. This may take around 20 minutes. To save time, you can use the precomputed sp_d.tsv in the data/ directory instead.

input: ../data/gcn2008.tsv GCN of 2008 species
output: ../data/sp_d.tsv Distance matrix
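
The exact metric is defined in a.compute_distance.ipynb. As an illustration of how a species distance can be derived from gene content, the sketch below assumes a weighted Jaccard distance between copy-number profiles; this choice, the function names, and the toy species are assumptions and may not match the notebook.

```python
def weighted_jaccard_distance(profile_a, profile_b):
    """Distance between two gene-content profiles given as {gene_family: copy_number}."""
    genes = set(profile_a) | set(profile_b)
    shared = sum(min(profile_a.get(g, 0), profile_b.get(g, 0)) for g in genes)
    total = sum(max(profile_a.get(g, 0), profile_b.get(g, 0)) for g in genes)
    return 0.0 if total == 0 else 1.0 - shared / total


def distance_matrix(gcn):
    """Pairwise distances for a {species: profile} mapping, returned as nested dicts."""
    species = sorted(gcn)
    return {a: {b: weighted_jaccard_distance(gcn[a], gcn[b]) for b in species} for a in species}


# Toy GCN with hypothetical species and KO copy numbers.
toy_gcn = {
    "sp_A": {"K00001": 2, "K00002": 1},
    "sp_B": {"K00001": 1, "K00003": 1},
}
print(distance_matrix(toy_gcn)["sp_A"]["sp_B"])
```

Two species with identical profiles get distance 0, and species sharing no gene family get distance 1.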

b. Constructing a priori functional redundancy hierarchical structure of species via structural entropy

01.script_priori_tree/b.GCN_tree.ipynb

💡 Please run this script before the FMT, Antibiotic, and NSCLC analyses, and before any other section marked "GCN_fix_tree result is required"; they all depend on the prior structure.

inputs:
- data/gcn2008.tsv
- data/sp_d.tsv
outputs:
- result/GCN_fix_tree/
  - renamed_GCN_tree.newick Tree structure in Newick format
  - leaves_cluster.tsv Species FRC annotation

🔍 Preview of leaves_cluster.tsv

species	cluster	supercluster
s__Rhodococcus_fascians	S2_C1	S2
s__Nocardia_farcinica	S2_C1	S2
s__Rhodococcus_hoagii	S2_C1	S2
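
The annotation file can be read with the standard library alone. The sketch below parses the three preview rows shown above and groups species by FRC; species_by_cluster is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# The three preview rows above, tab-separated as in leaves_cluster.tsv.
PREVIEW = (
    "species\tcluster\tsupercluster\n"
    "s__Rhodococcus_fascians\tS2_C1\tS2\n"
    "s__Nocardia_farcinica\tS2_C1\tS2\n"
    "s__Rhodococcus_hoagii\tS2_C1\tS2\n"
)


def species_by_cluster(handle):
    """Map each FRC label to the list of species assigned to it."""
    members = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        members[row["cluster"]].append(row["species"])
    return dict(members)


print(species_by_cluster(io.StringIO(PREVIEW)))
```

For the real file, pass an open file handle for result/GCN_fix_tree/leaves_cluster.tsv instead of the in-memory preview.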

c. Detect FRC/supercluster enriched/depleted KOs

01.script_priori_tree/c.KO_compare.ipynb Uses S1-C8 as an example.

inputs:
- data/gcn2008.tsv
- result/GCN_fix_tree/leaves_cluster.tsv
outputs:
- result/GCN_fix_tree/
  - S1_C8.kos_summary.tsv Statistics of KOs present in S1-C8
  - S1_C8.kos_fisher.tsv Fisher's exact test results

🔍 Preview of S1_C8.kos_fisher.tsv

KO	S1_C8 Present	S1_C8 Absent	Non S1_C8 Present	Non S1_C8 Absent	Odds Ratio	P-value	Adjusted P-value
K03648	6	48	1706	248	1.82E-02	4.47E-35	2.63E-31
K00560	5	49	1576	378	2.45E-02	1.30E-28	3.80E-25
K02837	6	48	1543	411	3.33E-02	1.54E-25	3.01E-22
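
The columns of this table form a 2x2 contingency table per KO. As a sanity check, the sketch below recomputes the odds ratio for the first preview row and a two-sided Fisher's exact p-value in pure Python. Whether the notebook uses a two-sided test is an assumption, and the function names are hypothetical.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio of the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test: sum of all table probabilities <= the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x "present" species inside the cluster.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    observed = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


# First preview row (K03648): present/absent in S1_C8, present/absent elsewhere.
print(odds_ratio(6, 48, 1706, 248))        # about 0.0182, as in the preview
print(fisher_two_sided(6, 48, 1706, 248))
```

An odds ratio far below 1 with a very small p-value indicates that the KO is depleted in the cluster.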

Evaluation of GCN

01.script_priori_tree/util_evaluation.ipynb Evaluates the features of the GCN following the original study.

inputs:
- data/gcn2008.tsv
- data/sp_d.tsv
outputs:
- result/GCN_evaluation/evaluation.png The plot of evaluation result

2. Completeness of FRC (02.script_signature_modules/)

Result of manuscript section Functional redundancy hierarchical structure reveals species clusters with distinct functions

a. Compute the module completeness of each taxon in GCN2008

02.script_signature_modules/a.genome_module_completeness.ipynb

input:
- data/module_def0507.tsv
- data/gcn2008.tsv
output:
- result/signature_modules/genome_module.completeness.tsv Genome module completeness matrix, with species names as row names and KEGG modules as columns.
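
To give an idea of what a completeness value expresses, the sketch below scores a module as the fraction of its steps for which a genome carries at least one alternative KO. Both this step-based definition and the toy module are assumptions for illustration; the format of module_def0507.tsv and the scoring used in the notebook may differ.

```python
def module_completeness(steps, genome_kos):
    """Fraction of module steps for which the genome carries at least one alternative KO."""
    if not steps:
        return 0.0
    satisfied = sum(1 for alternatives in steps if alternatives & genome_kos)
    return satisfied / len(steps)


# Hypothetical three-step module; each step lists interchangeable KOs.
toy_module = [{"K00001", "K00002"}, {"K00003"}, {"K00004", "K00005"}]
toy_genome = {"K00002", "K00005", "K09999"}

print(module_completeness(toy_module, toy_genome))
```

Here two of the three steps are covered, so the completeness is 2/3.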

b. Signature modules of superclusters/FRCs

02.script_signature_modules/b.signature_modules.ipynb (requires 02.script_signature_modules/cluster_completeness_testing.R)

input:
- result/GCN_fix_tree/leaves_cluster.tsv
- result/signature_modules/genome_module.completeness.tsv
output:
- result/signature_modules/
  - *_species.tsv Species involved in comparison with FRC/superclusters annotation
  - *.genome_module.completeness.tsv Split genome module completeness of each supercluster
  - *.module_comp.wilcox.testing.tsv Testing results of module completeness comparison
  - cluster_module_signature.tsv Summary of signature modules of superclusters/FRCs.

3. FMT (03.script_FMT/)

Scripts of manuscript section Structural entropy of vitamin $K_1$, $K_2$ and $B_2$ biosynthesis FRC in the recipient decreased the fecal microbiota transplantation engraftment efficiency

GCN_fix_tree result is required

input:
- result/GCN_fix_tree/renamed_GCN_tree.newick
- ../data/sp_d.tsv
- For FMT1:
  - data/FMT/FMT1/metadata.tsv
  - data/FMT/FMT1/fmt_abd.tsv
  - data/FMT/FMT1/Li.txt
- For FMT2:
  - data/FMT/FMT2/Eric.txt
  - data/FMT/FMT2/deltat.txt
  - data/FMT/FMT2/triads.txt
  - data/FMT/FMT2/Eric_abd.tsv

a. Multiple regression on nFR

03.script_FMT/a.analysis_nfr*.ipynb Multiple regression on nFR, days after FMT and fraction at each FRC/supercluster.

b. Multiple regression on SE value

03.script_FMT/b.analysis_se*.ipynb Multiple regression on SE value, days after FMT and fraction at each FRC/supercluster.

c. Multiple regression on FR

03.script_FMT/c.analysis_fr*.ipynb Multiple regression on FR, days after FMT and fraction at each cluster/supercluster.

Output
- result/FMT/*/*/ (First * can be nFR/SE/FR, second * can be FMT1/FMT2)
  - [cluster].tsv Regression plot data
  - [cluster].pdf Plot of regression
  - p_values.tsv F-test p-values of regression, coefficient and its p-values

🔍 Preview of [cluster].tsv

sample	SE_pre	t_post	f_ds
FMT1	0.79168257	2	0.302325581
FMT1	0.83223	2	0.233333

🔍 Preview of p_values.tsv

	F-pvalue	se_co	t_co	const_co	se_p	t_p	const_p
cluster_S1-C3	0.003328	-0.94618	-0.00035	0.514195	0.00079	0.7069	2.80E-15
cluster_S1-C15	0.019	-1.4490	-0.00035	0.515	0.005268	0.7230	2.90E-14

d. Compute FR at each cluster/supercluster for each timepoint

03.script_FMT/d.analysis_compute_fr*.ipynb

Output
- result/FMT/FR_timepoints/*/ (* can be FMT1/FMT2)
  - fr.tsv FR values of each sample at each timepoint

e. Multiple regression on fd/td/nFR at root

03.script_FMT/e.root_*.ipynb Multiple regression on fd/td/nFR at root, days after FMT and fraction only at root.

Output
- result/FMT/root/*/ (* can be FMT1/FMT2; fd is used as an example here, nFR and td are similar)
  - fd.tsv fd values of each sample
  - fd_root.pdf Plot of regression of fd value
  - fd_p_values.tsv F-test p-values of regression, coefficient and its p-values

f. Merge and output results

03.script_FMT/f.merge_S4.ipynb

Output
- result/FMT
  - supp_FMT.tsv Regression results for nFR and SE in the two cohorts,
    reported as Supplementary Table S4

4. Antibiotic treatment (04.script_Antibiotic/)

Scripts of manuscript section Low preservation of FRCs in the initial state leads to distinct reshaping of the gut microbiome after cefprozil exposure

GCN_fix_tree result is required

a. nFR analysis

04.script_Antibiotic/a.analysis_nFR.ipynb

input:
- data/sp_d.tsv
- result/GCN_fix_tree/renamed_GCN_tree.newick
- data/Anti/metadata.csv
- data/Anti/abd.csv
output:
- result/Anti/nFR
  - nfr_df.tsv nFR value of each FRC at each timepoint for each sample
  - cluster_[FRC].pdf Boxplot of nFR values of the FRC at the three timepoints
  - p_value.tsv p-values of nFR differential tests between the exposed and control groups at each timepoint for each FRC

b. SE analysis

04.script_Antibiotic/b.analysis_SE.ipynb

input:
- data/sp_d.tsv
- result/GCN_fix_tree/renamed_GCN_tree.newick
- data/Anti/metadata.csv
- data/Anti/abd.csv
output:
- result/Anti/SE
  - se_df.tsv SE value of each FRC at each timepoint for each sample
  - cluster_[FRC].pdf Boxplot of SE values of the FRC at the three timepoints
  - p_value.tsv p-values of SE differential tests between the exposed and control groups at each timepoint for each FRC

c. Differential testing of SE/nFR

04.script_Antibiotic/c.fr_differential_testing.ipynb

input:
- result/Anti/nFR/nfr_df.tsv
- result/Anti/SE/se_df.tsv
- data/Anti/Anti.group.tsv Group information of samples
output:
- result/Anti/nFR/nfr.EB_EN.differential.tsv
- result/Anti/SE/SE.EB_EN.differential.tsv

These results are reported as Supplementary Table S5.

🔍 Preview of SE.EB_EN.differential.tsv

FR	Group1	Group2	Cluster	p_value	enriched	mean_g1	mean_g2
SE	EB_7	EN_7	cluster_S1-C1	0.0135	EB_7	0.3299	0.0997
SE	EB_7	EN_7	cluster_S1-C8	0.0415	EN_7	0.0140	0.1108
SE	EB_7	EN_7	cluster_S3-C1	0.0296	EB_7	0.0043	0.0004

d. Eigenspecies analysis

04.script_Antibiotic/d.eigenspecies.ipynb (requires 04.script_Antibiotic/eigenspecies_utils.py)

The notebook performs the following steps:
- Prepare the group file for each comparison pair (two groups per comparison).
- Calculate the eigenspecies of all FRCs across all samples in the two groups.
- Construct an eigenspecies correlation network for each group.
- Compute the preservation matrix between the correlation matrices of the two groups.
- Compare the eigenspecies networks between the two groups.

input:
- data/Anti/Anti.group.tsv Group information of samples
- data/Anti/Anti.compare.list Comparison list of groups, e.g. EB0 EN0
- result/GCN_fix_tree/leaves_cluster.tsv
- data/Anti/abd.tsv
output for given group {g1} and group {g2}:
- result/Anti/eigenspecies
  - {g1}.{g2}.group.tsv Samples of two groups
  - {g1}.{g2}.eigenspecies.csv Eigenspecies of FRC
  - {g1}.{g2}.eigenspecies_cor.{g1}.tsv Eigenspecies correlation network of {g1}
  - {g1}.{g2}.eigenspecies_cor.{g2}.tsv Eigenspecies correlation network of {g2}
  - {g1}.{g2}.preserv_matrix.tsv Preservation matrix of two eigenspecies correlation networks
  - {g1}.{g2}.preserv_matrix.png Visualization of preservation matrix
  - {g1}.{g2}.compare_eigenspecies_networks.tsv Differential testing of FRC eigenspecies between two groups
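
The steps above can be sketched as follows. This is an illustration only: it assumes that an eigenspecies is the first principal component of the standardized abundances of the species in an FRC (by analogy with module eigengenes), and that preservation is 1 - |cor1 - cor2| / 2. The actual definitions live in eigenspecies_utils.py and may differ; all names and the random toy data are hypothetical.

```python
import numpy as np


def eigenspecies(abundance):
    """First principal component across samples of a (species x samples) abundance block."""
    x = np.asarray(abundance, dtype=float)
    x = x - x.mean(axis=1, keepdims=True)          # center each species
    scale = x.std(axis=1, keepdims=True)
    x = np.divide(x, scale, out=np.zeros_like(x), where=scale > 0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    component = vt[0]
    # Fix the arbitrary sign so the component follows the mean profile.
    return component if component @ x.mean(axis=0) >= 0 else -component


def preservation(cor_g1, cor_g2):
    """Element-wise preservation of two correlation matrices, in [0, 1]."""
    return 1.0 - np.abs(np.asarray(cor_g1) - np.asarray(cor_g2)) / 2.0


rng = np.random.default_rng(0)
frc_blocks = [rng.random((4, 10)) for _ in range(3)]   # 3 toy FRCs, 4 species, 10 samples
eig = np.vstack([eigenspecies(block) for block in frc_blocks])
cor = np.corrcoef(eig)                                  # eigenspecies correlation network
print(preservation(cor, cor))                           # identical networks
```

Comparing a network with itself gives a preservation of 1 everywhere; opposite correlations give 0.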

e. Correlation between eigenspecies and taxonomic diversity

04.script_Antibiotic/e.correlation_diversity.ipynb

input:
- data/Anti/Antibiotic.diversity.Frederic.tsv Taxonomic diversity from Supplementary Table 1 of doi: 10.1038/ismej.2015.148
- result/Anti/eigenspecies/EB_0.EN_0.eigenspecies.csv Eigenspecies of EB and EN at day0.
output:
- Correlation between FRC eigenspecies and diversity, with p-values, shown in the notebook.

5. NAFLD (05.script_NAFLD/)

Scripts of manuscript section FR keystone species in personalized FR network reveals polycentric structure in healthy individuals and monocentric in non-alcoholic steatohepatitis patients

a. Abundance differential testing of each taxon in NAFLD 16S OTU

05.script_NAFLD/abundance_differential_testing.ipynb Test difference between NASH and Normal group

input:
- data/NAFLD/abd.tsv
- data/NAFLD/NASH_forward_63_map.txt
output:
- result/NAFLD/NASH.Normal.abundance.wilcox_testing.tsv Differential testing result

b. Analyze the NAFLD dataset using NAFLD GCN

05.script_NAFLD/procedure.ipynb Analyzes the NAFLD dataset using the NAFLD GCN, computes the personalized FR network, and finds keystone clusters in the NASH group and the Normal group.

input:
- data/NAFLD/abd.tsv
- data/NAFLD/NASH_forward_63_map.txt
- data/NAFLD/NASH_GCN.tsv
output:
- result/NAFLD/cluster_*/ (* can be NASH/Normal)
  - keystone_node.tsv Species and FRCs with their PR score
- result/NAFLD
  - genome_module.completeness.tsv Completeness of module for each species
  - *.module_comp.wilcox.testing.tsv Testing results of module completeness comparison
  - *_species.tsv Species involved in comparison with FRC/superclusters annotation

6. NSCLC (06.script_NSCLC/)

GCN_fix_tree result is required

Scripts of manuscript section FRCs as immune checkpoint inhibitor indicators can predict patient survival

a. Reproduce original SIG classification

06.script_NSCLC/SIG_SE.ipynb Tests the difference in SE between the response and non-response groups at the SIG1/SIG2 clusters proposed in the original study and computes the S score for each sample.

input:
- data/NSCLC/merged_species.txt
- data/NSCLC/metadata.txt
- data/NSCLC/sig.txt
- data/gcn2008.tsv
- data/sp_d.tsv
- data/NSCLC/DS1_oncology_clinical_data.csv
output:
- result/NSCLC/SIG_SE/
  - fig_kde_disc.pdf Plot of distribution of TOPOSCORE in NR and R group
  - fig_ROC.pdf Plot of ROC for NR/R classification
  - pred_binary_disc.tsv Classification result and real group label for each sample
  - NSCLC.pdf FRC with significant SE difference between NR and R group
  - cluster_sp.json Species list of each FRC
  - existed_sp.json Species present in each sample for each FRC

b. Use SE of FRC to classify NR/R groups

06.script_NSCLC/FRC_SE.ipynb Test difference of SE between response group and non-response group at each cluster/supercluster and compute FR S score for each sample.

input:
- data/NSCLC/merged_species.txt
- data/NSCLC/metadata.txt
- data/gcn2008.tsv
- data/sp_d.tsv
- data/NSCLC/DS1_oncology_clinical_data.csv
output:
- result/NSCLC/FRC_SE/
  - fig_kde_disc.pdf Plot of distribution of TOPOSCORE in NR and R group
  - fig_ROC.pdf Plot of ROC for NR/R classification
  - OS_curve.pdf Plot of OS curve
  - pred_binary_disc.tsv Classification result and real group label for each sample
  - NSCLC.pdf FRC with significant SE difference between NR and R group
  - cluster_sp.json Species list of each FRC
  - existed_sp.json Species present in each sample for each FRC

c. Use FRC with SIG as SIG' to classify NR/R groups

06.script_NSCLC/c.combination_S_score.ipynb Compute combined sig' S score for each sample.

input:
- result/NSCLC/SIG_SE/cluster_sp.json
- result/NSCLC/SIG_SE/existed_sp.json
- result/NSCLC/FRC_SE/existed_sp.json
- result/NSCLC/FRC_SE/cluster_sp.json
- data/NSCLC/DS1_oncology_clinical_data.csv
output:
- result/NSCLC/combine/
  - fig_kde_disc.pdf Plot of distribution of TOPOSCORE in NR and R group
  - fig_ROC.pdf Plot of ROC for NR/R classification
  - OS_curve.pdf Plot of OS curve
  - pred_binary_disc.tsv Classification result and real group label for each sample

The R scripts used to reproduce the analysis of the original study are provided at https://github.com/valerioiebba/TOPOSCORE/tree/main.

7. Large-scale cohort analysis on priori tree (07.script_cohorts_FRC/)

GCN_fix_tree result is required

Scripts of manuscript section Structural entropy of FRCs identified as robust phenotype-specific indicators

input:
- data/gcn2008.tsv
- data/sp_d.tsv
- result/GCN_fix_tree/renamed_GCN_tree.newick
- data/{disease}/{cohort}/ (diseases include ACVD, CRC, asthma, carcinoma_surgery_history, STH, migraine, BD, IBD, T2D, hypertension, CFS, IGT, adenoma, schizofrenia)
  - metadata.tsv
  - abd.tsv

a. Compute SE values for FRCs

07.script_cohorts_FRC/a.analysis_SE.ipynb Computes SE for FRCs in the disease and healthy groups and tests the difference.

b. Compute nFR values for FRCs

07.script_cohorts_FRC/b.analysis_nFR.ipynb Computes nFR for FRCs in the disease and healthy groups and tests the difference.

output:
SE is used as an example; nFR is similar.
- result/large_scale_cohort/{disease}/{cohort}/SE/se_*.tsv SE values of FRCs in the two groups
- result/large_scale_cohort/{disease}/{disease}_se.pdf Plot of FRCs with a significant SE difference in at least one cohort of the disease
- result/large_scale_cohort/p_all_cohorts_se.tsv p-values of SE for each FRC in all cohorts
- result/large_scale_cohort/p_all_cohorts_se.svg Plot of FRCs with a significant SE difference in at least one cohort across all diseases

c. Differential testing between disease and health group

07.script_cohorts_FRC/c.SE_nFR_differential_testing.ipynb Outputs detailed statistics of SE/nFR.

input:
- result/large_scale_cohort/p_all_cohorts_se.tsv
- result/large_scale_cohort/p_all_cohorts_nfr.tsv
- result/large_scale_cohort/{disease}/{cohort}/SE/se_*.tsv
- result/large_scale_cohort/{disease}/{cohort}/nFR/fr_*.tsv
output:
- result/large_scale_cohort/{disease}/{cohort}/SE/p_detail.tsv Detailed statistics of SE
- result/large_scale_cohort/{disease}/{cohort}/nFR/p_detail.tsv Detailed statistics of nFR

d. Predict phenotype by SE in FRCs

input:
- result/large_scale_cohort/p_all_cohorts_se.tsv
- result/large_scale_cohort/{disease}/{cohort}/SE/se_*.tsv

07.script_cohorts_FRC/d.CRC_predict_LODO.ipynb Predicts CRC with LODO validation.
07.script_cohorts_FRC/d.IBD_predict_CV.ipynb Predicts IBD with cross-validation.
07.script_cohorts_FRC/d.IBD_predict_LODO.ipynb Predicts IBD with LODO validation.

output:
- result/predict/{cohort}_{prediction_type}.tsv ROC plot of the prediction
- result/predict/feature_importance_{prediction_type}.tsv Importance of SE in FRCs
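
Assuming LODO stands for leave-one-dataset-out, each cohort is held out in turn while a model is trained on the remaining cohorts. The sketch below only builds the index splits. lodo_splits and the cohort labels are hypothetical, and the classifier used in the notebooks is not shown here.

```python
def lodo_splits(sample_cohorts):
    """Yield (held_out, train_idx, test_idx), leaving one cohort out at a time."""
    for held_out in sorted(set(sample_cohorts)):
        train = [i for i, c in enumerate(sample_cohorts) if c != held_out]
        test = [i for i, c in enumerate(sample_cohorts) if c == held_out]
        yield held_out, train, test


# Hypothetical cohort label per sample.
cohorts = ["CRC1", "CRC1", "CRC2", "CRC3", "CRC3"]
for held_out, train, test in lodo_splits(cohorts):
    print(held_out, train, test)
```

scikit-learn (installed above) provides the same splitting logic as LeaveOneGroupOut in sklearn.model_selection.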

f. Random experiment on phenotype

07.script_cohorts_FRC/f.pheno_related.ipynb Randomly shuffles the sample labels 100 times to test the relation between the SE of FRCs and phenotypes.

input:
CRC is used as an example; IBD is also used in this experiment.
- result/large_scale_cohort/CRC/p_all_cohorts_se.tsv
- result/large_scale_cohort/CRC/{cohort}/SE/se_*.tsv
- data/CRC/{cohort}/
  - metadata.tsv
  - abd.tsv
output:
- result/validation/phenotype_shuffle/CRC/{cohort}/se_{FRC}*.tsv SE of one random experiment for the FRC
- result/validation/phenotype_shuffle/CRC/{cohort}/pvalues.tsv p-values of the difference between the disease and healthy groups in the 100 random experiments
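
The idea of the label-shuffling experiment can be sketched as below. This simplified version compares an observed difference in group means with the differences obtained after shuffling the labels; the notebook instead records a p-value per shuffle, so the statistic, the function names, and the toy values here are assumptions.

```python
import random


def mean_difference(values, labels):
    """Absolute difference in mean SE between the two label groups (0 and 1)."""
    g0 = [v for v, l in zip(values, labels) if l == 0]
    g1 = [v for v, l in zip(values, labels) if l == 1]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


def shuffle_test(values, labels, n_shuffles=100, seed=0):
    """Fraction of label shuffles whose statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mean_difference(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        hits += mean_difference(values, shuffled) >= observed
    return hits / n_shuffles


# Toy SE values for three healthy (0) and three disease (1) samples.
se_values = [0.10, 0.20, 0.15, 0.90, 0.80, 0.85]
labels = [0, 0, 0, 1, 1, 1]
print(shuffle_test(se_values, labels))
```

A small fraction means that random labelings rarely reproduce the observed group difference.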

8. Personalized FR network analysis (08.script_cohorts_keystone/)

Scripts of manuscript section Integrating taxonomic composition to construct a personalized FR network

a. Abundance differential testing of each species

08.script_cohorts_keystone/a.abundance_differential_testing.ipynb Tests the abundance difference between the disease and healthy groups.

input:
- data/{disease}/{cohort}/ (diseases include ACVD, CRC, asthma, carcinoma_surgery_history, STH, migraine, BD, IBD, T2D, hypertension, CFS, IGT, adenoma, schizofrenia)
  - metadata.tsv
  - abd.tsv
output:
- result/large_scale_cohort/{disease}/{cohort}/{cohort}.abundance.wilcox_testing.tsv Differential testing result

b. Find keystone species and keystone cluster in personalized FR network

08.script_cohorts_keystone/b.personalized_FR_keystone.ipynb

input:
- data/gcn2008.tsv
- data/sp_d.tsv
- data/{disease}/{cohort}/ (diseases include ACVD, CRC, asthma, carcinoma_surgery_history, STH, migraine, BD, IBD, T2D, hypertension, CFS, IGT, adenoma, schizofrenia)
  - metadata.tsv
  - abd.tsv
output:
- result/large_scale_cohort/{disease}/{cohort}/sp/cluster_*/keystone_node.tsv Species and FRCs with their PR score
- result/large_scale_cohort/{disease}/{cohort}/sp/layer_0/fr.tsv Personalized FR network

c. Summarize the keystone species in different cohort

08.script_cohorts_keystone/c.keystone_summary.ipynb

input:
- result/large_scale_cohort/{disease}/{cohort}/sp/cluster_*/keystone_node.tsv

9. Personalized FR network nestedness (09.script_personalized_FR_nestedness/)

Log effect of personalized FR network

09.script_personalized_FR_nestedness/util_log_effect.ipynb Computes and shows the distribution of the personalized FR network before and after log rescaling and normalization.

input:
- data/gcn2008.tsv
- data/sp_d.tsv
- data/CRC/CRC1/metadata.tsv
- data/CRC/CRC1/abd.tsv

Nestedness of personalized FR network

Personalized FR network is required.

09.script_personalized_FR_nestedness/util_nestedness_experiment.ipynb Tests the nestedness of the personalized FR network against NULL experiments.

input:
- result/large_scale_cohort/{disease}/{cohort}/sp/layer_0/fr.tsv (diseases include ACVD, CRC, asthma, carcinoma_surgery_history, STH, migraine, BD, IBD, T2D, hypertension, CFS, IGT, adenoma, schizofrenia)
output:
- result/personalized_FR_nestedness/p_df.tsv p-values of the comparison between the nestedness of the real FR network and that of the NULL model

10. Eigenspecies analysis (10.script_cohorts_eigenspecies/)

Result - Eigenspecies of FRCs demonstrate potential as cross-cohort indicators of age and BMI

GCN_fix_tree result is required

10.script_cohorts_eigenspecies/a.eigenspecies.ipynb Analyzes 28 cohorts with the eigenspecies framework.

input:
- result/GCN_fix_tree/leaves_cluster.tsv
- data/{disease}/{cohort}/abd.tsv (diseases include ACVD, CRC, asthma, carcinoma_surgery_history, STH, migraine, BD, IBD, T2D, hypertension, CFS, IGT, adenoma, schizofrenia)
- data/{disease}/{cohort}/metadata.tsv
output:
- result/large_scale_cohort/{disease}/{cohort}/eigenspecies/
  - Same files as the output of 04.script_Antibiotic/d.eigenspecies.ipynb

10.script_cohorts_eigenspecies/b.confounders.ipynb Correlation between eigenspecies and other phenotypes, adjusted for confounders.

input:
- a.eigenspecies.ipynb output
output:
- result/large_scale_cohort/{disease}/{cohort}/phenotype/
  - confounder.stats.tsv Confounder statistics
  - confounder.summary.tsv Summary of confounders
  - duplicate_variables.tsv Duplicate variables that were discarded
  - recommended_confounders.txt Recommended confounders used in the regression
  - eigenspecies_target_analysis.tsv Regression results
  - significant_associations.tsv Significant associations (FDR-adjusted p-value < 0.05)
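
For readers unfamiliar with FDR adjustment, the sketch below applies the Benjamini-Hochberg procedure and keeps associations with an adjusted p-value below 0.05. That the notebook uses Benjamini-Hochberg specifically is an assumption, and the p-values and names are toy values.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(p)
print([name for name, q in zip("ABCD", adj) if q < 0.05])
```

In this toy case only the first association remains significant after adjustment.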

11. Simulation (11.script_simulation/)

11.script_simulation/a.se_structure_simulation.ipynb Rearranges the edges with large weights so that they fall inside clusters, outside clusters, or at random, and compares the SE of the resulting networks.

input:
- data/CRC/{cohort}/abd.tsv
- data/CRC/{cohort}/metadata.tsv
- result/large_scale_cohort/p_all_cohorts_se.tsv
- result/GCN_fix_tree/renamed_GCN_tree.newick
output:
- result/validation/se_structure_simulation/CRC/se_p_values.tsv p-values of the comparison between the three situations
- result/validation/se_structure_simulation/CRC/se_mean_std.tsv Summary statistics of the SE values of the 100 experiments under the three situations
- result/validation/se_structure_simulation/CRC/se_summary.tsv SE values of the 100 experiments under the three situations
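
The intuition behind this simulation can be shown on a toy graph. The sketch below computes the standard two-level structural entropy (one layer of modules under the root) of a weighted graph and compares heavy edges placed inside clusters with heavy edges placed across clusters. It is not the repository's SE implementation, which works on the FR hierarchy and may differ; the graphs and names are hypothetical.

```python
from math import log2


def structural_entropy(edges, partition):
    """Two-level structural entropy of a weighted undirected graph.

    edges: {(u, v): weight}; partition: {node: module label}.
    """
    degree, volume, cut = {}, {}, {}
    for (u, v), w in edges.items():
        for node in (u, v):
            degree[node] = degree.get(node, 0.0) + w
            volume[partition[node]] = volume.get(partition[node], 0.0) + w
        if partition[u] != partition[v]:
            for node in (u, v):
                cut[partition[node]] = cut.get(partition[node], 0.0) + w
    total = sum(degree.values())
    node_term = -sum(d / total * log2(d / volume[partition[n]]) for n, d in degree.items())
    module_term = -sum(g / total * log2(volume[m] / total) for m, g in cut.items())
    return node_term + module_term


partition = {0: "a", 1: "a", 2: "b", 3: "b"}
heavy_inside = {(0, 1): 10, (2, 3): 10, (0, 2): 1, (1, 3): 1}
heavy_outside = {(0, 1): 1, (2, 3): 1, (0, 2): 10, (1, 3): 10}

print(structural_entropy(heavy_inside, partition))
print(structural_entropy(heavy_outside, partition))
```

Placing the heavy edges inside the clusters yields the lower entropy, which is the effect the simulation tests for.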

11.script_simulation/b.reduction_simulation.ipynb Permutation simulation of taxon abundance reduction.

input:
- ../data/NAFLD/abd.tsv
- ../data/NAFLD/NASH_forward_63_map.txt
- cancer_causal_threshold_80/ Pre-built causal inference matrix
output:
- result/perturbation_simulation/
  - simulation_params_root_seed_42.tsv Simulation parameters generated with seed = 42
  - simulation_results_root_seed_42_reduction_*.tsv Simulation results with reduction [0.05/0.1/0.15/0.2]
  - FR_boxplot_root_seed_42_all_reductions.png

Plot tool

Scripts under plot_tools/ are used to plot figures.

init_network.ipynb Initializes the FR network layout.
input:
- data/cMD.select_2008.species_phylum.tsv
output:
- plot_tool/sector_sp_layout.tsv Sector layout file for network plots

NAFLD_draw.ipynb Plots the networks of the NASH and healthy groups.
input:
- NAFLD/taxonomy.tsv
- plot_tool/NAFLD_layout.tsv
- result/NAFLD/cluster_*/keystone_node.tsv
- result/NAFLD/cluster_*/layer_0/fr.tsv
output:
- result/NAFLD/cluster_*/network.svg

procedure_draw_network.ipynb Plots the personalized FR network for the disease and healthy groups.
input:
- plot_tool/sector_sp_layout.tsv
- result/large_scale_cohort/{disease}/{cohort}/sp/cluster_*/keystone_node.tsv
- result/large_scale_cohort/{disease}/{cohort}/sp/cluster_*/layer_0/fr.tsv
output:
- result/large_scale_cohort/{disease}/{cohort}/sp/cluster_*/network.svg

pheno_distribution_se.ipynb Plots the SE distribution for the disease and healthy groups.
input:
- result/large_scale_cohort/p_all_cohorts_se.tsv
- result/large_scale_cohort/{disease}/{cohort}/SE/se_*.tsv
output:
- result/large_scale_cohort/{disease}/{cohort}/SE_distribution/cluster_*/{cohort}.svg

plot_keystone.ipynb.ipynb Plots the keystone results of the phenotype datasets.
input:
- result/GCN_fix_tree/leaves_cluster.tsv
- result/large_scale_cohort/{disease}/{cohort}/{cohort}.abundance.wilcox_testing.tsv
- result/large_scale_cohort/{disease}/{cohort}/sp/cluster_*/keystone_node.tsv
output:
- result/keystone/{cohort}.PR.svg

NSCLC_distribution_se.ipynb Plots the SE distribution for the response and non-response groups.
input:
- result/NSCLC/FRC_SE/Disc/se_*.tsv
- result/NSCLC/FRC_SE/p_detail.tsv
output:
- result/NSCLC/FRC_SE_distribution/{cluster}/Disc.svg

pheno_simulation_plot.ipynb Plots simulated and real p-values.
input:
- result/large_scale_cohort/p_all_cohorts_se.tsv
- result/large_scale_cohort/{disease}/{cohort}/SE/p_detail.tsv
- result/validation/phenotype_shuffle/{disease}/{cohort}/pvalues.tsv
output:
- result/validation/phenotype_shuffle/{disease}/{cluster}.svg

simu_se_strcture_plot.ipynb Plots simulated SE values.
input:
- result/validation/se_structure_simulation/CRC/se_summary.tsv
- result/validation/se_structure_simulation/CRC/se_p_values.tsv
output:
- result/validation/se_structure_simulation/CRC/se_summary_scatter.svg
- result/validation/se_structure_simulation/CRC/se_summary_boxplot.svg
