Scripts for Reproducing MetaTCR Paper Results

This directory contains the scripts used to generate the figures and results for the paper "MetaTCR: A Framework for Analyzing Batch Effects in TCR Repertoire Datasets".

The core MetaTCR Python package and its usage instructions are available at https://github.com/deepomicslab/MetaTCR/tree/main.

Directory Organization

The scripts are organized by the section of the paper they correspond to. Below is a description of each directory and the experiments it contains.

1. Analysis of Pervasive Batch Effects (dataset_feature_stats/, cdr3_analysis/)

Corresponds to Figure 1

These scripts quantify the baseline technical divergence between datasets.

  • dataset_feature_stats/:
    • Shannon Entropy: Calculates and compares repertoire diversity across cohorts (healthy_datasets_shannon.py, melanoma_datasets_shannon.py).
    • k-mer Distribution: Performs PCA on CDR3 k-mer counts to visualize batch-driven clustering (*_kmer_distribution2x2.py).
  • cdr3_analysis/:
    • Clonotype Overlap: Analyzes the fraction of shared clonotypes within vs. across studies to demonstrate sparsity (plot_shared_cdr3_by_datasets_heatmap.py).
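
As a rough illustration of two of the statistics named above, the following sketch computes Shannon entropy from clonotype counts and a shared-clonotype fraction between two repertoires. It is not taken from the repository scripts: the function names, the toy CDR3 sequences, and the choice of the Jaccard index as the "fraction of shared clonotypes" are all assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(clone_counts):
    """Shannon entropy (in bits) of a repertoire, given per-clonotype counts."""
    counts = [c for c in clone_counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


def shared_fraction(cdr3_a, cdr3_b):
    """Jaccard index of two CDR3 sets (assumed overlap definition)."""
    a, b = set(cdr3_a), set(cdr3_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy repertoires with made-up CDR3 sequences (hypothetical data).
rep_a = ["CASSLGQYF", "CASSLGQYF", "CASSPDRGYF", "CASRTGELFF"]
rep_b = ["CASSLGQYF", "CASSQETQYF"]

entropy_a = shannon_entropy(Counter(rep_a).values())
overlap_ab = shared_fraction(rep_a, rep_b)
print(entropy_a, overlap_ab)
```

Averaging such an overlap value within and across studies would give the two groups that a within-vs-across heatmap compares.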

2. MetaTCR Framework Optimization (functional_cluster_num/)

Scripts for constructing the "Referenced TCR Space" and determining the optimal granularity.

  • determin_optimal_cluster_num_by_antigen.py: Benchmarks different cluster numbers ($k$) using antigen specificity and epitope purity metrics.
  • spectral_robustness_k96.py: Evaluates the stability and robustness of the chosen $k=96$ functional clusters.
  • function_cluster_by_spectral.py: Implementation of the spectral clustering step for reference construction.
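
To make the idea of an epitope purity metric concrete, here is a minimal sketch that scores one clustering by the size-weighted share of TCRs carrying their cluster's majority epitope. This is one common definition of purity and is assumed here; the exact formula used in determin_optimal_cluster_num_by_antigen.py may differ. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def epitope_purity(cluster_ids, epitopes):
    """Size-weighted purity: share of TCRs matching their cluster's majority epitope."""
    by_cluster = defaultdict(list)
    for cid, epitope in zip(cluster_ids, epitopes):
        by_cluster[cid].append(epitope)
    # For each cluster, count the members that carry the most frequent epitope.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(epitopes)


# Toy assignment of six epitope-labelled TCRs to three clusters (hypothetical data).
clusters = [0, 0, 0, 1, 1, 2]
labels = ["GILGFVFTL", "GILGFVFTL", "NLVPMVATV",
          "NLVPMVATV", "NLVPMVATV", "GLCTLVAML"]

purity = epitope_purity(clusters, labels)
print(purity)
```

Purity defined this way generally rises as $k$ grows, so a benchmark over several values of $k$ would normally read it alongside a second criterion rather than maximize it alone.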

3. Meta-vector Profiling Evaluation (metavec_evaluation/)

Corresponds to Figure 3

Validates that the Meta-vector representation retains biological signals while exposing technical noise.

  • biological_rep/: Analysis of the Genolet2023 dataset to quantify technical noise between biological replicates across batches/platforms (Genolet2023_data_analysis_abundance.py).
  • robustness_Sherwood2015/: Longitudinal analysis of the Sherwood2015 dataset to demonstrate intra-individual stability of meta-vectors (meta_vec_individual_robunstness.py).
  • feature_compare/: Comparison of MetaTCR against other feature encoding methods (e.g., k-mers, V/J usage).
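
The sketch below illustrates how intra-individual stability could be measured. It assumes that a meta-vector is the fraction of a repertoire's TCRs falling into each reference cluster, and that stability is read off as the cosine similarity between two time points. Both are assumptions made for this example, not definitions taken from the MetaTCR package; the function names and cluster assignments are hypothetical.

```python
import math


def meta_vector(cluster_ids, k):
    """Fraction of TCRs assigned to each of k clusters (assumed definition)."""
    vec = [0.0] * k
    for cid in cluster_ids:
        vec[cid] += 1.0
    total = len(cluster_ids)
    return [v / total for v in vec]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Cluster assignments of one individual's TCRs at two time points (toy data).
t0 = meta_vector([0, 0, 1, 2], k=4)
t1 = meta_vector([0, 0, 1, 3], k=4)

stability = cosine_similarity(t0, t1)
print(stability)
```

Comparing this within-individual similarity against the similarity between different individuals is one simple way to express stability as a single contrast.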

4. Benchmarking Batch Dissimilarity Metrics (metric_benchmarking/)

Corresponds to Figure 4

A systematic evaluation of metrics (kBET, JSD, iLISI, MMD) to identify the best tool for quantifying batch effects.

  • scenario1_simu_by_three_methods.py: Task 1 (Sensitivity) - Tests metric correlation with simulated batch effect magnitudes.
  • scenario2_data_sampling_robustness.py: Task 2 (Stability) - Evaluates metric robustness to sampling size and stochasticity.
  • scenario3_dataset_pairs_classification_auc_multi_class.py: Task 3 (Discrimination) - Assesses ability to distinguish real-world technical batches from biological variation.
  • benchmark_metrics_by_ranking_scaled.py: Generates the final ranking of metrics (Fig 4e).
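
Of the metrics listed, JSD is the simplest to write down. The following is a minimal sketch of the Jensen-Shannon divergence between two discrete distributions, using base-2 logarithms so the value lies in [0, 1]. How the benchmarking scripts turn two batches into distributions is not specified here, so the inputs are plain toy probability vectors and the function name is hypothetical.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    # Mixture distribution; it is positive wherever p or q is positive.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


batch_a = [0.5, 0.25, 0.25, 0.0]   # toy distribution over four clusters
batch_b = [0.5, 0.25, 0.0, 0.25]

print(jsd(batch_a, batch_a))       # identical distributions
print(jsd([1.0, 0.0], [0.0, 1.0]))  # disjoint distributions
print(jsd(batch_a, batch_b))
```

Because the mixture is used as the reference, the divergence stays finite even when one batch has empty clusters, which matters for sparse inputs.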

5. Integration Methods Benchmarking (integration_benchmarking/)

Corresponds to Figure 5

Evaluates integration algorithms (Covariance Matching, Harmony, MNN, Scanorama) on MetaTCR data.

  • 2.simu_celltype_test_integration_tools.py: Task 4 (Archetype Simulation) - Tests preservation of biological structure (Bio-Silhouette) vs. batch mixing (kBET) in simulated repertoires.
  • 3.integration_same_label.py: Task 5 (Real-world Integration) - Integration of healthy/melanoma cohorts from different studies.
  • 1.domain_shift_on_cmv_clf_with_val_baseline.py: Task 6 (Generalizability) - Cross-study transfer learning (CMV serostatus prediction) to measure improvement in model performance after integration.
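
The trade-off between preserving biological structure and mixing batches can be sketched with a silhouette score. The example assumes that Bio-Silhouette means the mean silhouette coefficient computed with biological labels; the same function is applied to batch labels purely as a contrast and does not stand in for kBET. The function name, points, and labels are hypothetical.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 labels)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from point i to every other point, grouped by label.
        by_label = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_label.setdefault(other, []).append(math.dist(p, q))
        if own not in by_label:
            scores.append(0.0)  # singleton group: score defined as 0
            continue
        a = sum(by_label[own]) / len(by_label[own])
        b = min(sum(d) / len(d) for lab, d in by_label.items() if lab != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy integrated embedding: biology separates along x, batches are interleaved.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
bio = ["healthy", "healthy", "melanoma", "melanoma"]
batch = ["A", "B", "A", "B"]

bio_sil = silhouette(points, bio)
batch_sil = silhouette(points, batch)
print(bio_sil, batch_sil)
```

In this toy embedding the biological silhouette is high and the batch silhouette is negative, which is the pattern one would hope to see after a successful integration.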

6. Gastric Cancer Case Study (case_study/)

Corresponds to Figure 6

Application of the framework to the Wang2022 dataset to detect and correct latent batch effects.

  • wang2022_data_segmentation.py: Unsupervised segmentation algorithm to discover hidden technical subgroups (latent batches).
  • wang2022_group_comparison.py: Comparison of the identified latent batches.
  • Vgene/:
    • case_vdj_fold_change_Wang2022_meta.py: Differential V/J gene usage analysis before and after batch correction (Fig 6c-f).
    • case_vdj_distribution_Wang2022_sample_level.py: Visualization of specific gene distributions.
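
A differential gene usage comparison can be summarized per gene as a log2 fold change between two groups of samples. The sketch below assumes each sample is a mapping from gene name to usage frequency and adds a small pseudocount to avoid division by zero. The function names, the pseudocount value, and the usage numbers are illustrative assumptions, not the procedure in case_vdj_fold_change_Wang2022_meta.py.

```python
import math


def mean_usage(samples, gene):
    """Mean usage frequency of one gene across samples (missing gene counts as 0)."""
    return sum(s.get(gene, 0.0) for s in samples) / len(samples)


def log2_fold_change(group_a, group_b, gene, pseudo=1e-6):
    """log2 ratio of mean usage in group_a over group_b, with a pseudocount."""
    return math.log2(
        (mean_usage(group_a, gene) + pseudo) / (mean_usage(group_b, gene) + pseudo)
    )


# Toy per-sample V gene usage frequencies (hypothetical values).
group_a = [{"TRBV9": 0.08, "TRBV20-1": 0.10}, {"TRBV9": 0.08, "TRBV20-1": 0.12}]
group_b = [{"TRBV9": 0.02, "TRBV20-1": 0.11}, {"TRBV9": 0.02, "TRBV20-1": 0.11}]

lfc_trbv9 = log2_fold_change(group_a, group_b, "TRBV9")
lfc_trbv20 = log2_fold_change(group_a, group_b, "TRBV20-1")
print(lfc_trbv9, lfc_trbv20)
```

Running the same comparison before and after batch correction shows whether an apparent difference between groups shrinks once the latent batch is accounted for.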

7. Novel TCR Analysis (find_outlier_tcrs/)

Corresponds to the Discussion section

Exploratory analysis of TCRs that do not fit into the static reference clusters ("Novel TCRs").

  • novel_groups_emerson2017_new.py: Identification of novel TCR groups in the Emerson2017 dataset.
  • outline_melanoma_dataset_distribution.py: Distribution of outlier TCRs across melanoma datasets.
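
One simple way to flag TCRs that do not fit a fixed set of reference clusters is to threshold the distance to the nearest cluster centroid. The sketch below shows that generic technique only. It assumes TCRs are already embedded as numeric vectors, and the centroids, threshold, and function names are hypothetical; the repository scripts may define novelty differently.

```python
import math


def nearest_centroid_distance(vec, centroids):
    """Euclidean distance from one embedding to its closest centroid."""
    return min(math.dist(vec, c) for c in centroids)


def flag_novel(embeddings, centroids, threshold):
    """True for each embedding farther than `threshold` from every centroid."""
    return [nearest_centroid_distance(e, centroids) > threshold for e in embeddings]


# Toy 2-D embeddings and reference centroids (hypothetical values).
centroids = [(0.0, 0.0), (10.0, 10.0)]
embeddings = [(0.5, 0.5), (9.0, 10.0), (5.0, 5.0)]

novel_mask = flag_novel(embeddings, centroids, threshold=2.0)
print(novel_mask)
```

The threshold controls how conservative the flagging is, so in practice it would need to be calibrated against the spread of distances seen inside the reference clusters.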