samibahig/RecoverProject


RecoverProject

🧬 RECOVER Project — Long COVID Genomic Analysis


ML analysis of metabolomic and proteomic data to predict Long COVID — conducted at Laboratoire Jacques Corbeil / François Laviolette, Université Laval.

Sami Bahig, MD MSc


🎯 Research Context

Long COVID remains poorly understood: its risk factors, biomarkers, and treatment options are largely unknown. The only current medical consensus is the definition itself, namely symptoms that persist more than 12 weeks after the initial infection.

This project applies machine learning to metabolomic and proteomic datasets to:

  • Identify molecular signatures of Long COVID
  • Predict Long COVID status from biological data
  • Assess the impact of initial symptoms (fever, fatigue, anosmia) on prediction accuracy

🗂️ Datasets & Analyses

Analysis 1 — Metabolomic (Multi-timepoint)

  • Multiple ML models benchmarked on metabolomic data
  • 80/20 bootstrapping + cross-validation
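The evaluation scheme named above can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: it assumes "80/20 bootstrapping" means repeated random stratified 80/20 train/test splits (resampling without replacement, rather than a classical bootstrap) and that cross-validation is run on the training portion. The function name `repeated_split_accuracy`, its parameters, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score


def repeated_split_accuracy(model, X, y, n_repeats=10, test_size=0.2,
                            cv_folds=5, seed=0):
    """Return (mean, std) accuracy for train, inner CV, and held-out test.

    Assumption: each repeat draws a fresh stratified 80/20 split.
    """
    splitter = StratifiedShuffleSplit(
        n_splits=n_repeats, test_size=test_size, random_state=seed
    )
    train_acc, cv_acc, test_acc = [], [], []
    for train_idx, test_idx in splitter.split(X, y):
        X_train, y_train = X[train_idx], y[train_idx]
        X_test, y_test = X[test_idx], y[test_idx]
        # Cross-validation uses the training portion only.
        cv_acc.append(
            cross_val_score(clone(model), X_train, y_train, cv=cv_folds).mean()
        )
        fitted = clone(model).fit(X_train, y_train)
        train_acc.append(fitted.score(X_train, y_train))
        test_acc.append(fitted.score(X_test, y_test))
    return {
        name: (float(np.mean(values)), float(np.std(values)))
        for name, values in (("train", train_acc), ("cv", cv_acc), ("test", test_acc))
    }


# Synthetic stand-in for a metabolomic matrix (100 patients, 50 features).
X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=0)
summary = repeated_split_accuracy(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y
)
for split, (mean, std) in summary.items():
    print(f"{split}: {100 * mean:.1f}% ±{100 * std:.1f}")
```

Reporting the mean together with the spread over repeats gives figures in the same "accuracy ± variation" form used in the results tables.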

Analysis 2 — Metabolomic (D0 — 100 patients)

  • Timepoint D0: 100 patients
  • Focus on early-stage biomarker prediction

Datasets analyzed

| Dataset | Description |
|---|---|
| Metabolomic | Metabolite profiles, Long COVID vs. control |
| Proteomic | Protein expression profiles |
| Proteomic.cyst | Cyst-specific proteomic subset |
| Proteomic.Merge | Combined proteomic dataset |

📊 Results

Full Benchmark — 7 Models × 4 Datasets (80/20 Bootstrap + CV)

| Dataset | Split | Decision Tree | Random Forest | Random SCM | Ridge Classifier | SVM (Poly) | PLS-DA | SVM (RBF) |
|---|---|---|---|---|---|---|---|---|
| Proteomic | Train | 90.9% ±7.4 | 88.2% ±8.5 | 91.2% ±14.6 | 99.9% ±0.4 | 86.5% ±14.0 | 58.0% ±2.6 | 67.5% ±22.1 |
| Proteomic | Test | 50.3% ±10.9 | 55.1% ±10.0 | 51.6% ±9.4 | 52.8% ±11.1 | 49.3% ±10.3 | 50.8% ±8.8 | 43.4% ±7.8 |
| Metabolomic | Train | 100% ±0.0 | 100% ±0.2 | 100% ±0.0 | 100% ±0.0 | 99.2% ±1.6 | 66.0% ±2.6 | 99.6% ±0.9 |
| Metabolomic | Test | 97.7% ±3.7 | 99.0% ±2.6 | 97.3% ±3.5 | 67.3% ±10.6 | 99.2% ±1.7 | 65.8% ±11.6 | 72.8% ±9.0 |
| Proteomic.cyst | Train | 87.1% ±8.1 | 93.8% ±8.1 | 76.1% ±10.0 | 88.3% ±2.9 | 85.3% ±7.3 | 53.9% ±2.8 | 61.8% ±19.1 |
| Proteomic.cyst | Test | 48.9% ±8.7 | 53.1% ±9.5 | 51.0% ±10.2 | 58.7% ±10.7 | 58.1% ±10.5 | 50.0% ±9.6 | 42.9% ±5.4 |

Values are accuracy (%). Results for the Proteomic.Merge dataset are not listed in this table.
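A benchmark of this shape can be produced with a loop over models, as in the rough sketch below. It runs on synthetic data and covers only the five models that exist directly in scikit-learn; Random SCM and PLS-DA are left out. The hyperparameters, the scaling step, and the helper name `benchmark` are illustrative assumptions, not the settings used in the notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical model settings; the notebooks may use different ones.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Ridge Classifier": make_pipeline(StandardScaler(), RidgeClassifier()),
    "SVM (Poly)": make_pipeline(StandardScaler(), SVC(kernel="poly")),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}


def benchmark(models, X, y, n_repeats=10, seed=0):
    """Train/test accuracy (mean, std in %) over repeated 80/20 splits."""
    cv = StratifiedShuffleSplit(n_splits=n_repeats, test_size=0.2,
                                random_state=seed)
    rows = {}
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring="accuracy",
                                return_train_score=True)
        rows[name] = {
            split: (100 * scores[f"{split}_score"].mean(),
                    100 * scores[f"{split}_score"].std())
            for split in ("train", "test")
        }
    return rows


# Synthetic stand-in for one omics dataset.
X, y = make_classification(n_samples=100, n_features=40, n_informative=5,
                           random_state=0)
results = benchmark(models, X, y)
for name, row in results.items():
    print(f"{name}: train {row['train'][0]:.1f}% ±{row['train'][1]:.1f}, "
          f"test {row['test'][0]:.1f}% ±{row['test'][1]:.1f}")
```

Keeping train and test scores side by side makes a generalization gap, such as the one seen on the proteomic datasets, visible at a glance.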

Effect of Initial Symptoms on Accuracy (Random Forest)

| Symptom | Train Accuracy | Test Accuracy |
|---|---|---|
| Fever | 89.8% ±5.3 | 64.0% ±11.9 |
| Fatigue | 90.7% ±4.5 | 63.3% ±12.3 |
| Anosmia | 90.1% ±4.6 | 63.8% ±12.1 |
| Fever + Fatigue + Anosmia | 90.9% ±3.5 | 63.6% ±12.7 |
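One way to run such a comparison is sketched below. The README does not say how symptoms enter the model, so this sketch assumes each symptom is a binary column appended to the omics feature matrix. The data are random placeholders, and the helper `with_symptoms` and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate

rng = np.random.default_rng(0)
n_patients = 100

# Placeholder data: random omics features, balanced labels, binary symptoms.
omics = rng.normal(size=(n_patients, 30))
labels = np.array([0, 1] * (n_patients // 2))
symptoms = {
    name: rng.integers(0, 2, size=n_patients)
    for name in ("fever", "fatigue", "anosmia")
}


def with_symptoms(omics, symptoms, names):
    """Append the selected binary symptom columns to the omics matrix."""
    extra = [symptoms[name].reshape(-1, 1) for name in names]
    return np.hstack([omics, *extra])


feature_sets = {
    "Fever": ["fever"],
    "Fatigue": ["fatigue"],
    "Anosmia": ["anosmia"],
    "Fever + Fatigue + Anosmia": ["fever", "fatigue", "anosmia"],
}

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
symptom_results = {}
for label, names in feature_sets.items():
    X = with_symptoms(omics, symptoms, names)
    scores = cross_validate(
        RandomForestClassifier(n_estimators=50, random_state=0),
        X, labels, cv=cv, return_train_score=True,
    )
    symptom_results[label] = (scores["train_score"].mean(),
                              scores["test_score"].mean())
    print(f"{label}: train {100 * symptom_results[label][0]:.1f}%, "
          f"test {100 * symptom_results[label][1]:.1f}%")
```

Because the placeholder labels are unrelated to the features, test accuracy here is expected to hover around chance; the point is only the structure of the comparison.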

Key Insight

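Metabolomic data clearly outperform proteomic data for Long COVID prediction. On metabolomic data, Random Forest and SVM (Poly) reach about 99% test accuracy (99.0% and 99.2%), with Decision Tree and Random SCM close behind (97.7% and 97.3%). Models trained on the Proteomic and Proteomic.cyst datasets do not generalize: test accuracy stays between 42.9% and 58.7% even where train accuracy reaches 99.9%, a gap that points to overfitting. This suggests that metabolomic signatures are stronger candidate biomarkers of Long COVID status than proteomic profiles.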


🛠️ Models Benchmarked

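| Model | Notes |
|---|---|
| Decision Tree | Baseline interpretable model |
| Random Forest | Among the best on metabolomic data (99.0% test accuracy) |
| Random SCM | Set Covering Machine, rule-based |
| Ridge Classifier | Linear baseline |
| SVM (Poly + RBF) | Polynomial kernel strong on metabolomic data (99.2% test accuracy); RBF kernel weaker (72.8%) |
| PLS-DA | Partial Least Squares Discriminant Analysis |
| ElasticNet | Regularized linear model (notebook in the repository; not part of the benchmark table) |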

🗂️ Repository Structure

RecoverProject/
│
├── 📓 Copy_of_PLS_DA_Metabolomique.ipynb
├── 📓 DecisionTreeMetabolomic*.ipynb
├── 📓 DecisionTreeProteomic.ipynb
├── 📓 ElasticNet.ipynb
├── 📓 MergingDataset.ipynb
├── 📓 MergingDataset_Visualisation.ipynb
├── 📓 PLS_DA_Metabolomique.ipynb
├── 📓 RandomForest_Metabolomic.ipynb
├── 📓 RandomForest_Proteomic*.ipynb
├── 📓 RandomSCM*.ipynb
├── 📓 RidgeClassifier*.ipynb
├── 📓 svm*.ipynb
├── 📓 Recover_CleaningData_code.ipynb  ← Data preprocessing
└── 📖 README.md

🔮 Next Steps (as of presentation)

  • AlphaFold 2 integration for protein structure analysis
  • Graph Neural Networks for molecular interaction modeling
  • Biobank data integration — larger cohort incoming

👤 Author

Sami Bahig, MD MSc, Data Scientist & AI Engineer
Laboratoire Jacques Corbeil / François Laviolette, Université Laval

LinkedIn GitHub


🔗 Related Projects

| Project | Description | Link |
|---|---|---|
| FAERS 2025 | FDA pharmacovigilance, 28M adverse event records | GitHub |
| Protocol TDM | OCR + CamemBERT for radiology protocol classification | GitHub |
| RAG Chatbot | AI chatbot grounded in documents (LangChain · ChromaDB) | GitHub |

MIT License · Sami Bahig · Université Laval · 2022
