RecoverProject

🧬 RECOVER Project — Long COVID Genomic Analysis

ML analysis of metabolomic and proteomic data to predict Long COVID — conducted at Laboratoire Jacques Corbeil / François Laviolette, Université Laval.

Sami Bahig, MD MSc

🎯 Research Context

Long COVID remains poorly understood: risk factors, biomarkers, and treatment options are largely unknown. The only current medical consensus is on its definition: symptoms persisting more than 12 weeks after the initial infection.

This project applies machine learning to metabolomic and proteomic datasets to:

Identify molecular signatures of Long COVID
Predict Long COVID status from biological data
Assess the impact of initial symptoms (fever, fatigue, anosmia) on prediction accuracy

🗂️ Datasets & Analyses

Analysis 1 — Metabolomic (Multi-timepoint)

Multiple ML models benchmarked on metabolomic data
80/20 bootstrapping + cross-validation
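
The evaluation protocol above can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: it assumes "80/20 bootstrapping" means repeated stratified 80/20 train/test splits, and that cross-validation is used inside each training split to pick hyperparameters. The function name `repeated_split_accuracy`, the parameter grid, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit


def repeated_split_accuracy(X, y, n_repeats=10, seed=0):
    """Return (train_scores, test_scores) over repeated stratified 80/20 splits."""
    splitter = StratifiedShuffleSplit(
        n_splits=n_repeats, test_size=0.2, random_state=seed
    )
    train_scores, test_scores = [], []
    for train_idx, test_idx in splitter.split(X, y):
        # Inner 5-fold CV on the 80% training part selects hyperparameters,
        # so the 20% held-out part is never used for tuning.
        search = GridSearchCV(
            RandomForestClassifier(n_estimators=50, random_state=seed),
            param_grid={"max_depth": [2, None]},  # hypothetical grid
            cv=5,
        )
        search.fit(X[train_idx], y[train_idx])
        train_scores.append(search.score(X[train_idx], y[train_idx]))
        test_scores.append(search.score(X[test_idx], y[test_idx]))
    return np.array(train_scores), np.array(test_scores)


# Synthetic stand-in for a 100-patient omics matrix (NOT the project's data).
X, y = make_classification(
    n_samples=100, n_features=30, n_informative=5, random_state=0
)
train, test = repeated_split_accuracy(X, y, n_repeats=5)
print(f"train {train.mean():.1%} ±{train.std():.1%}")
print(f"test  {test.mean():.1%} ±{test.std():.1%}")
```

Reporting the mean and spread over the repeats gives figures in the same "accuracy ± variation" form as the results tables, and keeping the test split out of the tuning loop is what makes the train/test comparison meaningful.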

Analysis 2 — Metabolomic (D0 — 100 patients)

Timepoint D0: 100 patients
Focus on early-stage biomarker prediction

Datasets analyzed

Dataset	Description
Metabolomic	Metabolite profiles — Long COVID vs control
Proteomic	Protein expression profiles
Proteomic.cyst	Cyst-specific proteomic subset
Proteomic.Merge	Combined proteomic dataset

📊 Results

Full Benchmark — 7 Models × 4 Datasets (80/20 Bootstrap + CV)

Values are accuracy (%) per model. Only three of the four datasets appear in the table; no Proteomic.Merge rows are reported.

Dataset	Decision Tree	Random Forest	Random SCM	Ridge Classifier	SVM (Poly)	PLS-DA	SVM (RBF)
Proteomic — Train	90.9% ±7.4	88.2% ±8.5	91.2% ±14.6	99.9% ±0.4	86.5% ±14.0	58.0% ±2.6	67.5% ±22.1
Proteomic — Test	50.3% ±10.9	55.1% ±10.0	51.6% ±9.4	52.8% ±11.1	49.3% ±10.3	50.8% ±8.8	43.4% ±7.8
Metabolomic — Train	100% ±0.0	100% ±0.2	100% ±0.0	100% ±0.0	99.2% ±1.6	66.0% ±2.6	99.6% ±0.9
Metabolomic — Test	97.7% ±3.7	99.0% ±2.6	97.3% ±3.5	67.3% ±10.6	99.2% ±1.7	65.8% ±11.6	72.8% ±9.0
Proteomic.cyst — Train	87.1% ±8.1	93.8% ±8.1	76.1% ±10.0	88.3% ±2.9	85.3% ±7.3	53.9% ±2.8	61.8% ±19.1
Proteomic.cyst — Test	48.9% ±8.7	53.1% ±9.5	51.0% ±10.2	58.7% ±10.7	58.1% ±10.5	50.0% ±9.6	42.9% ±5.4

Effect of Initial Symptoms on Accuracy (Random Forest)

Symptom	Train Accuracy	Test Accuracy
Fever	89.8% ±5.3	64.0% ±11.9
Fatigue	90.7% ±4.5	63.3% ±12.3
Anosmia	90.1% ±4.6	63.8% ±12.1
Fever + Fatigue + Anosmia	90.9% ±3.5	63.6% ±12.7

Key Insight

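Metabolomic data clearly outperforms proteomic data for Long COVID prediction. Random Forest and SVM (Poly) reach about 99% test accuracy on metabolomic data (99.0% ±2.6 and 99.2% ±1.7), whereas Ridge Classifier, PLS-DA and SVM (RBF) stay between 65% and 73%. Proteomic models fit the training set but do not generalize: test accuracy ranges from 43% to 59%, close to the 50% expected by chance in a balanced two-class problem. This suggests metabolomic signatures are stronger biomarkers of Long COVID status than the proteomic profiles analyzed here.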
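
The gap between training and test accuracy makes this contrast easy to quantify. The snippet below is a small sketch that copies the accuracies from the benchmark table (the ± values are omitted) and computes, per dataset, the mean test accuracy across models and the largest train/test gap. The helper names `generalization_gap` and `mean_test_accuracy` are illustrative, and averaging across models is only a rough summary.

```python
# Accuracy (%) copied from the benchmark table: model -> (train, test).
proteomic = {
    "Decision Tree": (90.9, 50.3),
    "Random Forest": (88.2, 55.1),
    "Random SCM": (91.2, 51.6),
    "Ridge Classifier": (99.9, 52.8),
    "SVM (Poly)": (86.5, 49.3),
    "PLS-DA": (58.0, 50.8),
    "SVM (RBF)": (67.5, 43.4),
}
metabolomic = {
    "Decision Tree": (100.0, 97.7),
    "Random Forest": (100.0, 99.0),
    "Random SCM": (100.0, 97.3),
    "Ridge Classifier": (100.0, 67.3),
    "SVM (Poly)": (99.2, 99.2),
    "PLS-DA": (66.0, 65.8),
    "SVM (RBF)": (99.6, 72.8),
}


def generalization_gap(scores):
    """Train minus test accuracy, in percentage points, per model."""
    return {model: round(tr - te, 1) for model, (tr, te) in scores.items()}


def mean_test_accuracy(scores):
    """Unweighted mean of the test accuracies across models."""
    return round(sum(te for _, te in scores.values()) / len(scores), 1)


for name, scores in (("Proteomic", proteomic), ("Metabolomic", metabolomic)):
    gaps = generalization_gap(scores)
    worst = max(gaps, key=gaps.get)  # model with the largest train/test gap
    print(
        f"{name}: mean test {mean_test_accuracy(scores)}%, "
        f"largest gap {worst} ({gaps[worst]} points)"
    )
```

A large gap with near-chance test accuracy, as on the proteomic data, is the usual signature of overfitting; a small gap with low accuracy on both sides, as for PLS-DA, points to underfitting instead.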

🛠️ Models Benchmarked

Model	Notes
Decision Tree	Baseline interpretable model
Random Forest	Among the best on metabolomic data (99.0% test accuracy)
Random SCM	Set Covering Machine — rule-based
Ridge Classifier	Linear baseline
SVM (Poly + RBF)	Polynomial kernel strong on metabolomic data (99.2% test); RBF kernel weaker (72.8% test)
PLS-DA	Partial Least Squares Discriminant Analysis
ElasticNet	Regularized linear model (no results in the benchmark table)

🗂️ Repository Structure

RecoverProject/
│
├── 📓 Copy_of_PLS_DA_Metabolomique.ipynb
├── 📓 DecisionTreeMetabolomic*.ipynb
├── 📓 DecisionTreeProteomic.ipynb
├── 📓 ElasticNet.ipynb
├── 📓 MergingDataset.ipynb
├── 📓 MergingDataset_Visualisation.ipynb
├── 📓 PLS_DA_Metabolomique.ipynb
├── 📓 RandomForest_Metabolomic.ipynb
├── 📓 RandomForest_Proteomic*.ipynb
├── 📓 RandomSCM*.ipynb
├── 📓 RidgeClassifier*.ipynb
├── 📓 svm*.ipynb
├── 📓 Recover_CleaningData_code.ipynb  ← Data preprocessing
└── 📖 README.md

🔮 Next Steps (as of presentation)

AlphaFold 2 integration for protein structure analysis
Graph Neural Networks for molecular interaction modeling
Biobank data integration — larger cohort incoming

👤 Author

Sami Bahig, MD MSc, Data Scientist & AI Engineer
Laboratoire Jacques Corbeil / François Laviolette, Université Laval

🔗 Related Projects

Project	Description	Link
FAERS 2025	FDA pharmacovigilance — 28M adverse event records	GitHub
Protocol TDM	OCR + CamemBERT — Radiology protocol classification	GitHub
RAG Chatbot	AI chatbot grounded in documents — LangChain · ChromaDB	GitHub

MIT License · Sami Bahig · Université Laval · 2022
