ML analysis of metabolomic and proteomic data to predict Long COVID, conducted at the Laboratoire Jacques Corbeil / François Laviolette, Université Laval.
Sami Bahig, MD MSc
Long COVID remains poorly understood: its risk factors, biomarkers, and treatment options are largely unknown. The only current medical consensus is definitional: symptoms that persist more than 12 weeks after the initial infection.
This project applies machine learning to metabolomic and proteomic datasets to:
- Identify molecular signatures of Long COVID
- Predict Long COVID status from biological data
- Assess the impact of initial symptoms (fever, fatigue, anosmia) on prediction accuracy

Approach:
- Multiple ML models benchmarked on metabolomic and proteomic data
- Evaluation on 80/20 train/test splits with bootstrapping, plus cross-validation
- Timepoint D0: 100 patients
- Focus on early-stage biomarker prediction
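
The evaluation protocol is only summarized as "80/20 bootstrapping + cross-validation", so the following is a minimal sketch of one plausible reading: repeated stratified 80/20 train/test splits, reporting mean and standard deviation of accuracy. It is not the project's actual code. The synthetic data from `make_classification`, the helper name `repeated_split_accuracy`, the number of repeats, and the Random Forest settings are all assumptions made for illustration, and the cross-validation step (for example, for hyperparameter tuning) is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit


def repeated_split_accuracy(model, X, y, n_repeats=20, test_size=0.2, seed=0):
    """Fit on repeated stratified splits; return mean/std of train and test accuracy."""
    splitter = StratifiedShuffleSplit(
        n_splits=n_repeats, test_size=test_size, random_state=seed
    )
    train_scores, test_scores = [], []
    for train_idx, test_idx in splitter.split(X, y):
        # fit() retrains the estimator from scratch on each split
        model.fit(X[train_idx], y[train_idx])
        train_scores.append(model.score(X[train_idx], y[train_idx]))
        test_scores.append(model.score(X[test_idx], y[test_idx]))
    return (
        float(np.mean(train_scores)),
        float(np.std(train_scores)),
        float(np.mean(test_scores)),
        float(np.std(test_scores)),
    )


# Synthetic stand-in for the D0 cohort: 100 samples, 50 hypothetical features.
X, y = make_classification(
    n_samples=100, n_features=50, n_informative=5, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
tr_mean, tr_std, te_mean, te_std = repeated_split_accuracy(model, X, y)
print(f"train: {tr_mean:.1%} ±{100 * tr_std:.1f} | test: {te_mean:.1%} ±{100 * te_std:.1f}")
```

Reporting train and test accuracy side by side, as the results tables do, makes overfitting visible: a model that scores near 100% on the training folds but much lower on held-out data has memorized rather than generalized.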
| Dataset | Description |
|---|---|
| Metabolomic | Metabolite profiles — Long COVID vs control |
| Proteomic | Protein expression profiles |
| Proteomic.cyst | Cyst-specific proteomic subset |
| Proteomic.Merge | Combined proteomic dataset |
| Dataset | Decision Tree | Random Forest | Random SCM | Ridge Classifier | SVM (Poly) | PLS-DA | SVM (rbf) |
|---|---|---|---|---|---|---|---|
| Proteomic — Train | 90.9% ±7.4 | 88.2% ±8.5 | 91.2% ±14.6 | 99.9% ±0.4 | 86.5% ±14.0 | 58.0% ±2.6 | 67.5% ±22.1 |
| Proteomic — Test | 50.3% ±10.9 | 55.1% ±10.0 | 51.6% ±9.4 | 52.8% ±11.1 | 49.3% ±10.3 | 50.8% ±8.8 | 43.4% ±7.8 |
| Metabolomic — Train | 100% ±0.0 | 100% ±0.2 | 100% ±0.0 | 100% ±0.0 | 99.2% ±1.6 | 66.0% ±2.6 | 99.6% ±0.9 |
| Metabolomic — Test | 97.7% ±3.7 | 99.0% ±2.6 | 97.3% ±3.5 | 67.3% ±10.6 | 99.2% ±1.7 | 65.8% ±11.6 | 72.8% ±9.0 |
| Proteomic.cyst — Train | 87.1% ±8.1 | 93.8% ±8.1 | 76.1% ±10.0 | 88.3% ±2.9 | 85.3% ±7.3 | 53.9% ±2.8 | 61.8% ±19.1 |
| Proteomic.cyst — Test | 48.9% ±8.7 | 53.1% ±9.5 | 51.0% ±10.2 | 58.7% ±10.7 | 58.1% ±10.5 | 50.0% ±9.6 | 42.9% ±5.4 |
| Symptom | Train Accuracy | Test Accuracy |
|---|---|---|
| Fever | 89.8% ±5.3 | 64.0% ±11.9 |
| Fatigue | 90.7% ±4.5 | 63.3% ±12.3 |
| Anosmia | 90.1% ±4.6 | 63.8% ±12.1 |
| Fever + Fatigue + Anosmia | 90.9% ±3.5 | 63.6% ±12.7 |
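
The README does not say how the initial symptoms entered the models. One plausible setup, sketched below under that assumption, appends binary symptom indicators to the molecular features and compares accuracy across symptom combinations. Everything here is hypothetical: the random data, the `mean_accuracy` helper, the feature counts, and the choice of Random Forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
n_patients = 100

# Hypothetical data: binary Long COVID label, 20 molecular features with a
# weak signal in the first column, and three random binary symptom flags.
y = rng.integers(0, 2, size=n_patients)
omics = rng.normal(size=(n_patients, 20))
omics[:, 0] += y
symptoms = {
    name: rng.integers(0, 2, size=n_patients)
    for name in ("fever", "fatigue", "anosmia")
}


def mean_accuracy(X, y, n_repeats=10, seed=0):
    """Mean train and test accuracy over repeated stratified 80/20 splits."""
    splitter = StratifiedShuffleSplit(
        n_splits=n_repeats, test_size=0.2, random_state=seed
    )
    train_scores, test_scores = [], []
    for train_idx, test_idx in splitter.split(X, y):
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        train_scores.append(clf.score(X[train_idx], y[train_idx]))
        test_scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(train_scores)), float(np.mean(test_scores))


feature_sets = {
    "Fever": ["fever"],
    "Fatigue": ["fatigue"],
    "Anosmia": ["anosmia"],
    "Fever + Fatigue + Anosmia": ["fever", "fatigue", "anosmia"],
}

report = {}
for label, names in feature_sets.items():
    # Append the selected symptom flags as extra feature columns
    extra = np.column_stack([symptoms[s] for s in names])
    report[label] = mean_accuracy(np.hstack([omics, extra]), y)

for label, (train_acc, test_acc) in report.items():
    print(f"{label}: train {train_acc:.1%}, test {test_acc:.1%}")
```

In the table above, test accuracy stays near 64% whichever symptom is used, which is the pattern to expect when the added indicators carry little extra signal.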
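Metabolomic data substantially outperforms proteomic data for Long COVID prediction. Random Forest and SVM (Poly) reach about 99% test accuracy on metabolomic data (99.0% and 99.2%), whereas every proteomic model falls between 43% and 59% on the test set despite train accuracy as high as 99.9%, a sign of overfitting. Not every model benefits equally: Ridge Classifier, PLS-DA, and SVM (rbf) stay between 66% and 73% on metabolomic test data. Overall, this suggests that metabolomic signatures are stronger biomarkers of Long COVID status than the proteomic profiles tested here.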
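
The gap between train and test accuracy can be read straight off the results table. The short script below copies the mean accuracies for the Proteomic and Metabolomic rows (standard deviations omitted) and computes the difference per model. The `results` dictionary and the `generalization_gaps` helper are illustrative names, not part of the repository.

```python
# Mean accuracies (%) as (train, test), copied from the results table.
results = {
    "Proteomic": {
        "Decision Tree": (90.9, 50.3),
        "Random Forest": (88.2, 55.1),
        "Random SCM": (91.2, 51.6),
        "Ridge Classifier": (99.9, 52.8),
        "SVM (Poly)": (86.5, 49.3),
        "PLS-DA": (58.0, 50.8),
        "SVM (rbf)": (67.5, 43.4),
    },
    "Metabolomic": {
        "Decision Tree": (100.0, 97.7),
        "Random Forest": (100.0, 99.0),
        "Random SCM": (100.0, 97.3),
        "Ridge Classifier": (100.0, 67.3),
        "SVM (Poly)": (99.2, 99.2),
        "PLS-DA": (66.0, 65.8),
        "SVM (rbf)": (99.6, 72.8),
    },
}


def generalization_gaps(dataset):
    """Train minus test accuracy, in percentage points, for each model."""
    return {
        model: round(train - test, 1)
        for model, (train, test) in results[dataset].items()
    }


for dataset in results:
    print(dataset)
    gaps = generalization_gaps(dataset)
    # Smallest gap first: these models generalize best
    for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
        print(f"  {model}: {gap:.1f} points")
```

On these numbers, Ridge Classifier on proteomic data shows the largest gap (47.1 points), while Random Forest on metabolomic data loses only 1.0 point. A small gap alone is not enough, though: PLS-DA has a small gap on both datasets because its train accuracy is already low.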
| Model | Notes |
|---|---|
| Decision Tree | Baseline interpretable model |
| Random Forest | Among the best on metabolomic data (99.0% test accuracy) |
| Random SCM | Set Covering Machine — rule-based |
| Ridge Classifier | Linear baseline |
| SVM (Poly + RBF) | Polynomial kernel is strong on metabolomic data (99.2% test accuracy); RBF kernel is weaker (72.8%) |
| PLS-DA | Partial Least Squares Discriminant Analysis |
| ElasticNet | Regularized linear model |
RecoverProject/
│
├── 📓 Copy_of_PLS_DA_Metabolomique.ipynb
├── 📓 DecisionTreeMetabolomic*.ipynb
├── 📓 DecisionTreeProteomic.ipynb
├── 📓 ElasticNet.ipynb
├── 📓 MergingDataset.ipynb
├── 📓 MergingDataset_Visualisation.ipynb
├── 📓 PLS_DA_Metabolomique.ipynb
├── 📓 RandomForest_Metabolomic.ipynb
├── 📓 RandomForest_Proteomic*.ipynb
├── 📓 RandomSCM*.ipynb
├── 📓 RidgeClassifier*.ipynb
├── 📓 svm*.ipynb
├── 📓 Recover_CleaningData_code.ipynb ← Data preprocessing
└── 📖 README.md
- AlphaFold 2 integration for protein structure analysis
- Graph Neural Networks for molecular interaction modeling
- Biobank data integration — larger cohort incoming
Sami Bahig, MD MSc, Data Scientist & AI Engineer, Laboratoire Jacques Corbeil / François Laviolette, Université Laval
| Project | Description | Link |
|---|---|---|
| FAERS 2025 | FDA pharmacovigilance — 28M adverse event records | GitHub |
| Protocol TDM | OCR + CamemBERT — Radiology protocol classification | GitHub |
| RAG Chatbot | AI chatbot grounded in documents — LangChain · ChromaDB | GitHub |
MIT License · Sami Bahig · Université Laval · 2022