LuigiGonnella/DiabetPredictor

DiabetPredictor

A comprehensive machine learning pipeline for diabetes prediction using the Pima Indians Diabetes dataset. This project demonstrates an end-to-end data science workflow, including data exploration, preprocessing, feature engineering, model training, hyperparameter tuning, and evaluation.

🎯 Project Overview

This project builds a diabetes prediction system that can help healthcare professionals identify patients at risk of developing diabetes. The pipeline emphasizes high recall (sensitivity) to minimize false negatives, which is critical in medical applications.

Key Features

  • Complete ML Pipeline: From data exploration to model evaluation
  • Medical Focus: Optimized for healthcare applications with emphasis on recall
  • Class Imbalance Handling: Addresses dataset imbalance using multiple techniques
  • Feature Engineering: Creates meaningful features for better prediction
  • Multiple Algorithms: Compares various models including Logistic Regression, Decision Trees, Random Forest, and XGBoost

📊 Dataset

The project uses the Pima Indians Diabetes Database containing medical and demographic data from 768 patients.

Features:

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age in years
  • Outcome: Diabetes status (0: No, 1: Yes)

Source: Kaggle - Pima Indians Diabetes Database

🚀 Getting Started

Prerequisites

pip install pandas numpy scikit-learn matplotlib seaborn xgboost jupyter

Running the Project

  1. Clone or download the repository
  2. Ensure the dataset is saved as data/diabetes.csv
  3. Open DiabetPredictor.ipynb in Jupyter Notebook or VS Code
  4. Run all cells to execute the complete pipeline

🔍 Pipeline Components

1. Data Exploration

  • Statistical analysis of features
  • Distribution visualization
  • Missing value identification
  • Class imbalance analysis
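
The exploration steps above can be sketched roughly as follows, assuming pandas and a DataFrame that uses the column names listed under Dataset. The `explore` helper, the `ZERO_AS_MISSING` list, and the toy rows are illustrative assumptions and are not taken from the notebook; in practice the frame would be loaded from data/diabetes.csv.

```python
import pandas as pd

# Assumption: a zero in these columns is physiologically implausible
# and therefore marks a missing measurement.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def explore(df: pd.DataFrame) -> dict:
    """Return summary statistics, hidden-missing counts, and class balance."""
    return {
        "summary": df.describe(),
        "zero_counts": (df[ZERO_AS_MISSING] == 0).sum(),
        "class_share": df["Outcome"].value_counts(normalize=True),
    }


# Hypothetical toy rows standing in for the real dataset.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116],
    "BloodPressure": [72, 66, 64, 0, 40, 74],
    "SkinThickness": [35, 29, 0, 23, 35, 0],
    "Insulin": [0, 0, 0, 94, 168, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

report = explore(toy)
print(report["zero_counts"])   # zeros per column that likely encode missing values
print(report["class_share"])   # fraction of each class
```

Counting zeros separately from `isna()` matters here because a column can contain no NaN values at all while still holding many placeholder zeros.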

2. Data Preprocessing

  • Missing Value Handling: Replace invalid zeros with NaN and apply mean imputation
  • Feature Engineering: Create Insulin/Glucose ratio and polynomial features
  • Scaling: RobustScaler to handle outliers
  • Train/Test Split: 80/20 split with stratification
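
A minimal sketch of these preprocessing steps, assuming scikit-learn and the column names from the Dataset section. The `preprocess` helper, the `InsulinGlucoseRatio` column name, and the synthetic data are assumptions made for illustration. The sketch also splits before imputing and scaling so that test rows do not influence the fitted statistics; the notebook may order these steps differently.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Assumption: zeros in these columns encode missing measurements.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def preprocess(df, seed=42):
    """Zeros -> NaN, stratified 80/20 split, mean imputation, ratio feature, robust scaling."""
    df = df.copy()
    df[ZERO_AS_MISSING] = df[ZERO_AS_MISSING].replace(0, np.nan)
    X, y = df.drop(columns="Outcome"), df["Outcome"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )

    # Fit the imputer on the training rows only, then reuse it on the test rows.
    imputer = SimpleImputer(strategy="mean")
    X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns, index=X_train.index)
    X_test = pd.DataFrame(imputer.transform(X_test), columns=X.columns, index=X_test.index)

    # Engineered feature: Insulin/Glucose ratio (Glucose is non-zero after imputation).
    for part in (X_train, X_test):
        part["InsulinGlucoseRatio"] = part["Insulin"] / part["Glucose"]

    # RobustScaler centers on the median and scales by the IQR.
    scaler = RobustScaler()
    return scaler.fit_transform(X_train), scaler.transform(X_test), y_train, y_test


# Synthetic stand-in for data/diabetes.csv (values are random, not real patients).
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "Pregnancies": rng.integers(0, 10, n).astype(float),
    "Glucose": rng.uniform(60, 200, n),
    "BloodPressure": rng.uniform(50, 100, n),
    "SkinThickness": rng.uniform(10, 50, n),
    "Insulin": rng.uniform(15, 300, n),
    "BMI": rng.uniform(18, 45, n),
    "DiabetesPedigreeFunction": rng.uniform(0.1, 1.5, n),
    "Age": rng.integers(21, 70, n).astype(float),
    "Outcome": np.array([1] * 35 + [0] * 65),
})
toy.loc[::10, "Insulin"] = 0.0  # plant some invalid zeros

X_train, X_test, y_train, y_test = preprocess(toy)
print(X_train.shape, X_test.shape)
```

Stratifying on `Outcome` keeps the positive rate the same in both partitions, which makes recall on the small test set less sensitive to the particular split.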

3. Model Training & Evaluation

Multiple algorithms tested:

  • Logistic Regression (baseline and polynomial)
  • Decision Trees
  • Random Forest
  • XGBoost (best performer)
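
As an illustration of the baseline versus polynomial logistic regression comparison, here is a hedged sketch using scikit-learn pipelines. The `build_logreg` helper and the synthetic data from `make_classification` are assumptions for demonstration only; the recall values it prints are unrelated to the results reported in this README. The demo loop uses degree 2 to stay fast, while the comparison table in this README refers to degree 4.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, RobustScaler


def build_logreg(degree=1):
    """Scaling -> optional polynomial expansion -> logistic regression."""
    steps = [RobustScaler()]
    if degree > 1:
        steps.append(PolynomialFeatures(degree=degree, include_bias=False))
    steps.append(LogisticRegression(max_iter=5000))
    return make_pipeline(*steps)


# Synthetic, imbalanced stand-in with 8 features (not the Pima data).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

recalls = {}
for degree in (1, 2):
    model = build_logreg(degree).fit(X_tr, y_tr)
    recalls[degree] = recall_score(y_te, model.predict(X_te))
    print(f"degree={degree}: recall={recalls[degree]:.2f}")
```

Wrapping the scaler and the polynomial expansion in one pipeline means both are refit on each training partition, which keeps later cross-validation free of leakage.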

4. Hyperparameter Tuning

  • Grid search for optimal parameters
  • Cross-validation for robust evaluation
  • Class imbalance handling techniques
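
A minimal sketch of a recall-oriented grid search with stratified cross-validation follows. It uses `LogisticRegression` on synthetic data as a stand-in so that it runs with scikit-learn alone; the parameter grid is hypothetical and is not the grid used in the notebook. The same pattern should carry over to an XGBoost classifier, with its own parameters in the grid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (not the Pima data).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)

# Hypothetical grid: regularization strength and class weighting.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
}

# Stratified folds preserve the class ratio in every validation fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="recall",  # select the configuration with the best mean recall
    cv=cv,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("mean CV recall:", round(search.best_score_, 3))
```

Optimizing recall alone can favor models that predict the positive class very often, so it is worth inspecting precision for the selected configuration as well.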

📈 Results

Best Model Performance: XGBoost with Class Balancing

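Metric          | Value | Improvement
----------------|-------|--------------------------------------------------
Recall          | 76%   | 52% relative improvement over the baseline (~50%)
False Negatives | 13    | Reduced from 27
Clinical Impact | High  | Fewer undiagnosed cases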

Model Comparison

Model                  | Recall | False Negatives | Notes
-----------------------|--------|-----------------|-------------------------
Baseline LogReg        | ~0.50  | 27              | Poor sensitivity
LogReg (degree 4)      | ~0.67  | 18              | Polynomial features help
LogReg + Ridge         | ~0.52  | 26              | Over-regularization
XGBoost + Class Weight | ~0.76  | 13              | Best performer
LogReg + Undersampling | ~0.54  | 25              | Information loss
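
The recall and false-negative columns are linked by recall = TP / (TP + FN). The short check below reproduces the reported recalls from the false-negative counts. It assumes the test set contains 54 positive cases, a figure inferred from the baseline row (27 false negatives at about 0.50 recall) rather than stated in the README.

```python
def recall(false_negatives, positives):
    """Recall = TP / (TP + FN), where TP = positives - FN."""
    return (positives - false_negatives) / positives


# Assumption: 54 positives in the test set, inferred from the baseline row.
POSITIVES_IN_TEST = 54

# False-negative counts as reported in the comparison table.
reported_fn = {
    "Baseline LogReg": 27,
    "LogReg (degree 4)": 18,
    "LogReg + Ridge": 26,
    "XGBoost + Class Weight": 13,
    "LogReg + Undersampling": 25,
}

for name, fn in reported_fn.items():
    print(f"{name}: recall = {recall(fn, POSITIVES_IN_TEST):.2f}")

baseline = recall(27, POSITIVES_IN_TEST)
best = recall(13, POSITIVES_IN_TEST)
relative_gain = (best - baseline) / baseline
print(f"relative improvement: {relative_gain:.0%}")  # prints "relative improvement: 52%"
```

Under that assumption, the 52% figure is a relative gain (0.76 versus 0.50), which corresponds to about 26 percentage points of recall.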

🏥 Clinical Relevance

Why High Recall Matters

  • False Negatives = Undiagnosed Diabetes: Missing positive cases in medical screening is dangerous
  • Early Detection: Higher recall enables earlier intervention and treatment
  • Risk Mitigation: Reduces long-term complications from undiagnosed diabetes

Real-World Application

This model can assist healthcare professionals by:

  • Screening patients during routine checkups
  • Identifying high-risk individuals for further testing
  • Supporting clinical decision-making
  • Reducing healthcare costs through early intervention

🔧 Technical Insights

Key Preprocessing Decisions

  1. Missing Value Strategy: Mean imputation chosen over median due to feature distributions
  2. Scaling Method: RobustScaler selected due to presence of outliers
  3. Feature Engineering: Insulin/Glucose ratio captures metabolic relationships
  4. Class Imbalance: Algorithmic approach (scale_pos_weight) outperformed resampling
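
To make the fourth decision concrete, the sketch below contrasts the two approaches on label counts alone. The training counts (400 negatives, 214 positives) are an assumption based on a stratified 80% split of the commonly distributed version of the dataset, and both helper functions are hypothetical. The negative-to-positive ratio is the value XGBoost's documentation suggests as a typical starting point for `scale_pos_weight`.

```python
import random
from collections import Counter


def scale_pos_weight(labels):
    """Ratio of negative to positive examples."""
    counts = Counter(labels)
    return counts[0] / counts[1]


def undersample_indices(labels, seed=0):
    """Keep every positive row and an equally sized random subset of negatives.

    Assumes the negative class is the majority.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return sorted(pos + rng.sample(neg, len(pos)))


# Assumed training label counts (not taken from the notebook).
y_train = [0] * 400 + [1] * 214

weight = scale_pos_weight(y_train)
kept = undersample_indices(y_train)

print(f"scale_pos_weight = {weight:.2f}")
print(f"rows kept after undersampling: {len(kept)} of {len(y_train)}")
```

Weighting keeps every training row and only changes how much each positive error costs, whereas undersampling in this setup would discard 186 negative rows, which is one plausible reading of the "information loss" noted in the model comparison.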

Algorithm Selection

  • XGBoost Success Factors:
    • Gradient boosting captures feature interactions
    • Built-in class imbalance handling
    • Robust to outliers
    • Excellent generalization

📁 Project Structure

DiabetPredictor/
├── DiabetPredictor.ipynb    # Main notebook with complete pipeline
├── data/
│   └── diabetes.csv         # Pima Indians Diabetes dataset
└── README.md               # Project documentation

⚠️ Limitations & Future Work

Current Limitations

  • Dataset Size: Limited to 768 samples
  • Population Specificity: Trained on Pima Indians data
  • Feature Scope: Limited to 8 medical features

Future Enhancements

  • Larger Datasets: Validate on diverse populations
  • Advanced Techniques: SMOTE oversampling, deep learning
  • Feature Selection: Automated feature importance analysis
  • Deployment: Web app or API for clinical use
  • Explainability: SHAP values for model interpretability

📚 Learning Outcomes

This project demonstrates:

  • End-to-end machine learning pipeline development
  • Medical data preprocessing challenges
  • Class imbalance handling strategies
  • Feature engineering for domain-specific applications
  • Model evaluation for high-stakes applications
  • Clinical considerations in ML model development

🤝 Contributing

Feel free to contribute by:

  • Adding new algorithms or techniques
  • Improving preprocessing methods
  • Enhancing visualization
  • Adding model interpretability features
  • Expanding documentation

📄 License

This project is for educational purposes. The dataset is publicly available on Kaggle.


Note: This model is for educational and research purposes only. It should not be used for actual medical diagnosis without proper clinical validation and regulatory approval.
