A comprehensive machine learning pipeline for diabetes prediction using the Pima Indians Diabetes dataset. This project demonstrates an end-to-end data science workflow, including data exploration, preprocessing, feature engineering, model training, hyperparameter tuning, and evaluation.
This project builds a diabetes prediction system that can help healthcare professionals identify patients at risk of developing diabetes. The pipeline emphasizes high recall (sensitivity) to minimize false negatives, which is critical in medical applications.
- Complete ML Pipeline: From data exploration to model evaluation
- Medical Focus: Optimized for healthcare applications with emphasis on recall
- Class Imbalance Handling: Addresses dataset imbalance using multiple techniques
- Feature Engineering: Creates meaningful features for better prediction
- Multiple Algorithms: Compares various models including Logistic Regression, Decision Trees, Random Forest, and XGBoost
The project uses the Pima Indians Diabetes Database containing medical and demographic data from 768 patients.
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age in years
- Outcome: Diabetes status (0: No, 1: Yes)
Source: Kaggle - Pima Indians Diabetes Database
Install the required packages:

```
pip install pandas numpy scikit-learn matplotlib seaborn xgboost jupyter
```

- Clone or download the repository
- Ensure the dataset is in the `data/` directory
- Open `DiabetPredictor.ipynb` in Jupyter Notebook or VS Code
- Run all cells to execute the complete pipeline
- Statistical analysis of features
- Distribution visualization
- Missing value identification
- Class imbalance analysis
- Missing Value Handling: Replace invalid zeros with NaN and apply mean imputation
- Feature Engineering: Create Insulin/Glucose ratio and polynomial features
- Scaling: RobustScaler to handle outliers
- Train/Test Split: 80/20 split with stratification
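The preprocessing steps above can be sketched as follows. This is a minimal illustration on a tiny hand-made frame standing in for `data/diabetes.csv` (column names taken from the dataset description); the actual notebook operates on all 768 rows:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Toy stand-in for data/diabetes.csv (real pipeline loads the full dataset)
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116],
    "Insulin": [0, 94, 168, 94, 168, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# 1. Replace physiologically impossible zeros with NaN, then mean-impute
for col in ["Glucose", "Insulin", "BMI"]:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].mean())

# 2. Feature engineering: Insulin/Glucose ratio
df["Insulin_Glucose_Ratio"] = df["Insulin"] / df["Glucose"]

# 3. Stratified 80/20 split; RobustScaler is fit on the training fold only
#    so that test-set statistics never leak into scaling
X = df.drop(columns="Outcome")
y = df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler only on the training split mirrors how the pipeline would be applied to unseen patients.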
Multiple algorithms tested:
- Logistic Regression (baseline and polynomial)
- Decision Trees
- Random Forest
- XGBoost (best performer)
- Grid search for optimal parameters
- Cross-validation for robust evaluation
- Class imbalance handling techniques
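A recall-oriented grid search with cross-validation can be sketched as below. For brevity this uses scikit-learn's `LogisticRegression` on synthetic data rather than the notebook's XGBoost model; `class_weight="balanced"` plays the same role for a linear model that `scale_pos_weight` plays for XGBoost:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced stand-in for the scaled training data
# (~35% positives, similar to the Pima class balance)
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.65, 0.35], random_state=42
)

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="recall",  # optimize sensitivity, not accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Setting `scoring="recall"` is the key choice: the search selects hyperparameters that minimize false negatives rather than maximize overall accuracy.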
| Metric | Value | Improvement |
|---|---|---|
| Recall | 76% | 52% improvement from baseline |
| False Negatives | 13 | Reduced from 27 |
| Clinical Impact | High | Fewer undiagnosed cases |
| Model | Recall | False Negatives | Notes |
|---|---|---|---|
| Baseline LogReg | ~0.50 | 27 | Poor sensitivity |
| LogReg (degree 4) | ~0.67 | 18 | Polynomial features help |
| LogReg + Ridge | ~0.52 | 26 | Over-regularization |
| XGBoost + Class Weight | ~0.76 | 13 | Best performer |
| LogReg + Undersampling | ~0.54 | 25 | Information loss |
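The recall and false-negative figures in the tables above come from the test-set confusion matrix. A small sketch with hypothetical predictions (constructed to reproduce the reported 13 false negatives, not the notebook's actual outputs) shows the computation:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical test fold: 54 diabetic, 100 non-diabetic patients,
# with 13 positives missed by the model
y_true = np.array([1] * 54 + [0] * 100)
y_pred = np.array([1] * 41 + [0] * 13 + [0] * 90 + [1] * 10)

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels (0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)  # TP / (TP + FN)
print(f"recall={recall:.2f}, false negatives={fn}")
```

Recall here is 41 / 54 ≈ 0.76, so each avoided false negative directly raises the sensitivity the tables report.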
- False Negatives = Undiagnosed Diabetes: Missing positive cases in medical screening is dangerous
- Early Detection: Higher recall enables earlier intervention and treatment
- Risk Mitigation: Reduces long-term complications from undiagnosed diabetes
This model can assist healthcare professionals by:
- Screening patients during routine checkups
- Identifying high-risk individuals for further testing
- Supporting clinical decision-making
- Reducing healthcare costs through early intervention
- Missing Value Strategy: Mean imputation chosen over median due to feature distributions
- Scaling Method: RobustScaler selected due to presence of outliers
- Feature Engineering: Insulin/Glucose ratio captures metabolic relationships
- Class Imbalance: Algorithmic approach (scale_pos_weight) outperformed resampling
- XGBoost Success Factors:
- Gradient boosting captures feature interactions
- Built-in class imbalance handling
- Robust to outliers
- Excellent generalization
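The `scale_pos_weight` value mentioned above is conventionally set to the ratio of negative to positive training samples. A minimal sketch, assuming the standard Pima class counts (500 non-diabetic, 268 diabetic):

```python
import numpy as np

# Outcome labels with the Pima dataset's class balance
y = np.array([0] * 500 + [1] * 268)

# XGBoost's scale_pos_weight up-weights errors on the minority
# (diabetic) class; the usual heuristic is n_negative / n_positive
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(round(scale_pos_weight, 2))  # ≈ 1.87
```

Unlike undersampling, this reweighting keeps every training sample, which is consistent with the observation that the algorithmic approach outperformed resampling here.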
```
DiabetPredictor/
├── DiabetPredictor.ipynb   # Main notebook with complete pipeline
├── data/
│   └── diabetes.csv        # Pima Indians Diabetes dataset
└── README.md               # Project documentation
```
- Dataset Size: Limited to 768 samples
- Population Specificity: Trained on Pima Indians data
- Feature Scope: Limited to 8 medical features
- Larger Datasets: Validate on diverse populations
- Advanced Techniques: SMOTE oversampling, deep learning
- Feature Selection: Automated feature importance analysis
- Deployment: Web app or API for clinical use
- Explainability: SHAP values for model interpretability
This project demonstrates:
- End-to-end machine learning pipeline development
- Medical data preprocessing challenges
- Class imbalance handling strategies
- Feature engineering for domain-specific applications
- Model evaluation for high-stakes applications
- Clinical considerations in ML model development
Feel free to contribute by:
- Adding new algorithms or techniques
- Improving preprocessing methods
- Enhancing visualization
- Adding model interpretability features
- Expanding documentation
This project is for educational purposes. The dataset is publicly available on Kaggle.
Note: This model is for educational and research purposes only. It should not be used for actual medical diagnosis without proper clinical validation and regulatory approval.