A simple machine learning pipeline for predicting diabetes risk from a range of health indicators. Built purely for study purposes, it follows a structured process of data preparation, model tuning, evaluation, and an analysis of the discrepancy caused by class imbalance.
This project trains a logistic regression classifier on the CDC BRFSS Diabetes Health Indicators dataset to estimate diabetes risk from 21 lifestyle and health survey features (BMI, blood pressure, cholesterol, physical activity, general health, age, etc.).
The focus is not just on building a model that performs well, but on evaluating it honestly: showing how the same model can look good on balanced test data and noticeably weaker on realistic, imbalanced populations.
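The gap between balanced and imbalanced evaluation can be illustrated with plain arithmetic. The sketch below is not this project's code or results: the helper names (`confusion_counts`, `summarize`), the 75% sensitivity and specificity, and the 14% prevalence are all hypothetical values chosen for illustration. It holds the classifier's per-class behavior fixed and changes only the share of positive cases in the test set.

```python
def confusion_counts(n, prevalence, sensitivity, specificity):
    """Expected confusion-matrix counts for a classifier with fixed per-class rates."""
    positives = n * prevalence
    negatives = n - positives
    tp = positives * sensitivity   # positives correctly flagged
    fn = positives - tp            # positives missed
    tn = negatives * specificity   # negatives correctly cleared
    fp = negatives - tn            # negatives wrongly flagged
    return tp, fp, tn, fn


def summarize(n, prevalence, sensitivity, specificity):
    """Accuracy, precision, recall, and F1 for the expected counts."""
    tp, fp, tn, fn = confusion_counts(n, prevalence, sensitivity, specificity)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical classifier: 75% sensitivity and 75% specificity (illustrative only).
balanced = summarize(10_000, prevalence=0.50, sensitivity=0.75, specificity=0.75)
# Assumed 14% positive rate, standing in for a realistic imbalanced population.
realistic = summarize(10_000, prevalence=0.14, sensitivity=0.75, specificity=0.75)

for name, metrics in [("balanced", balanced), ("realistic", realistic)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Under these assumed numbers, accuracy and recall stay at 0.75 in both settings, while precision falls from 0.75 to about 0.33 and F1 to about 0.46, because false positives from the much larger negative class outnumber the true positives. This is why a score reported on a balanced test split should not be read as the score to expect on a realistic population.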