This project is for learning classification approaches applied to Brain Cancer MRI data, available on Kaggle.
The repository is aimed at benchmarking rather than picking a single model and hyperparameter-tuning it for the best achievable results, though there will be some of that as well.
Clone the repository:

```shell
git clone https://github.com/dorukresmi/kaggle_brain_cancer_mri.git
cd kaggle_brain_cancer_mri
```

Install dependencies:

```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Download and prepare the dataset:
- Download the dataset from Mendeley Data.
- Place the extracted data in the `Brain_Cancer/` directory (or update the path in `argument_parser.py`).
Run an experiment:
- With a config file:

  ```shell
  python main.py --config config.yaml
  ```

- Or directly from the command line:

  ```shell
  python main.py --models_ft resnet152 --seed 42
  python main.py --models_ml resnet152 --classifier xgboost --seed 42
  ```
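For reference, a config file mirroring the CLI flags might look like the following sketch (the key names are assumed to match the argument names, and the listed values are only illustrative):

```yaml
models_ft: [resnet152, densenet201]
models_ml: [resnet152]
classifier: [xgboost]
seeds: [42]
parallel: 2
```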
The code is designed to be used via the CLI, in an environment that supports parallelism, for easy scale-up. For each parameter combination provided by the user, a training job is created and run in parallel. If results for a given combination already exist, that training run is skipped.
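The scheduling described above can be sketched as follows. This is a minimal illustration, not the repository's actual API: the function names and the one-result-file-per-combination layout are assumptions.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def result_path(results_dir, model, seed):
    # Hypothetical naming scheme: one result file per parameter combination.
    return Path(results_dir) / f"{model}_seed{seed}.json"

def run_job(combo):
    model, seed = combo
    # Placeholder for the actual fine-tuning / feature-extraction job.
    return f"trained {model} (seed {seed})"

def launch(models, seeds, results_dir, workers=2):
    # Build all parameter combinations, skipping those with existing results.
    combos = [c for c in itertools.product(models, seeds)
              if not result_path(results_dir, *c).exists()]
    # Threads keep the sketch portable; the real pipeline could use processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, combos))
```

The existence check happens before jobs are dispatched, so already-completed combinations cost nothing.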
To allow seamless testing of different approaches, a unified model-training pipeline is to be developed. The function will take the training choices as parameters and set up the necessary pipeline automatically. This will let the user test different ideas or training schemes just by providing the right parameters, perhaps even with some training parallelism, though the whole process is meant to run on (my) moderate laptop.
Here is the list of all possible arguments for training:

- `--config <PATH>`: Path to the configuration YAML file.
- `--models_ft <STR>`: The pretrained vision models to be fine-tuned.
- `--models_ml <STR>`: The pretrained vision models to be used for feature extraction.
- `--classifier <STR>`: The machine learning models to be used for classification, in tandem with feature extraction.
- `--seeds <INT>`: List of random seeds for reproducible training runs.
- `--parallel <INT>`: Number of parallel jobs to run.
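A parser for these flags might be set up roughly like this (a sketch only; the repository's `argument_parser.py` may differ in defaults and types):

```python
import argparse

def build_parser():
    # Mirrors the argument list above; nargs="*" lets users pass several
    # models, classifiers, or seeds for the mix-and-match benchmarking.
    p = argparse.ArgumentParser(description="Brain Cancer MRI benchmarking")
    p.add_argument("--config", type=str, help="Path to the configuration YAML file")
    p.add_argument("--models_ft", nargs="*", default=[], help="Models to fine-tune")
    p.add_argument("--models_ml", nargs="*", default=[], help="Models used as feature extractors")
    p.add_argument("--classifier", nargs="*", default=[], help="ML classifiers paired with --models_ml")
    p.add_argument("--seeds", nargs="*", type=int, default=[42], help="Random seeds")
    p.add_argument("--parallel", type=int, default=1, help="Number of parallel jobs")
    return p
```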
The dataset consists of grayscale MRI scans of brain cancer patients, collected from hospitals across Bangladesh. Three classes of brain cancer are present in this dataset: meningioma, glioma and pituitary tumor; there are no images of healthy individuals.
The original source of the data, with further information:
https://data.mendeley.com/datasets/mk56jw9rns/1
Some investigations before moving on. Even though the dataset has a 10.0 score on Kaggle, I wanted to make sure there are no unpleasant surprises.
Findings:
- All images are truly grayscale, even though they are stored as RGB images.
- All images have the exact same size, 512×512.
- There were duplicates (44 extra images, to be exact) in the brain_tumor class, which I removed. This normalized the class sizes to 2004 across all classes (nice!)
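Exact duplicates can be found by hashing raw file bytes; a minimal sketch of this check (the directory layout and `.jpg` extension are assumptions):

```python
import hashlib
from pathlib import Path

def file_hash(path):
    # MD5 of the raw bytes is enough to detect byte-identical duplicate files.
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def find_duplicates(image_dir):
    # Keep the first file for each hash; report later copies as duplicates.
    seen, dupes = {}, []
    for p in sorted(Path(image_dir).glob("**/*.jpg")):
        h = file_hash(p)
        if h in seen:
            dupes.append(p)
        else:
            seen[h] = p
    return dupes
```

Note this only catches byte-identical copies; near-duplicates (re-encoded or rescaled scans) would need perceptual hashing instead.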
The initial implementation trains a direct classifier. It must be noted, though, that the cancers to be detected are localized in specific regions of the brain, so a segmentation-based approach might eventually be necessary. For the time being, classification is preferred for its simplicity.
Generally speaking, there are two training "branches" in this repository. The first is directly fine-tuning the pretrained vision models on the image dataset at hand. The second uses the same models with the classification head removed as feature extractors, whose outputs are then used to train machine learning algorithms.
In line with the learning goals of this repository, 4 different pretrained neural networks will be tested, both fine-tuned on the dataset and used as feature extractors followed by an ML classification algorithm, in a mix-and-match manner, resulting in 4 × (#ML + 1) models trained in total.
The used pretrained models are:
- ResNet152
- EfficientNetV2_L
- SwinTransformer_v2_b (if I can get that working)
- Densenet201
And the ML algorithms used for classification are:
- XGBoost
- SVM with RBF Kernel
- Random Forest
- Logistic Regression (as a baseline)
- Quadratic Discriminant Analysis
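The ML branch can treat these classifiers interchangeably behind a small registry; a scikit-learn-only sketch (XGBoost omitted here, and the registry keys are hypothetical, not the repository's actual `--classifier` values):

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Map CLI-style names to unfitted estimators; all share the fit/predict API.
CLASSIFIERS = {
    "svm_rbf": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "logistic": LogisticRegression(max_iter=1000),
    "qda": QuadraticDiscriminantAnalysis(),
}

def fit_on_features(name, features, labels):
    # Train the chosen classifier on features extracted by a vision backbone.
    return CLASSIFIERS[name].fit(features, labels)
```

Because every estimator follows the same interface, swapping classifiers is just a dictionary lookup, which is what makes the mix-and-match benchmarking cheap to implement.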
Since the preferred pretrained models are bulky to begin with, they perform poorly on limited hardware. Feel free to increase resources or reduce the model size as preferred.
This project is for educational and research purposes. Please check the dataset's original license for usage restrictions.