This project is for learning classification approaches applied to Brain Cancer MRI data, available on Kaggle.
The repository is aimed at benchmarking rather than picking a single model and hyperparameter-tuning it for the best achievable results, though there will be some of that as well.
Clone the repository:

```shell
git clone https://github.com/dorukresmi/kaggle_brain_cancer_mri.git
cd kaggle_brain_cancer_mri
```

Install dependencies:

```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Download and prepare the dataset:
- Download the dataset from Mendeley Data.
- Place the extracted data in the `Brain_Cancer/` directory (or update the path in `argument_parser.py`).
Run an experiment:
- With a config file:

  ```shell
  python main.py --config config.yaml
  ```

- Or directly from the command line:

  ```shell
  python main.py --models_ft resnet152 --seed 42
  python main.py --models_ml resnet152 --classifier xgboost --seed 42
  ```
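For reference, a config file mirroring the CLI flags might look like the following sketch (the key names are assumed to match the argument names, and the listed values are only illustrative):

```yaml
models_ft: [resnet152, densenet201]
models_ml: [resnet152]
classifier: [xgboost]
seeds: [42]
parallel: 2
```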
The code is designed to be used via the CLI, in an environment that supports parallelism, for easy scale-up. For each parameter combination provided by the user, a training job is created and run in parallel. If results for a given combination already exist, that training run is skipped.
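The scheduling described above can be sketched as follows. This is a minimal illustration, not the repository's actual API: the function names and the one-result-file-per-combination layout are assumptions.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def result_path(results_dir, model, seed):
    # Hypothetical naming scheme: one result file per parameter combination.
    return Path(results_dir) / f"{model}_seed{seed}.json"

def run_job(combo):
    model, seed = combo
    # Placeholder for the actual fine-tuning / feature-extraction job.
    return f"trained {model} (seed {seed})"

def launch(models, seeds, results_dir, workers=2):
    # Build all parameter combinations, skipping those with existing results.
    combos = [c for c in itertools.product(models, seeds)
              if not result_path(results_dir, *c).exists()]
    # Threads keep the sketch portable; the real pipeline could use processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, combos))
```

The existence check happens before jobs are dispatched, so already-completed combinations cost nothing.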
To allow seamless testing of different approaches, a unified model-training pipeline is to be developed. The function will take the training choices as parameters and set up the necessary pipeline automatically. This will let the user test different ideas or training schemes just by providing the right parameters, perhaps even with some training parallelism, though the whole process is meant to run on (my) moderate laptop.
Here is the list of all possible arguments for training:

- `--config <PATH>`: Path to the configuration YAML file.
- `--models_ft <STR>`: The pretrained vision models to be fine-tuned.
- `--models_ml <STR>`: The pretrained vision models to be used for feature extraction.
- `--classifier <STR>`: The machine learning models to be used for classification, in tandem with feature extraction.
- `--seeds <INT>`: List of random seeds for reproducible training runs.
- `--parallel <INT>`: Number of parallel jobs to run.
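A parser for these flags might be set up roughly like this (a sketch only; the repository's `argument_parser.py` may differ in defaults and types):

```python
import argparse

def build_parser():
    # Mirrors the argument list above; nargs="*" lets users pass several
    # models, classifiers, or seeds for the mix-and-match benchmarking.
    p = argparse.ArgumentParser(description="Brain Cancer MRI benchmarking")
    p.add_argument("--config", type=str, help="Path to the configuration YAML file")
    p.add_argument("--models_ft", nargs="*", default=[], help="Models to fine-tune")
    p.add_argument("--models_ml", nargs="*", default=[], help="Models used as feature extractors")
    p.add_argument("--classifier", nargs="*", default=[], help="ML classifiers paired with --models_ml")
    p.add_argument("--seeds", nargs="*", type=int, default=[42], help="Random seeds")
    p.add_argument("--parallel", type=int, default=1, help="Number of parallel jobs")
    return p
```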
The dataset consists of grayscale MRI scans of brain cancer patients, collected from hospitals across Bangladesh. Three classes of brain cancer are present in this dataset: meningioma, glioma and pituitary tumor; there are no images of healthy individuals.
The original source of the data, with further information:
https://data.mendeley.com/datasets/mk56jw9rns/1
Some investigations before moving on. Even though the dataset has a 10.0 score on Kaggle, I wanted to make sure there are no unpleasant surprises.
Findings:
- All images are truly grayscale, even though they are stored as RGB images.
- All images have the exact same size, 512×512.
- There were duplicates (44 extra images, to be exact) in the brain_tumor class, which I removed. This normalized the class sizes to 2004 across all classes (nice!)
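Exact duplicates can be found by hashing raw file bytes; a minimal sketch of this check (the directory layout and `.jpg` extension are assumptions):

```python
import hashlib
from pathlib import Path

def file_hash(path):
    # MD5 of the raw bytes is enough to detect byte-identical duplicate files.
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def find_duplicates(image_dir):
    # Keep the first file for each hash; report later copies as duplicates.
    seen, dupes = {}, []
    for p in sorted(Path(image_dir).glob("**/*.jpg")):
        h = file_hash(p)
        if h in seen:
            dupes.append(p)
        else:
            seen[h] = p
    return dupes
```

Note this only catches byte-identical copies; near-duplicates (re-encoded or rescaled scans) would need perceptual hashing instead.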
The initial implementation trains a direct classifier. It must be noted, though, that the cancers to be detected are localized in specific regions of the brain, so a segmentation-based approach might eventually be necessary. For the time being, classification is preferred for its simplicity.
Generally speaking, there are two training "branches" in this repository. The first is directly fine-tuning the pretrained vision models on the image dataset at hand. The second uses the same models with the classification head removed as feature extractors, whose outputs are then used to train machine learning algorithms.
In line with the learning goals of this repository, 4 different pretrained neural networks will be tested, both fine-tuned on the dataset and used as feature extractors followed by an ML classification algorithm, in a mix-and-match manner, resulting in 4 × (#ML + 1) models trained in total.
The used pretrained models are:
- ResNet152
- EfficientNetV2_L
- SwinTransformer_v2_b (if I can get that working)
- Densenet201
And the ML algorithms used for classification are:
- XGBoost
- SVM with RBF Kernel
- Random Forest
- Logistic Regression (as a baseline)
- Quadratic Discriminant Analysis
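The ML branch can treat these classifiers interchangeably behind a small registry; a scikit-learn-only sketch (XGBoost omitted here, and the registry keys are hypothetical, not the repository's actual `--classifier` values):

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Map CLI-style names to unfitted estimators; all share the fit/predict API.
CLASSIFIERS = {
    "svm_rbf": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "logistic": LogisticRegression(max_iter=1000),
    "qda": QuadraticDiscriminantAnalysis(),
}

def fit_on_features(name, features, labels):
    # Train the chosen classifier on features extracted by a vision backbone.
    return CLASSIFIERS[name].fit(features, labels)
```

Because every estimator follows the same interface, swapping classifiers is just a dictionary lookup, which is what makes the mix-and-match benchmarking cheap to implement.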
Since the preferred pretrained models are bulky to begin with, they perform poorly on limited hardware. Feel free to increase resources or reduce the model size as preferred.
This project is for educational and research purposes. Please check the dataset's original license for usage restrictions.