dorukresmi/kaggle_brain_cancer_mri

Brain Cancer Classification Challenge

This project explores several classification approaches on Brain Cancer MRI data, available on Kaggle.

The repository takes a benchmarking approach: it compares many models rather than picking a single one and tuning its hyperparameters for the best achievable result, though some tuning is included as well.


Quickstart

Clone the repository:

git clone https://github.com/dorukresmi/kaggle_brain_cancer_mri.git
cd kaggle_brain_cancer_mri

Install dependencies:

python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt

Download and prepare the dataset:

  • Download the dataset from Mendeley Data.
  • Place the extracted data in the Brain_Cancer/ directory (or update the path in argument_parser.py).

Run an experiment:

  • With a config file:
    python main.py --config config.yaml
  • Or directly from the command line:
    python main.py --models_ft resnet152 --seed 42
    python main.py --models_ml resnet152 --classifier xgboost --seed 42

Code

The code is designed to be used via the CLI, in an environment that supports parallelism, so that it scales up easily. For each parameter combination provided by the user, a training job is created, and the jobs run in parallel. If results for a given combination already exist, that training run is skipped.
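The skip-if-done behaviour described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `result_path` naming scheme, the `run_job` and `run_all` helpers, and the placeholder "training" step are all hypothetical. Threads are used only to keep the sketch short; real training jobs would more likely run in separate processes.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from pathlib import Path


def result_path(results_dir: Path, model: str, classifier: str, seed: int) -> Path:
    """Hypothetical naming scheme: one JSON file per parameter combination."""
    return results_dir / f"{model}__{classifier}__seed{seed}.json"


def run_job(results_dir: Path, model: str, classifier: str, seed: int) -> str:
    """Run one combination unless its result file is already on disk."""
    out = result_path(results_dir, model, classifier, seed)
    if out.exists():
        return "skipped"
    # Placeholder for the real training call, which is not shown here.
    metrics = {"model": model, "classifier": classifier, "seed": seed}
    out.write_text(json.dumps(metrics))
    return "trained"


def run_all(results_dir, models, classifiers, seeds, parallel=2):
    """Create one job per (model, classifier, seed) and run them concurrently."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    combos = list(product(models, classifiers, seeds))
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        statuses = list(pool.map(lambda c: run_job(results_dir, *c), combos))
    return dict(zip(combos, statuses))
```

Calling `run_all` a second time with an extended seed list only trains the new combinations, since the earlier result files are found on disk.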

To make testing different approaches seamless, a unified model training pipeline is to be developed. The function will take the training choices as parameters and set up the necessary pipeline automatically. This will let the user test different ideas or training schemes just by providing the necessary parameters, possibly with some training parallelism, though the whole process is designed to run on (my) moderate laptop.

Configuration File

Here is the list of all possible arguments for the training.

  • --config <PATH>
    Path to the configuration YAML file

  • --models_ft <STR>
    The pretrained vision models to be used for finetuning

  • --models_ml <STR>
    The pretrained vision models to be used for feature extraction

  • --classifier <STR>
    The machine learning models to be used for classification, to be used in tandem with feature extraction

  • --seeds <INT>
    List of random seeds for reproducible training runs.

  • --parallel <INT>
    Number of parallel jobs to run.
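
As an illustration of how the arguments listed above could be declared, here is a minimal `argparse` sketch. The flag names come from the list; the `build_parser` helper, the `nargs` choices, and the default values are assumptions, and the repository's own argument_parser.py may be organised differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(description="Brain cancer MRI benchmarking")
    parser.add_argument("--config", type=str, default=None,
                        help="Path to the configuration YAML file")
    parser.add_argument("--models_ft", nargs="*", default=[],
                        help="Pretrained vision models to finetune")
    parser.add_argument("--models_ml", nargs="*", default=[],
                        help="Pretrained vision models used for feature extraction")
    parser.add_argument("--classifier", nargs="*", default=[],
                        help="ML classifiers trained on the extracted features")
    # The defaults below are assumptions made for this sketch.
    parser.add_argument("--seeds", nargs="+", type=int, default=[42],
                        help="Random seeds for reproducible training runs")
    parser.add_argument("--parallel", type=int, default=1,
                        help="Number of parallel jobs to run")
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(
    ["--models_ml", "resnet152", "--classifier", "xgboost", "svm", "--seeds", "0", "1"]
)
```

Accepting several values per flag is what makes the one-job-per-combination design convenient: each list becomes one axis of the experiment grid.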


Dataset

The dataset consists of grayscale MRI scans of brain cancer patients, collected from hospitals across Bangladesh. It contains 3 classes of brain cancer images: meningioma, glioma and pituitary tumor. There are no images of healthy individuals.

The original source of the data, with further information:

https://data.mendeley.com/datasets/mk56jw9rns/1

Some insights on the dataset

Some investigations before moving on. Even though the dataset has a 10.0 score on Kaggle, I wanted to make sure that there were no unpleasant surprises.

Findings:

  • All images are truly grayscale, even though they are stored as RGB images
  • All images have exactly the same size, 512×512
  • There are duplicates (44 extra images, to be exact) in the brain_tumor dataset, which I removed. This equalized the class sizes to 2004 across all classes (nice!)
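
Checks of this kind can be sketched with the standard library alone. The example below assumes that "duplicate" means byte-identical files, which may not be the criterion actually used for the findings above; the helper names are hypothetical. The grayscale check takes pixels as (r, g, b) tuples, however they are loaded.

```python
import hashlib
from pathlib import Path


def find_extra_copies(root) -> list[Path]:
    """Return every file whose bytes match an earlier file under `root`."""
    seen = {}    # SHA-256 digest -> first path seen with that content
    extras = []  # later files with identical content
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            extras.append(path)
        else:
            seen[digest] = path
    return extras


def is_grayscale(pixels) -> bool:
    """True if every (r, g, b) pixel has three equal channel values."""
    return all(r == g == b for r, g, b in pixels)
```

Hashing the raw bytes only catches exact copies. Re-encoded or resized duplicates would need a perceptual comparison instead.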

Training Approach

The initial implementation trains a direct classifier. Note, however, that the cancer to be detected is localized in specific regions of the brain, so a segmentation-based approach might be necessary as well. For the time being, classification is preferred for its simplicity.

Generally speaking, there are 2 training "branches" in this repository. The first directly finetunes the pretrained vision models on the image dataset at hand. The second uses the same models, with the classification head removed, as feature extractors whose outputs are used to train machine learning algorithms.

Models

In line with the learning goals of this repository, 4 different pretrained neural networks will be tested, both finetuned on the dataset and used as feature extractors followed by an ML classification algorithm, in a mix-n-match manner. This results in 4 × (#ML + 1) models in total: 4 finetuned networks plus 4 × #ML extractor and classifier pairs.

The used pretrained models are:

  • ResNet152
  • EfficientNetV2_L
  • SwinTransformer_v2_b (if I can get that working)
  • DenseNet201

And the ML algorithms used for classification are:

  • XGBoost
  • SVM with RBF Kernel
  • Random Forest
  • Logistic Regression (as a baseline)
  • Quadratic Discriminant Analysis
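
To make the size of the experiment grid concrete, the two lists above can be enumerated as below. The count assumes one finetuning run per network plus one run per network and classifier pair, per seed; that reading of the mix-n-match setup is an interpretation, and the string labels are only illustrative.

```python
from itertools import product

BACKBONES = ["ResNet152", "EfficientNetV2_L", "SwinTransformer_v2_b", "DenseNet201"]
CLASSIFIERS = ["XGBoost", "SVM (RBF)", "Random Forest", "Logistic Regression", "QDA"]

# Branch 1: each backbone finetuned end to end.
finetune_runs = [(backbone, "finetune") for backbone in BACKBONES]

# Branch 2: each backbone as a feature extractor, paired with each classifier.
extractor_runs = list(product(BACKBONES, CLASSIFIERS))

all_runs = finetune_runs + extractor_runs
print(len(finetune_runs), len(extractor_runs), len(all_runs))  # 4 20 24
```

Each additional seed multiplies this total, which is why skipping combinations that already have results matters on modest hardware.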

Known Issues

Since the preferred pretrained models are bulky to begin with, they perform poorly on limited hardware. Feel free to increase resources or reduce the model size as preferred.


License

This project is for educational and research purposes. Please check the dataset's original license for usage restrictions.
