This project implements a machine learning pipeline for emotion classification from text data. It uses Python, scikit-learn, and NLTK for preprocessing, feature extraction, model training, and evaluation.
- Data Loading & Persistence:
  - Uses a `PersistenceManager` class to load and save datasets and trained models.
- Preprocessing:
  - Cleans text by lowercasing, removing digits, punctuation, special characters, and stop words (with some negations kept).
  - Applies lemmatization.
  - Handles missing values in the text column.
  - Uses a scikit-learn `Pipeline` for modular preprocessing.
- Feature Extraction:
  - Uses `TfidfVectorizer` to convert cleaned text into numerical features.
- Model Training:
  - Trains a Support Vector Classifier (SVC) with a pipeline that includes preprocessing and feature extraction.
  - Performs cross-validation to estimate model performance.
  - Uses `RandomizedSearchCV` for hyperparameter tuning.
- Evaluation:
  - Prints F1 score, accuracy, classification report, and confusion matrix on the test set.
- Model Saving:
  - Saves the best trained model for later use.
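
The cleaning rules listed under Preprocessing can be sketched with the standard library alone. This is a minimal illustration, not the code in `preprocess.py`: the function name `clean_text`, the tiny stop-word set, and the choice to map missing values to an empty string are all assumptions, and lemmatization is left out because it needs NLTK corpora to be downloaded first.

```python
import re
import string

# Assumption: a tiny illustrative stop-word list. The project uses NLTK's list.
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "was", "i", "it",
              "this", "to", "of", "and", "not", "no", "never"}
# Negations change the emotion of a sentence, so they are kept.
NEGATIONS = {"not", "no", "never"}
EFFECTIVE_STOP_WORDS = STOP_WORDS - NEGATIONS

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase, strip digits/punctuation/special characters, drop stop words."""
    # Missing values (None, NaN, or any non-string) become an empty string.
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove digits
    text = text.translate(_PUNCT_TABLE)    # remove punctuation
    text = re.sub(r"[^a-z\s]", " ", text)  # remove remaining special characters
    tokens = [t for t in text.split() if t not in EFFECTIVE_STOP_WORDS]
    return " ".join(tokens)


print(clean_text("I am NOT happy; this is 2 late!!"))  # not happy late
```

A function like this could be wrapped in a scikit-learn `FunctionTransformer` so that it becomes a step of the `Pipeline`, which is one way to keep preprocessing and model training in a single object.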
- Install dependencies:
  `pip install -r requirements.txt`
- Prepare your dataset:
  Ensure your data file contains at least two columns: `text` and `emotion`.
- Run training:
  `python train.py`
- Model output:
  The best model is saved using the `PersistenceManager`.
- Streamlit:
  `streamlit run main.py`
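
For the dataset preparation step, a quick check before training can catch a missing column early. The sketch below is only an illustration: the helper name `check_dataset`, the decision to drop unlabeled rows, and the decision to replace missing text with an empty string are assumptions, not the behaviour of `train.py`.

```python
import pandas as pd

REQUIRED_COLUMNS = ("text", "emotion")


def check_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Validate the required columns and normalise missing values (hypothetical helper)."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Dataset is missing required column(s): {missing}")
    # Assumption: rows without a label cannot be used for supervised training.
    df = df.dropna(subset=["emotion"]).copy()
    # Assumption: missing text is replaced with an empty string.
    df["text"] = df["text"].fillna("").astype(str)
    return df


# Made-up rows, only to show the effect of the check.
sample = pd.DataFrame({
    "text": ["I am so happy", None, "This is awful"],
    "emotion": ["joy", "sadness", None],
})
checked = check_dataset(sample)
print(checked)
```

In practice the frame would come from the data file, for example via `pd.read_csv`, rather than from an inline dictionary.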
- `train.py`: Main script for training, validation, and model saving.
- `preprocess.py`: Preprocessing functions and pipeline components.
- `persistence_manager.py`: Handles loading and saving of data/models.
- `requirements.txt`: Python dependencies.
- Python 3.7+
- scikit-learn
- pandas
- numpy
- nltk
- matplotlib
- The pipeline is designed to be robust to missing text values.
- All preprocessing steps are encapsulated in the pipeline for reproducibility.
- Hyperparameter search is performed for SVC’s `C`, `gamma`, and `kernel` parameters.
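
A randomized search over `C`, `gamma`, and `kernel` might look roughly like the following. This is a sketch under stated assumptions rather than the configuration in `train.py`: the step names `tfidf` and `svc`, the search ranges, `n_iter`, the fold count, the macro-F1 scoring, and the toy sentences are all invented for illustration.

```python
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up toy data, only to make the sketch runnable.
texts = [
    "i feel wonderful today", "what a lovely surprise", "this makes me so happy",
    "i love this sunny morning", "we had a great time", "such a joyful reunion",
    "i feel terrible and alone", "this loss hurts so much", "what a gloomy sad day",
    "i miss my old friends", "everything went wrong again", "i cried all night",
]
labels = ["joy"] * 6 + ["sadness"] * 6

# Feature extraction and classifier in one pipeline, so the vectorizer is
# refit on the training part of every cross-validation fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC()),
])

# Assumed search space: parameters are addressed as <step name>__<parameter>.
param_distributions = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": ["scale", "auto"],
    "svc__kernel": ["linear", "rbf"],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    cv=3,                # stratified 3-fold cross-validation
    scoring="f1_macro",
    random_state=0,
)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["i feel happy today"]))
```

Because `refit=True` is the default, `search` can be used directly for prediction after fitting, and `search.best_estimator_` is the object one would pass on for saving.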