Update (Apr 6, 2021): This repo is based on Sentiment Analysis on Tweets.
We use and compare various methods for sentiment analysis on tweets (a multi-class classification problem).
Download the dataset here.
If you want to run models like CNN or LSTM, a pre-trained word embedding file is required. Download here. In this project we use glove.twitter.27B.zip.
There are some general library requirements for the project and some which are specific to individual methods. The general requirements are as follows.
- numpy
- scikit-learn
- scipy
- nltk
The library requirements specific to some methods are:
- keras with TensorFlow backend for Logistic Regression, MLP, RNN (LSTM), and CNN.
- xgboost for XGBoost.
Note: It is recommended to use the Anaconda distribution of Python.
Make sure the dataset and the GloVe embedding file are in the twitter-sentiment-analysis/dataset directory. Rename the dataset file to training.csv. A separate test file is not needed here, since 10% of the training dataset is held out as validation data.
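The 10% validation hold-out can be sketched as follows. This is a minimal illustration and not the code used in this repo: the function name split_train_validation, the fixed seed, and the shuffle-then-slice strategy are all assumptions made for the example.

```python
import random


def split_train_validation(rows, validation_fraction=0.1, seed=0):
    """Split rows into (train, validation) lists.

    Hypothetical helper: the name, the seed, and the shuffling strategy
    are assumptions, not taken from the scripts in this repo.
    """
    indices = list(range(len(rows)))
    # Shuffle with a private RNG so the split is reproducible.
    random.Random(seed).shuffle(indices)
    n_val = int(len(rows) * validation_fraction)
    val_idx = set(indices[:n_val])
    train = [row for i, row in enumerate(rows) if i not in val_idx]
    validation = [row for i, row in enumerate(rows) if i in val_idx]
    return train, validation


# Toy data standing in for the rows of training.csv.
toy_rows = [("tweet %d" % i, i % 3) for i in range(100)]
toy_train, toy_val = split_train_validation(toy_rows)
print(len(toy_train), len(toy_val))
```

Fixing the seed keeps the validation set identical across runs, which makes accuracy figures from different methods easier to compare.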
Make sure to create a directory named models inside the directory twitter-sentiment-analysis/code.
For the embedding file, use glove.twitter.27B.200d.txt, since the embedding dimension defined in cnn.py is 200.
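GloVe text files store one word per line followed by its space-separated vector components. The sketch below shows one way such a file could be parsed while checking that every vector has the expected dimension. The function name load_glove is hypothetical, and this is not necessarily how cnn.py or lstm.py read the file.

```python
def load_glove(lines, expected_dim=200):
    """Parse GloVe-style text lines into a {word: [float, ...]} dict.

    Hypothetical helper for illustration. Raises ValueError if a vector
    does not have expected_dim components.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        values = [float(v) for v in parts[1:]]
        if len(values) != expected_dim:
            raise ValueError(
                "expected %d dimensions for %r, got %d"
                % (expected_dim, word, len(values))
            )
        vectors[word] = values
    return vectors


# With the real file (path assumed from the setup notes):
# with open("dataset/glove.twitter.27B.200d.txt", encoding="utf-8") as f:
#     glove = load_glove(f, expected_dim=200)

# Tiny 3-dimensional stand-in so the sketch runs without the download.
sample_lines = ["good 0.1 0.2 0.3", "bad -0.1 -0.2 -0.3"]
sample_vectors = load_glove(sample_lines, expected_dim=3)
print(sorted(sample_vectors))
```

The dimension check fails early if a file of another size (for example a 100-dimensional one) is used by mistake.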
- Run preprocess.py <raw-csv-path> on both train and test data. This will generate a preprocessed version of the dataset.
- Run stats.py <preprocessed-csv-path>, where <preprocessed-csv-path> is the path of the CSV generated by preprocess.py. This gives general statistical information about the dataset and will generate two pickle files, which are the frequency distributions of unigrams and bigrams in the training dataset.
After the above steps, you should have four files in total: <preprocessed-train-csv>, <preprocessed-test-csv>, <freqdist>, and <freqdist-bi>, which are the preprocessed train dataset, the preprocessed test dataset, the frequency distribution of unigrams, and the frequency distribution of bigrams, respectively.
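To make the two frequency-distribution files more concrete, here is a minimal sketch of counting unigrams and bigrams and pickling the result. It uses collections.Counter as a stand-in and assumes whitespace-tokenized tweets. The function names are hypothetical, and stats.py may build its distributions differently.

```python
import pickle
from collections import Counter


def unigram_counts(tweets):
    """Count single tokens across whitespace-tokenized tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.split())
    return counts


def bigram_counts(tweets):
    """Count adjacent token pairs within each tweet."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


toy_tweets = ["good day", "good good day"]
uni = unigram_counts(toy_tweets)
bi = bigram_counts(toy_tweets)

# Round-trip through pickle in memory; writing to disk would use
# pickle.dump(uni, open_file) with a path of your choice.
restored = pickle.loads(pickle.dumps(uni))
print(restored.most_common(1), bi.most_common(1))
```

Bigrams are counted per tweet, so no pair spans the boundary between two tweets.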
For all the methods that follow, change the values of TRAIN_PROCESSED_FILE, TEST_PROCESSED_FILE, FREQ_DIST_FILE, and BI_FREQ_DIST_FILE to your own paths in the respective files. Wherever applicable, the values of USE_BIGRAMS and FEAT_TYPE can be changed to obtain results using different types of features, as described in the report.
- Run baseline.py. With TRAIN = True it will show the accuracy results on the training dataset.
- Run naivebayes.py. With TRAIN = True it will show the accuracy results on the 10% validation dataset.
- Run logistic.py to run the logistic regression model OR run maxent-nltk.py <> to run the MaxEnt model of NLTK. With TRAIN = True it will show the accuracy results on the 10% validation dataset.
- Run decisiontree.py. With TRAIN = True it will show the accuracy results on the 10% validation dataset.
- Run randomforest.py. With TRAIN = True it will show the accuracy results on the 10% validation dataset.
- Run xgboost.py. With TRAIN = True it will show the accuracy results on the 10% validation dataset.
- Run svm.py. With TRAIN = True it will show the accuracy results on the 10% validation dataset.
- Run neuralnet.py. It will validate using 10% of the data and save the best model to best_mlp_model.h5.
- Run lstm.py. It will validate using 10% of the data and save a model for each epoch in ./models/. (Please make sure this directory exists before running lstm.py.)
- Run cnn.py. This will run the 4-Conv-NN (4 conv layers neural network) model as described in the report. To run other versions of the CNN, just comment out or remove the lines where Conv layers are added. It will validate using 10% of the data and save a model for each epoch in ./models/. (Please make sure this directory exists before running cnn.py.)
- To extract penultimate layer features for the training dataset, run extract-cnn-feats.py <saved-model>. This will generate 3 files: train-feats.npy, train-labels.txt, and test-feats.npy.
- Run cnn-feats-svm.py, which uses files from the previous step to perform SVM classification on features extracted from the CNN model.
- Place all prediction CSV files for which you want to take a majority vote in ./results/ and run majority-voting.py. This will generate majority-voting.csv.
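The idea behind majority voting can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of majority-voting.py: the function name majority_vote is hypothetical, predictions are given as in-memory lists rather than read from CSV files, and ties are broken by picking the smallest label.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote.

    predictions_per_model: list of equal-length lists, one per model.
    Tie-break (an assumption of this sketch): the smallest label among
    the most frequent ones wins.
    """
    if not predictions_per_model:
        return []
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all prediction lists must have the same length")
    voted = []
    # zip(*...) yields, for each sample, the tuple of all models' votes.
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Three hypothetical models predicting labels for three tweets.
model_a = [0, 1, 2]
model_b = [0, 1, 1]
model_c = [1, 1, 2]
print(majority_vote([model_a, model_b, model_c]))  # → [0, 1, 2]
```

Using an odd number of prediction files reduces ties, although with more than two classes a tie is still possible.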
- dataset/positive-words.txt: List of positive words.
- dataset/negative-words.txt: List of negative words.
- dataset/glove-seeds.txt: GloVe word vectors from StanfordNLP which match our dataset for seeding word embeddings.
- Plots.ipynb: IPython notebook used to generate the plots present in the report.