The goal of the assignment is to build a sentiment categorization system for business reviews from annotated data. The input to the code is a set of annotated business reviews from the website Yelp.
The training dataset of business reviews can be downloaded from this link: https://owncloud.iitd.ac.in/nextcloud/index.php/s/2KRyxd9XLFcnpXR. Each data point has a review text and a rating. Ratings are floating-point numbers between 1 and 5.
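To make the data format concrete, here is a minimal sketch of turning raw records into (text, label) pairs. It assumes one JSON object per line and uses the hypothetical key names "review" and "ratings"; neither is confirmed by this README, so adjust both to match the downloaded file.

```python
import json


def load_reviews(lines, text_key="review", rating_key="ratings"):
    """Parse JSON-lines records into (text, label) pairs.

    The key names are assumptions for illustration only.
    """
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rating = float(record[rating_key])
        # Ratings are floats between 1 and 5; round and clamp to an integer class.
        label = min(5, max(1, int(round(rating))))
        examples.append((record[text_key], label))
    return examples


# Two made-up records, used only to show the call.
sample = [
    '{"review": "Great food and friendly staff.", "ratings": 5.0}',
    '{"review": "Cold fries, slow service.", "ratings": 2.0}',
]
print(load_reviews(sample))
```

With a real file, the same function could be called as load_reviews(open("trainfile.json", encoding="utf-8")).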
Build a non-neural classifier that, given a review, predicts its sentiment polarity.
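As one illustration of what a non-neural classifier can look like, the sketch below implements multinomial Naive Bayes over bag-of-words counts using only the standard library. This is not necessarily the model used in this repository (the actual model is selected in variables.py); the class name, tokenizer, and toy training data are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        # Total number of tokens seen per class, used in the denominator.
        self.total_words = {
            c: sum(self.word_counts[c].values()) for c in self.class_counts
        }
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for c in sorted(self.class_counts):
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.class_counts[c] / n_docs)
            denom = self.total_words[c] + self.alpha * vocab_size
            for t in tokens:
                score += math.log((self.word_counts[c][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Toy data, made up for illustration; labels are ratings from 1 to 5.
train_texts = [
    "great food and friendly staff",
    "terrible service and cold food",
    "amazing great experience",
    "awful terrible experience",
]
train_labels = [5, 1, 5, 1]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("great staff"))
print(model.predict("terrible cold food"))
```

Working in log space avoids numeric underflow on long reviews, and the smoothing term keeps a word that was never seen with a given class from driving that class's probability to zero.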
knowledge.py: Contains in-domain stopwords, positive and negative words, etc.
preprocess.py: Methods for preprocessing the raw data.
features.py: Methods for extracting different features from the preprocessed data.
form_matrix.py: Methods to form the feature matrix from words and the features produced by the previous script.
predict.py: Methods for building and training the model given the feature and label matrices.
variables.py: Global tunable variables for selecting the preprocessing steps, features to use, vocabulary-building method, model, and other hyperparameters.
train.py: Main file to run for training the model given the data.
test.py: For making predictions on test data.
compile.sh: Compiles the whole code.
train.sh: Run sh ./train.sh trainfile.json devfile.json model_file to train the model with trainfile.json as training data and devfile.json as validation data. The model will be saved in model_file.
test.sh: Run ./test.sh model_file testfile.json outputfile.txt to test the trained model on testfile.json. A new file outputfile.txt, containing a sequence of numbers between 1 and 5 representing the predictions, will be created.