Sentiment Analysis

This repository contains the code and results from sentiment analysis on movie reviews. The project explores two different approaches to sentiment detection: one using Bag of Words (BoW) and Supervised Learning, and the other using SentiWordNet and Unsupervised Learning (knowledge-based).

Table of Contents

  • File structure
  • Data sources & resources
  • Supervised Learning
  • Unsupervised Learning
  • Conclusions

File structure

The files and folders are organized as follows:

├── data
│   ├── original_data
│   ├── results
│   └── synsets
├── media
├── compare_sup_unsup.ipynb
├── frequencies.py
├── split_dataset.py
├── supervised.ipynb
├── textserver.py
├── ukb_graph.gexf
├── ukb.py
└── unsupervised.ipynb
  • data: Contains the dataset, results, and synset data used in the project.
  • media: Stores images and visualizations used in the README and reports.
  • supervised.ipynb: Main notebook for the supervised learning approach.
  • unsupervised.ipynb: Main notebook for the unsupervised learning approach.
  • compare_sup_unsup.ipynb: Notebook comparing the results of the supervised and unsupervised approaches.
  • ukb.py: Contains the pseudo-implementation of the UKB disambiguation algorithm.
  • Other scripts: Assist with data processing and model evaluation.

Data sources & resources

The data used was the Movie Reviews Corpus, downloaded via nltk, consisting of 1000 positive and 1000 negative movie reviews. We split it into 25% test, 18.75% validation, and 56.25% train.
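A minimal sketch of loading the corpus and reproducing those proportions, assuming scikit-learn's train_test_split (the repository's own split lives in split_dataset.py and may differ in details):

```python
# Load the NLTK movie reviews corpus and split it 56.25/18.75/25.
import nltk
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split

nltk.download("movie_reviews")

fileids = movie_reviews.fileids()
docs = [movie_reviews.raw(fid) for fid in fileids]
labels = [fid.split("/")[0] for fid in fileids]  # "pos" or "neg"

# 25% test first; then 25% of the remaining 75% as validation (18.75% overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)
```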

The supervised models use the scikit-learn library.

For the unsupervised part, SentiWordNet is used, as well as spaCy and our pseudo-implementation of UKB (in the file ukb.py).

Supervised Learning

The supervised learning approach starts by preprocessing the text. We tried three different vectorization methods: a plain CountVectorizer, lemmatizing the text before applying CountVectorizer, and a binary CountVectorizer.
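A sketch of the three variants; the lemmatizer choice here is an assumption (the unsupervised part of the project uses spaCy, so we use it for illustration):

```python
# Three CountVectorizer variants: raw counts, lemmatized counts, binary.
import spacy  # requires: python -m spacy download en_core_web_sm
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(text):
    # Replace every token by its lemma before vectorization.
    return " ".join(tok.lemma_ for tok in nlp(text))

plain_vec = CountVectorizer()                        # raw token counts
lemma_vec = CountVectorizer(preprocessor=lemmatize)  # lemmatize, then count
binary_vec = CountVectorizer(binary=True)            # presence/absence only
```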

The models tried are:

  • GradientBoosting
  • AdaBoost
  • RandomForest
  • LogisticRegression
  • SVC
  • MLPClassifier

A grid search with cross-validation was run to find the best model (by accuracy) with the best parameters. The full search took over 100 minutes.
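A hedged sketch of such a search for one of the models; the actual parameter grids in supervised.ipynb may differ:

```python
# Grid search over a vectorizer + classifier pipeline, scored by accuracy.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vec", CountVectorizer(binary=True)),
    ("clf", RandomForestClassifier(random_state=42)),
])
param_grid = {
    "clf__n_estimators": [500, 1000, 1500],  # illustrative values
    "clf__max_depth": [10, 12, 14],
}
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```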

Results

The best model was a RandomForest with 1500 estimators and max depth 14, using the binary CountVectorizer (0.8613 validation accuracy). On the test partition, the obtained accuracy was 0.854.
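Refitting that configuration and checking it on the held-out test set would look roughly like this (assumes the X_train/X_test splits from above):

```python
# Fit the reported best configuration and evaluate on the test partition.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

best = Pipeline([
    ("vec", CountVectorizer(binary=True)),
    ("clf", RandomForestClassifier(n_estimators=1500, max_depth=14)),
])
best.fit(X_train, y_train)
print(accuracy_score(y_test, best.predict(X_test)))  # README reports 0.854
```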

Unsupervised Learning

The first step for this part was synset disambiguation, so we could later use SentiWordNet. We tried three different ways of performing this task (a sketch of the first one follows the list):

  • Using POS tagging and the Lesk algorithm
  • Using a custom implementation of UKB
  • Using the most frequent synset (based on SemCor)
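A minimal example of the POS tagging + Lesk route using NLTK's built-in lesk; treat it as an illustration rather than the project's exact pipeline:

```python
# Disambiguate each content word with the Lesk algorithm, guided by POS tags.
# Requires the NLTK data packages: punkt, averaged_perceptron_tagger, wordnet.
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

def wn_pos(treebank_tag):
    # Map Penn Treebank tags onto WordNet POS categories.
    return {"J": wn.ADJ, "N": wn.NOUN, "R": wn.ADV, "V": wn.VERB}.get(treebank_tag[0])

tokens = word_tokenize("The plot was surprisingly good")
for word, tag in pos_tag(tokens):
    pos = wn_pos(tag)
    if pos:
        print(word, lesk(tokens, word, pos))
```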

Pseudo-UKB

UKB is a state-of-the-art word sense disambiguator developed by the IXA group of the University of the Basque Country and described in several of their papers.

In our pseudo-implementation, we followed the methods described in their paper, using the networkx library (UKB is based on PageRank). However, in the end we couldn't use this method on the complete dataset, since it was too slow, so we only tested it on a subset of the data.
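The gist of the approach, heavily simplified (ukb.py holds the actual implementation; building the graph from hypernym/hyponym links only and the personalization scheme below are our own illustrative choices):

```python
# Personalized PageRank over a WordNet synset graph, in the spirit of UKB.
import networkx as nx
from nltk.corpus import wordnet as wn

# Nodes are synsets; edges follow hypernym/hyponym relations.
G = nx.Graph()
for syn in wn.all_synsets():
    for related in syn.hypernyms() + syn.hyponyms():
        G.add_edge(syn.name(), related.name())

def disambiguate(context_synsets, candidates):
    """Pick the candidate synset ranked highest by PageRank personalized
    on the synsets of the surrounding context words."""
    personalization = {s: 1.0 for s in context_synsets if s in G}
    ranks = nx.pagerank(G, personalization=personalization or None)
    return max(candidates, key=lambda s: ranks.get(s, 0.0))
```

Running PageRank over the whole WordNet graph for every target word is the kind of cost that made this approach too slow for the full dataset.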

Obtaining a score

With the synsets, we can use SentiWordNet to obtain the polarity scores of each word (positive, negative, and objective). From these scores we can derive other metrics, such as the max score, the difference between pos and neg, or the difference against a threshold. Words can also be filtered by their POS tag, e.g. keeping only nouns, or only adjectives.
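For instance, NLTK exposes SentiWordNet directly, so the three scores of a disambiguated synset can be read off like this:

```python
# SentiWordNet scores for one synset, plus the pos-neg difference metric.
from nltk.corpus import sentiwordnet as swn  # requires the sentiwordnet data

s = swn.senti_synset("good.a.01")
pos, neg, obj = s.pos_score(), s.neg_score(), s.obj_score()
diff = pos - neg  # the "score dif" metric referenced below
print(pos, neg, obj, diff)
```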

Different methods can also be used to turn word scores into a sentence score: sum, mean, max, min, norm-2 mean... To combine sentences into a review score, we decided to simply use the mean.

Finally, given the score of a review, we still need to set a threshold above which we classify the review as positive.
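Putting those pieces together, a review-level decision might look like the following sketch (the function names and the sum-then-mean combination are ours; the notebooks explore many variants):

```python
# Score a review: sum word-level score differences per sentence,
# average over sentences, then threshold the result.
from statistics import mean
from nltk.corpus import sentiwordnet as swn

def word_score(synset_name):
    s = swn.senti_synset(synset_name)
    return s.pos_score() - s.neg_score()

def review_polarity(sentences, threshold=0.0):
    # sentences: non-empty list of lists of disambiguated synset names
    sentence_scores = [sum(word_score(s) for s in sent) for sent in sentences]
    return "pos" if mean(sentence_scores) > threshold else "neg"
```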

The best configuration

We created a dedicated "validation" partition to test all the different ways of calculating the scores. The whole search took 170 minutes; the best-performing configurations are reported below.

Results

The final configuration chosen uses all POS tags except verbs, the pos-neg score difference per word, merging by sum, and a threshold of 0. Its accuracy was 0.644.

Other alternatives (VADER, negation detection...) were also tried to improve the results, but they ended up performing worse.

Conclusions

Even though the unsupervised techniques can produce usable predictions, the supervised approach clearly obtained much better results.

However, with the use of bigger models, embeddings, and architectures like Transformers, these results could be further improved.
