Developed a sentiment analysis model to measure tweet positivity across regions using advanced NLP techniques. This project involved data preprocessing, feature engineering with TF-IDF and Doc2Vec, and training supervised machine learning models. Performance was validated using cross-validation and metrics such as accuracy and precision.


Sentiment Analysis on Social Media Data (Twitter)

Description

Conceptualized and developed a sentiment analysis model to quantify the positivity of tweets across diverse geographic regions. Leveraged advanced Natural Language Processing (NLP) techniques, including count vectorization, TF-IDF, and Doc2Vec, to extract meaningful insights from unstructured text data. This project involved extensive data handling and pre-processing, sophisticated machine learning algorithms, and rigorous model evaluation and validation to ensure robust and reliable performance.

Key Concepts

Data Handling and Pre-processing

  • Data Cleaning: Processed unstructured text data to handle missing values and duplicates, ensuring high-quality input for model training.
  • Feature Engineering: Utilized count vectorization, TF-IDF, and Doc2Vec to create meaningful features from raw text data, enhancing the model's ability to understand sentiment.
  • Data Visualization: Used libraries like Seaborn and Matplotlib to visualize sentiment distribution across regions, helping to identify patterns and trends in the data.
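
The visualization step above might look like the following sketch. The DataFrame, its column names (`region`, `sentiment`), and the output filename are illustrative assumptions, not taken from the repository scripts.

```python
# Sketch: visualizing sentiment distribution across regions with Seaborn.
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the cleaned, labeled tweet dataset
df = pd.DataFrame({
    "region": ["US", "US", "UK", "UK", "KE", "KE"],
    "sentiment": ["positive", "negative", "positive", "positive", "negative", "positive"],
})

# Count tweets per (region, sentiment) pair and plot as grouped bars
ax = sns.countplot(data=df, x="region", hue="sentiment")
ax.set_title("Sentiment distribution by region")
plt.savefig("sentiment_by_region.png")
```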

Machine Learning Algorithms

  • Supervised Learning: Trained the sentiment analysis model using supervised learning techniques on labeled tweet data, focusing on accurately classifying sentiment.
  • Unsupervised Learning: Applied clustering methods to explore patterns in sentiment data, providing additional insights into the data's structure.

Natural Language Processing (NLP)

  • Text Pre-processing: Implemented tokenization, stemming, and lemmatization using NLTK to standardize and clean the text data, making it suitable for analysis.
  • NLP Models: Leveraged advanced models like Doc2Vec for feature extraction, capturing semantic meaning from the text data.
  • Libraries: Utilized NLTK and Gensim for various NLP tasks, ensuring robust and efficient text processing.
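
As a rough illustration of this pre-processing pipeline, the sketch below lowercases tweets, strips URLs and mentions with a regex, and stems tokens with NLTK's PorterStemmer. The actual scripts may use NLTK's own tokenizers and a lemmatizer instead; the regex tokenization here is a simplification.

```python
# Sketch of tweet pre-processing: normalize, strip noise, tokenize, stem.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+|@\w+|#", "", tweet)  # strip URLs, mentions, '#'
    tokens = re.findall(r"[a-z']+", tweet)             # crude word tokenization
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Loving the sunny weather! https://t.co/xyz @friend #happy"))
```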

Model Evaluation and Validation

  • Metrics: Assessed model performance using metrics such as accuracy, precision, recall, and F1 score to ensure a comprehensive evaluation.
  • Cross-Validation: Conducted k-fold cross-validation to validate model stability and robustness, ensuring the model generalizes well to unseen data.
  • A/B Testing: Performed A/B testing to evaluate model changes and improvements, ensuring continuous enhancement of model performance.

Technologies (Tools and Libraries)

  • Python==3.6: Primary programming language used for the project.
  • NLTK==3.4.5: Used for text preprocessing tasks such as tokenization, stemming, and lemmatization.
  • Gensim==3.8.3: Employed for advanced NLP tasks including the implementation of Doc2Vec.
  • Matplotlib==3.2.1: Utilized for data visualization to explore and understand sentiment distributions.
  • Seaborn==0.10.1: Enhanced data visualization capabilities for better presentation of sentiment analysis results.
  • scikit-learn==0.21.3: Used for machine learning model training and evaluation.

Project Breakdown

Part 1: Data Collection and Pre-processing

  • Data Collection: Gathered tweets using the Twitter API, ensuring a diverse dataset across various geographic regions. Also used a sample dataset from Kaggle containing tweets extracted via the Twitter API.
  • Data Cleaning: Processed the raw tweet data to handle missing values, duplicates, and irrelevant content.
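
A minimal sketch of the cleaning step with pandas; the column names and toy rows are assumptions, not the repository's actual schema.

```python
# Sketch: drop missing and duplicate tweets from the raw dataset.
import pandas as pd

raw = pd.DataFrame({
    "text": ["Great day!", "Great day!", None, "RT @user ok"],
    "region": ["US", "US", "UK", "KE"],
})

clean = (
    raw.dropna(subset=["text"])           # drop rows with missing text
       .drop_duplicates(subset=["text"])  # remove exact duplicate tweets
       .reset_index(drop=True)
)
print(len(clean))
```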

Part 2: Feature Engineering

  • Count Vectorization: Transformed text data into numerical vectors using count vectorization.
  • TF-IDF: Applied Term Frequency-Inverse Document Frequency to weigh the importance of words in the dataset.
  • Doc2Vec: Used Doc2Vec to capture the semantic meaning of tweets, enhancing feature representation.

Part 3: Model Training and Tuning

  • Supervised Learning: Trained a sentiment analysis model using labeled data, employing algorithms like logistic regression and support vector machines.
  • Hyperparameter Tuning: Optimized model parameters to improve performance using techniques like grid search.
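
A sketch of the tuning step with scikit-learn's grid search over a logistic-regression model; the parameter grid and synthetic data stand in for the project's actual features and search space.

```python
# Sketch: hyperparameter tuning via grid search with cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # regularization strengths to try
    cv=5,
    scoring="f1",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```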

Part 4: Model Evaluation and Validation

  • Metrics: Evaluated model performance using accuracy, precision, recall, and F1 score.
  • Cross-Validation: Conducted k-fold cross-validation to ensure model robustness and generalizability.
  • A/B Testing: Implemented A/B testing to compare different model versions and select the best-performing model.
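
The evaluation described above might be run as in this minimal sketch, which computes all four metrics under k-fold cross-validation; synthetic data stands in for the real tweet features.

```python
# Sketch: k-fold cross-validation with accuracy, precision, recall, and F1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_validate(
    model, X, y, cv=5,
    scoring=["accuracy", "precision", "recall", "f1"],
)
for metric in ("accuracy", "precision", "recall", "f1"):
    print(metric, round(scores[f"test_{metric}"].mean(), 3))
```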

Getting Started

  1. Clone the Repository
  2. Install Dependencies: Manually install the tools and libraries listed in the Technologies section, at the versions specified there.
  3. Dataset: Download the dataset using the Twitter API or a sample dataset from Kaggle (https://www.kaggle.com/datasets/kazanova/sentiment140) and place it in the designated directory.
  4. Run the Preprocessing Script: Preprocess the tweets using the provided scripts to clean and standardize the data.
  5. Feature Engineering: Execute the feature engineering scripts to transform the text data into numerical features.
  6. Train the Model: Use the training scripts to build and optimize the sentiment analysis model.
  7. Evaluate the Model: Run the evaluation scripts to assess the model performance using various metrics and validation techniques.

Maintainers and Contributors

Maintainer: David Ogalo
Contributors: Contributions are welcome. Please reach out for more information on this project's contribution guidelines.
