Skip to content

The purpose of this project is to build a machine learning model to classify SMS messages as either "spam" or "ham" (not spam). Using TF-IDF vectorization and LinearSVC, it reads an SMS dataset, transforms text data into numerical features, and trains a model to distinguish between spam and ham. The "SMSSpamCollection" dataset has labeled messages.

Notifications You must be signed in to change notification settings

jmarihawkins/sms_spam_detector

Repository files navigation

SMS Spam Dectector Project

Summary

The purpose of this project is to build a machine learning model to classify SMS messages as either "spam" or "ham" (not spam). This is achieved using a pipeline that combines TF-IDF vectorization and Linear Support Vector Classification (LinearSVC). The project involves reading a dataset of SMS messages, transforming the text data into numerical features suitable for machine learning, and training a model to distinguish between spam and ham messages. The dataset, "SMSSpamCollection," contains labeled SMS messages which provide the basis for supervised learning. Each message is labeled as either 'spam' or 'ham,' allowing the model to learn the distinguishing features of each category.

The project follows a systematic approach, beginning with data preprocessing where the raw SMS data is cleaned and prepared for analysis. This includes handling missing values, normalizing text, and splitting the data into training and testing sets. The TF-IDF vectorizer converts the text messages into a matrix of TF-IDF features. This step is critical because it quantifies the importance of words in each message relative to the entire dataset, thus providing meaningful numerical representations of the text.

After preprocessing, a machine learning pipeline is constructed. This pipeline integrates TF-IDF vectorization and LinearSVC, enabling the transformation of text data and training of the classification model in a streamlined manner. The choice of LinearSVC is driven by its efficiency and effectiveness in high-dimensional spaces, common in text data. The model is trained on the training set and evaluated on the testing set to ensure it generalizes well to new, unseen data.

The final component of the project is the creation of a user-friendly interface using Gradio. This interface allows users to input new SMS messages and receive real-time predictions on whether the message is spam or ham. The Gradio interface enhances the accessibility of the model, making it easy to use even for those without a technical background. This project not only demonstrates the process of building a text classification model but also showcases how such a model can be deployed in a practical application for real-world use.

Functionality

  • import pandas as pd

    • Used for data manipulation and analysis.
    • Pandas is needed to load and preprocess the SMS dataset.
  • from sklearn.model_selection import train_test_split

    • Used to split the dataset into training and testing sets.
    • This ensures that the model can be evaluated on unseen data to measure its performance.
  • from sklearn.pipeline import Pipeline

    • Used to create a pipeline that chains together multiple steps in the machine learning process.
    • This makes the code cleaner and more modular.
  • from sklearn.feature_extraction.text import TfidfVectorizer

    • Converts text data into numerical features using Term Frequency-Inverse Document Frequency (TF-IDF).
    • This step is crucial for transforming the raw text messages into a format that can be used by the machine learning model.
  • from sklearn.svm import LinearSVC

    • A Support Vector Classification model that is used for the classification task.
    • LinearSVC is chosen for its effectiveness in text classification problems.
  • pd.set_option('max_colwidth', 200)

    • Sets the maximum column width to display text message data.
    • This is useful for viewing the content of SMS messages without truncation.
  • import gradio as gr

    • Used to create a web interface for the model.
    • Gradio allows users to input new SMS messages and receive predictions in a user-friendly way.
  • def sms_classification(sms_text_df)

    • Main function to perform SMS classification.
    • Takes a DataFrame containing SMS messages and their labels, splits the data into training and testing sets, builds a pipeline with TF-IDF vectorization and LinearSVC, and fits the model to the training data.
  • def predict_sms(text_clf, message)

    • Function to predict if a new SMS message is spam or ham.
    • Uses the trained model to classify the input message and returns the prediction.
  • iface = gr.Interface(fn=predict_sms, inputs="text", outputs="text", title="SMS Spam Classifier", description="Enter an SMS message to classify it as spam or ham.")

    • Creates a Gradio interface for the SMS classification model.
    • Allows users to input an SMS message and receive a classification prediction.

Real-World Application

  1. Customer Support Services

    • Scenario: A customer support center for a telecom company uses this model to filter incoming SMS messages to identify spam. This ensures that only legitimate messages are processed, reducing the load on customer service representatives and improving response times.
  2. Email Filtering Systems

    • Scenario: An email service provider integrates a similar spam detection model to classify and filter spam emails from users' inboxes. This helps in maintaining a clean and spam-free email experience for users.
  3. Social Media Monitoring

    • Scenario: A social media platform uses a text classification model to detect and filter spam messages or posts. This ensures a better user experience by minimizing spam content on the platform.

Conclusion

In conclusion, the SMS Spam Classification project effectively demonstrates the process of building a machine learning model for text classification. By leveraging the power of TF-IDF vectorization and Linear Support Vector Classification, the project showcases a robust approach to handling textual data and transforming it into meaningful insights. The integration of these techniques within a streamlined pipeline not only simplifies the workflow but also ensures consistency and reproducibility of results. The careful selection of the SMS dataset, with its clear labels and real-world relevance, underscores the practical applicability of the model in various domains. Additionally, the deployment of the model via a user-friendly Gradio interface highlights the importance of accessibility and ease of use, making advanced machine learning techniques available to a broader audience. Two specific findings from this project are:

  1. The Role of Feature Engineering: The choice and transformation of features, particularly through the use of TF-IDF vectorization, significantly impact the performance of the machine learning model. Effective feature engineering enables the model to capture the underlying patterns in the text data, improving its ability to accurately classify messages as spam or ham. This step highlights the critical role that thoughtful feature selection and transformation play in the success of text classification tasks.

  2. The Effectiveness of LinearSVC for Text Classification: LinearSVC has proven to be a robust and efficient algorithm for text classification tasks. Its ability to handle high-dimensional data and provide accurate predictions makes it a suitable choice for the SMS spam classification task. The combination of TF-IDF vectorization and LinearSVC in the pipeline has resulted in a model that performs well in identifying spam messages, making it a valuable tool for automated text classification applications.

About

The purpose of this project is to build a machine learning model to classify SMS messages as either "spam" or "ham" (not spam). Using TF-IDF vectorization and LinearSVC, it reads an SMS dataset, transforms text data into numerical features, and trains a model to distinguish between spam and ham. The "SMSSpamCollection" dataset has labeled messages.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published