Automatic Speech Recognition using Conformer with Speech Sentiment Analysis & Text Summarizer

LuluW8071/ASR-with-Speech-Sentiment-and-Text-Summarizer

ASR-with-Speech-Sentiment-&-Text-Summarizer

Introduction

This project aims to develop a system that integrates Automatic Speech Recognition (ASR), Speech Emotion Recognition (SER), and a Text Summarizer. The system addresses three challenges: accurate speech recognition across diverse accents and noisy environments, real-time interpretation of emotional tone (sentiment analysis), and summaries that retain essential information. It targets applications such as customer service, business meetings, media, and education, where it can improve documentation, understanding, and emotional context in communication.

Intermediate Goals

  • Baseline Model for ASR: CNN-BiLSTM
  • Baseline Model for SER: XGBoost
  • Baseline Model for Text Summarizer: T5-Small, T5-Base
  • Final Model for ASR: Conformer
  • Final Model for SER
  • Final Model for Text Summarizer: BART Large

Goals

  • Accurate ASR System: Handle diverse accents and operate effectively in noisy environments
  • Emotion Analysis: Infer emotion from the tone of speech
  • Meaningful Text Summarizer: Preserve critical information without loss
  • Integrated System: Combine all components to provide real-time transcription and summaries

Contributors

Project Architecture

1. ASR (Automatic Speech Recognition)

Base Model (CNN-BiLSTM) | Final Model
[Figure: Base Model] | [Figure: Final Model]

2. SER (Speech Emotion Recognition)

Base Model (XGBoost) | Final Model
[Figure: Base Model] | Code in Progress

3. Text Summarizer

Base Model (T5-Small, T5-Base) | Final Model
[Figure: Base Model] | Code in Progress

High Level Next Steps

Usage

Clone the Repository

Important

To clone the repository with its sub-modules, enter the following command:

git clone --recursive https://github.com/LuluW8071/ASR-with-Speech-Sentiment-and-Text-Summarizer.git

1. Install Required Dependencies

Important

For inference only, you do not need the CUDA Toolkit or the CUDA build of PyTorch; the CPU build of PyTorch is sufficient.
Before installing dependencies from requirements.txt, make sure you have installed:

  • CUDA Toolkit v11.8 or v12.1
  • PyTorch
  • SoX
    • For Linux:
      sudo apt update
      sudo apt install sox libsox-fmt-all build-essential zlib1g-dev libbz2-dev liblzma-dev
      
      # Verify installation
      sox --version

Then install the Python dependencies:

pip install -r requirements.txt
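
Before running the scripts, it can help to confirm that the prerequisites listed above are visible to Python. The following is a minimal sketch, not part of the repository: the helper name `check_prerequisites` is hypothetical, and it only checks that the `torch` module can be located and that a `sox` executable is on the PATH. It does not verify versions or CUDA support.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("torch",), tools=("sox",)):
    """Report which Python modules are importable and which CLI tools are on PATH.

    Hypothetical helper for illustration; the default names mirror the
    prerequisites listed in this README.
    """
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[name] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as MISSING should be installed before running pip install -r requirements.txt.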

2. Configure Comet-ML Integration

Note

Replace dummy_key with your actual Comet-ML API key and project name in the .env file to enable real-time loss curve plotting, system metrics tracking, and confusion matrix visualization.

API_KEY = "dummy_key"
PROJECT_NAME = "dummy_key"
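
A small check can catch a .env file that still contains the placeholder before a long training run starts. The sketch below is an assumption about how such a check might look, not the project's own loading code: `parse_env` and `missing_comet_settings` are hypothetical helpers, and the parser only handles simple KEY = "value" lines like the two shown above.

```python
from pathlib import Path

PLACEHOLDER = "dummy_key"  # placeholder value used in the sample .env


def parse_env(text):
    """Parse simple KEY = "value" lines into a dict, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_comet_settings(values, required=("API_KEY", "PROJECT_NAME")):
    """Return the required keys that are absent, empty, or still the placeholder."""
    return [key for key in required if values.get(key, "") in ("", PLACEHOLDER)]


if __name__ == "__main__":
    env_file = Path(".env")
    text = env_file.read_text() if env_file.exists() else ""
    missing = missing_comet_settings(parse_env(text))
    if missing:
        print(f"Please set: {', '.join(missing)}")
    else:
        print("Comet-ML settings look configured")
```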

Usage Instructions

ASR (Automatic Speech Recognition)

1. Audio Conversion

Note

Pass --not-convert to skip audio conversion. --output_format takes either wav or flac.

py common_voice.py --file_path file_path/to/validated.tsv \
                   --save_json_path file_path/to/save/json \
                   -w 4 \
                   --percent 10 \
                   --output_format wav/flac
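
To illustrate the kind of output this step produces, here is a rough sketch of turning Common Voice rows into train and test JSON records. It is not the code in common_voice.py. It assumes that --percent is the share of clips held out for testing, and the record layout with "key" and "text" fields is hypothetical; only the `path` and `sentence` column names come from the Common Voice validated.tsv format. Audio conversion is left out.

```python
import csv
import io
import json
import random


def split_common_voice(tsv_text, percent=10, seed=0):
    """Split Common Voice rows into (train, test) lists of {"key", "text"} records.

    `percent` is assumed to be the share of clips held out for testing.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    records = [{"key": row["path"], "text": row["sentence"]} for row in reader]
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(records)
    n_test = len(records) * percent // 100
    return records[n_test:], records[:n_test]


if __name__ == "__main__":
    sample = (
        "client_id\tpath\tsentence\n"
        "a1\tclip_0.mp3\tHello world\n"
        "a2\tclip_1.mp3\tGood morning\n"
    )
    train, test = split_common_voice(sample, percent=50)
    print(json.dumps({"train": train, "test": test}, indent=2))
```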

2. Train Model

Note

Pass --checkpoint_path path/to/checkpoint_file to load a pre-trained model and fine-tune it.

py train.py --train_json path/to/train.json \
            --valid_json path/to/test.json \
            -w 4 \
            --batch_size 128 \
            -lr 2e-4 \
            --epochs 20

3. Sentence Extraction

py extract_sentence.py --file_path file_path/to/validated.tsv \
                       --save_txt_path file_path/to/save/json

Speech Sentiment

1. Audio Downsample and Augment

Note

Run Speech_Sentiment.ipynb first to produce the CSV table of audio paths and emotion labels, then downsample all clips.

py downsample.py --file_path path/to/audio_file.csv \
                 --save_csv_path output/path \
                 -w 4 \
                 --output_format wav/flac

py augment.py --file_path "path/to/emotion_dataset.csv" \
              --save_csv_path "output/path" \
              -w 4 \
              --percent 20

2. Train the Model

py neuralnet/train.py --train_csv "path/to/train.csv" \
                      --test_csv "path/to/test.csv" \
                      -w 4 \
                      --batch_size 256 \
                      --epochs 25 \
                      -lr 1e-3

Text Summarization

Note

Run the notebook in the src/Text_Summarizer directory. You may need a 🤗 Hugging Face token with write permission to upload your trained model directly to the 🤗 HF Hub.

Data Source

Project | Dataset Source
ASR | Mozilla Common Voice
SER | RAVDESS, CREMA-D, TESS, SAVEE
Text Summarizer | XSum, BillSum

Code Structure

The code styling adheres to autopep8 formatting.

Results

Project | Base Model Link | Final Model Link
ASR | CNN-BiLSTM | Conformer
SER | XGBoost | Train in Progress
Text Summarizer | T5 Small-FineTune, T5 Base-FineTune | BART

Metrics Used

Project | Metrics Used
ASR | WER, CER
SER | Accuracy, F1-Score, Precision, Recall
Text Summarizer | ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum, Gen Len
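
For reference, the two ASR metrics are both edit-distance ratios: WER counts word-level substitutions, insertions, and deletions divided by the number of reference words, and CER does the same at the character level. The sketch below is a plain-Python illustration of those definitions, not the evaluation code used in this project, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    print(cer("abc", "abd"))
```

Both values can exceed 1.0 when the hypothesis contains many insertions, so they are error rates rather than bounded percentages.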

Loss Curve Evaluation

Project | Base Model | Final Model Link
ASR | CNN-BiLSTM | Train in Progress
Speech Sentiment | XGBoost | Train in Progress
Text Summarizer | T5 Base Model Loss | Train in Progress

Evaluation Metrics Results

Project | Base Model | Final Model Link
ASR | CNN-BiLSTM | Train in Progress
Speech Sentiment | XGBoost | Train in Progress
Text Summarizer | T5 Base Model Metrics | Train in Progress