This project aims to develop a system that integrates Automatic Speech Recognition (ASR), Speech Emotion Recognition (SER), and a Text Summarizer. The system addresses three challenges: recognizing speech accurately across diverse accents and in noisy environments, interpreting emotional tone (sentiment analysis) in real time, and generating summaries that retain the essential information. It targets applications such as customer service, business meetings, media, and education, where it can improve documentation, understanding, and awareness of emotional context in communication.
- Baseline Model for ASR: CNN-BiLSTM
- Baseline Model for SER: XGBoost
- Baseline Model for Text Summarizer: T5-Small, T5-Base
- Final Model for ASR: Conformer
- Final Model for SER
- Final Model for Text Summarizer: BART Large
- Accurate ASR System: Handle diverse accents and operate effectively in noisy environments
- Emotion Analysis: Identify the speaker's emotion from the tone of speech
- Meaningful Text Summarizer: Preserve critical information without loss
- Integrated System: Combine all components to provide real-time transcription and summaries
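The integration goal above can be pictured as a thin wrapper that passes one audio file through the three components in turn. The following is only a minimal sketch under stated assumptions: `PipelineResult`, `run_pipeline`, and the three callables are hypothetical names introduced for illustration and are not taken from this repository, and the stand-in lambdas replace the trained models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    """Outputs of the three stages for one audio file (hypothetical container)."""
    transcript: str
    emotion: str
    summary: str


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    classify_emotion: Callable[[str], str],
    summarize: Callable[[str], str],
) -> PipelineResult:
    """Run ASR and SER on the audio, then summarize the transcript."""
    transcript = transcribe(audio_path)      # ASR: audio -> text
    emotion = classify_emotion(audio_path)   # SER: audio -> emotion label
    summary = summarize(transcript)          # summarizer: text -> shorter text
    return PipelineResult(transcript, emotion, summary)


# Stand-in callables so the sketch runs without any trained model.
result = run_pipeline(
    "meeting.wav",  # hypothetical file name; the stand-ins never open it
    transcribe=lambda path: "the quarterly numbers look good and sales are up",
    classify_emotion=lambda path: "happy",
    summarize=lambda text: " ".join(text.split()[:5]),
)
print(result)
```

Passing the components in as callables keeps the wrapper independent of any particular model, so a baseline could be swapped for a final model without touching the pipeline code.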
Base Model (CNN-BiLSTM) | Final Model |
---|---|

Base Model (XGBoost) | Final Model |
---|---|

Base Model (T5-Small, T5-Base) | Final Model |
---|---|
Important
To clone the repository with its sub-modules, enter the following command:
git clone --recursive https://github.com/LuluW8071/ASR-with-Speech-Sentiment-and-Text-Summarizer.git
Important
Before installing dependencies from requirements.txt, make sure you have installed the following. The CUDA Toolkit and the CUDA build of PyTorch are not needed for inference, but the CPU build of PyTorch is still required.
- CUDA Toolkit v11.8/12.1
- PyTorch
- SOX
  - For Linux:
sudo apt update
sudo apt install sox libsox-fmt-all build-essential zlib1g-dev libbz2-dev liblzma-dev
# Verify installation
sox --version
1. Install Dependencies
pip install -r requirements.txt
2. Configure Comet-ML Integration
Note
Replace dummy_key with your actual Comet-ML API key and project name in the .env file to enable real-time loss curve plotting, system metrics tracking, and confusion matrix visualization.
API_KEY = "dummy_key"
PROJECT_NAME = "dummy_key"
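Before starting a long training run, it can help to confirm that the placeholders were actually replaced. The sketch below is an illustration only: `parse_env` and `missing_settings` are hypothetical helpers, not part of this repository, and the parser handles just the simple `KEY = "value"` form shown above rather than the full .env syntax.

```python
from typing import Dict, List, Sequence


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY = "value" lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(
    values: Dict[str, str],
    required: Sequence[str] = ("API_KEY", "PROJECT_NAME"),
    placeholder: str = "dummy_key",
) -> List[str]:
    """Return the required keys that are absent, empty, or still the placeholder."""
    return [name for name in required if values.get(name, "") in ("", placeholder)]


# Example with inline text; in practice the text would be read from the .env file.
sample = 'API_KEY = "dummy_key"\nPROJECT_NAME = "speech-project"\n'
settings = parse_env(sample)
print(missing_settings(settings))  # ['API_KEY']
```

Here only `API_KEY` is reported, because `PROJECT_NAME` has been changed from the placeholder in the sample text.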
Note
Add --not-convert if you don't want audio conversion.
py common_voice.py --file_path file_path/to/validated.tsv \
    --save_json_path file_path/to/save/json \
    -w 4 \
    --percent 10 \
    --output_format wav/flac
Note
Add --checkpoint_path path/to/checkpoint_file to load a pre-trained model and fine-tune it.
py train.py --train_json path/to/train.json \
    --valid_json path/to/test.json \
    -w 4 \
    --batch_size 128 \
    -lr 2e-4 \
    --epochs 20
py extract_sentence.py --file_path file_path/to/validated.tsv \
    --save_txt_path file_path/to/save/json
Note
Run Speech_Sentiment.ipynb first to generate a CSV table of audio paths and emotion labels, then downsample all clips.
py downsample.py --file_path path/to/audio_file.csv \
    --save_csv_path output/path \
    -w 4 \
    --output_format wav/flac
py augment.py --file_path "path/to/emotion_dataset.csv" \
    --save_csv_path "output/path" \
    -w 4 \
    --percent 20
py neuralnet/train.py --train_csv "path/to/train.csv" \
    --test_csv "path/to/test.csv" \
    -w 4 \
    --batch_size 256 \
    --epochs 25 \
    -lr 1e-3
Note
Just run the notebook in the src/Text_Summarizer directory.
You may need a 🤗 Hugging Face token with write permission to upload your trained model directly to the 🤗 HF Hub.
Project | Dataset Source |
---|---|
ASR | Mozilla Common Voice |
SER | RAVDESS, CREMA-D, TESS, SAVEE |
Text Summarizer | XSum, BillSum |
The code styling adheres to autopep8 formatting.
Project | Base Model Link | Final Model Link |
---|---|---|
ASR | CNN-BiLSTM | Conformer |
SER | XGBoost | |
Text Summarizer | T5 Small-FineTune, T5 Base-FineTune | BART |
Project | Metrics Used |
---|---|
ASR | WER, CER |
SER | Accuracy, F1-Score, Precision, Recall |
Text Summarizer | ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum, Gen Len (generated length) |
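For readers unfamiliar with the ASR metrics in the table, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The following is a minimal pure-Python sketch of those definitions; the function names are illustrative, and the repository's training code may compute these metrics differently, for example through a library.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra item in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {wer(reference, hypothesis):.3f}")  # one substitution + one deletion over 6 words
print(f"CER: {cer(reference, hypothesis):.3f}")
```

Because the distance is divided by the reference length, both metrics can exceed 1.0 when the hypothesis contains many insertions.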
Project | Base Model | Final Model Link |
---|---|---|
ASR | CNN-BiLSTM | |
Speech Sentiment | XGBoost | |
Text Summarizer | | |