Neural machine translation chatbot

Applying neural machine translation for the purposes of creating an artificial conversational entity.

EDIT: This project stands as a testament to the significant advancements in the field of machine learning, particularly with the development of large language models such as OpenAI's GPT series (e.g., GPT-3.5, GPT-4), Google's Gemini, and Meta's LLaMA. While these models address similar challenges, this repository offers a unique approach using different techniques.

The goal of this project is to explore the feasibility of creating artificial conversational agents, or chatbots, utilizing novel sequence-to-sequence methods inspired by progress in natural language processing and neural machine translation (NMT).

Dataset

The NMT model is trained on a dataset of Reddit comments, accessible through the provided link, which contains every publicly available comment posted to the platform since 2005.
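As an illustration of how comment-response pairs might be extracted from such a dump, here is a minimal sketch. It assumes the dump is newline-delimited JSON whose objects carry `id`, `parent_id`, and `body` fields, as in the public Reddit comment exports; the file name, the pairing logic, and the 10-word cutoff (matching the model described below) are illustrative only, not the repository's actual preprocessing.

```python
import json

def build_pairs(dump_path, max_words=10):
    """Pair each comment with a direct reply to form (input, target) training examples."""
    bodies = {}  # comment fullname ("t1_<id>") -> comment body
    pairs = []
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            comment = json.loads(line)
            body = comment.get("body", "").strip()
            if not body or body in ("[deleted]", "[removed]"):
                continue
            # Remember this comment so later replies can look it up by parent_id.
            bodies["t1_" + comment["id"]] = body
            parent_body = bodies.get(comment.get("parent_id", ""))
            if parent_body and len(body.split()) <= max_words \
                    and len(parent_body.split()) <= max_words:
                pairs.append((parent_body, body))
    return pairs
```

For the full multi-terabyte dump, the in-memory dictionary above would have to be replaced by a database lookup, but the pairing idea is the same.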

Hypothesis

The hypothesis driving this project is that by feeding comment-response pairs to the NMT, it will learn to associate similar responses with their corresponding comments. With a sufficiently large dataset and adequate computational resources, the model is expected to generate coherent responses to any given input.

Functional Description

  • Dataframe Creation: Generate the dataframe upon request, pickle its tensor representation, and save the tokenizers for future use; otherwise, load the previously pickled tensors and tokenizers.
  • Training Parameter Initialization: Set the training parameters: the number of epochs, buffer size, batch size, embedding dimension, and number of hidden units for both the encoder and the decoder; also determine the vocabulary sizes of the input and output corpora (see the first sketch after this list).
  • Tensor Batching: Create batches of tensors from the dataset for training (also covered in the first sketch).
  • Model Components Initialization: Initialize the encoder, decoder, and optimizer (see the second sketch after this list).
  • Checkpoint System: Save model states during training, allowing recovery after interruptions and comparison between model iterations (also covered in the second sketch).
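As a concrete illustration of the parameter-initialization and batching steps, here is a minimal TensorFlow sketch. The pickle file names are assumptions about how the dataframe-creation step stores its output; the hyperparameter values are those of the model described under Neural Network Model below.

```python
import pickle
import tensorflow as tf

# Load the pickled tensor representations and the saved tokenizers produced by
# the dataframe-creation step (file names here are illustrative).
with open("input_tensor.pkl", "rb") as f:
    input_tensor = pickle.load(f)
with open("target_tensor.pkl", "rb") as f:
    target_tensor = pickle.load(f)
with open("input_tokenizer.pkl", "rb") as f:
    input_tokenizer = pickle.load(f)
with open("target_tokenizer.pkl", "rb") as f:
    target_tokenizer = pickle.load(f)

# Training parameters (the values match the model shipped in this repository).
EPOCHS = 30
BATCH_SIZE = 10
EMBEDDING_DIM = 256
UNITS = 256  # hidden recurrent units for both encoder and decoder

BUFFER_SIZE = len(input_tensor)
steps_per_epoch = len(input_tensor) // BATCH_SIZE
vocab_inp_size = len(input_tokenizer.word_index) + 1  # +1 for the padding index 0
vocab_tar_size = len(target_tokenizer.word_index) + 1

# Shuffle the examples and cut them into fixed-size training batches.
dataset = tf.data.Dataset.from_tensor_slices((input_tensor, target_tensor))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
```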
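And a minimal sketch of the model-component initialization and the checkpoint system, assuming a plain GRU encoder-decoder; the repository's actual architecture may differ (it may use attention, for instance), and the vocabulary sizes here are placeholders.

```python
import tensorflow as tf

EMBEDDING_DIM, UNITS = 256, 256
VOCAB_INP, VOCAB_TAR = 20000, 20000  # illustrative vocabulary sizes

class Encoder(tf.keras.Model):
    """Embeds source tokens and encodes them with a GRU."""
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)

    def call(self, x, hidden):
        output, state = self.gru(self.embedding(x), initial_state=hidden)
        return output, state

class Decoder(tf.keras.Model):
    """Predicts the next target token from the previous token and the hidden state."""
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, x, hidden):
        output, state = self.gru(self.embedding(x), initial_state=hidden)
        return self.fc(tf.squeeze(output, axis=1)), state  # logits: (batch, vocab)

encoder = Encoder(VOCAB_INP, EMBEDDING_DIM, UNITS)
decoder = Decoder(VOCAB_TAR, EMBEDDING_DIM, UNITS)
optimizer = tf.keras.optimizers.Adam()

# Checkpoint system: save encoder, decoder, and optimizer state during training
# so runs can resume after an interruption and iterations can be compared.
checkpoint = tf.train.Checkpoint(optimizer=optimizer, encoder=encoder, decoder=decoder)
manager = tf.train.CheckpointManager(checkpoint, "./training_checkpoints", max_to_keep=5)
manager.save()                                  # call once per epoch in the training loop
checkpoint.restore(manager.latest_checkpoint)   # recover the newest saved state
```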

The project is extensively commented, so any additional information can be found in the code itself.

Neural Network Model

The repository contains a model trained with the following characteristics:

  • number of epochs = 30
  • batch size = 10
  • embedding dimension = 256
  • number of hidden recurrent units = 256
  • optimizer = Adam
  • maximum sentence length = 10 words
  • final loss = 0.9

Note: More sophisticated models were developed, but they were too large to host in this GitHub repository.
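As a rough illustration of how the shipped checkpoint could be queried, here is a hedged greedy-decoding sketch. It reuses the Encoder and Decoder classes and the tokenizers from the sketches above, and it assumes target sentences were delimited with `<start>` and `<end>` markers, which may differ from the repository's actual preprocessing.

```python
import tensorflow as tf

MAX_LEN = 10  # maximum sentence length the model was trained with

def reply(sentence, encoder, decoder, inp_tok, tar_tok):
    """Greedy decoding: encode the comment, then emit one response word at a time."""
    seq = inp_tok.texts_to_sequences([sentence])
    seq = tf.keras.preprocessing.sequence.pad_sequences(seq, maxlen=MAX_LEN, padding="post")
    hidden = tf.zeros((1, encoder.units))
    _, state = encoder(tf.constant(seq), hidden)

    # Start from the assumed <start> marker and feed each prediction back in.
    dec_input = tf.expand_dims([tar_tok.word_index["<start>"]], 0)
    words = []
    for _ in range(MAX_LEN):
        logits, state = decoder(dec_input, state)
        pred = int(tf.argmax(logits[0]))
        word = tar_tok.index_word.get(pred, "")
        if word == "<end>":
            break
        words.append(word)
        dec_input = tf.expand_dims([pred], 0)
    return " ".join(words)
```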

Results

Sample conversations produced by the model are shown in the result screenshots (results 1, 2, and 3) included in the repository.
