Rababa, the diacritization library for Arabic and Hebrew (Abjad scripts in general)

interscript/rababa

رُبابَة RABABA the Middle-Eastern Language Diacritization Library

Diacritization of Middle-Eastern languages is useful in several practical business cases, such as text-to-speech and the Romanization of texts or scripts.

As of now, this library supports Hebrew and Arabic.

Purpose

This repository contains everything needed to train a diacritization model in Python and to run it in Python or Ruby.

Try out Rababa

Rababa can be run in both Python and Ruby. Go to the directory corresponding to the language you prefer to use.

Please see the “Try out Rababa” section of the README in that directory.

Library

This library was built for the Interscript project (on GitHub).

The diacritization strategy follows several steps, with a deep learning model at its heart:

  1. text preprocessing

  2. neural network model prediction

  3. text postprocessing
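The three steps above can be sketched in Python as follows. This is only a minimal illustration, not the library's actual API: the function names (`preprocess`, `predict`, `postprocess`, `diacritize`) and the `toy_model` stand-in are hypothetical. The sketch also assumes the model assigns one diacritic string (possibly empty) to each input character, and it labels each character in isolation, whereas a real neural network model would use the whole sequence as context.

```python
# Arabic short vowels, tanwin, shadda and sukun: U+064B..U+0652.
ARABIC_DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}
FATHA = "\u064e"


def preprocess(text):
    """Step 1: strip any existing diacritics so the model sees bare letters."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def predict(chars, model):
    """Step 2: ask the model for one diacritic string per input character."""
    return [model(ch) for ch in chars]


def postprocess(chars, diacritics):
    """Step 3: interleave each character with its predicted diacritic."""
    return "".join(ch + mark for ch, mark in zip(chars, diacritics))


def diacritize(text, model):
    """Run the three steps in order."""
    bare = preprocess(text)
    labels = predict(bare, model)
    return postprocess(bare, labels)


def toy_model(ch):
    """Hypothetical stand-in for the neural network: puts a fatha on every
    Arabic letter (U+0621..U+064A) and nothing on any other character."""
    return FATHA if "\u0621" <= ch <= "\u064a" else ""


if __name__ == "__main__":
    print(diacritize("\u0643\u062a\u0628", toy_model))
```

Keeping the steps as separate functions means the stand-in can be replaced by a trained model without touching the string handling around it.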

This repository contains:

  • lib is the Ruby library, which uses the neural network model in ONNX format.

  • docs contains an application-focused summary of the latest (2021-06) relevant papers and solutions.

  • python

    • A neural network solution for automated diacritization based on the work of almodhfer, from which we adopted only the baseline model and the more advanced and efficient CBHG models. This very recent solution allows efficient predictions on CPUs with a reasonably sized model.

    • Conversion of PyTorch models to ONNX format

    • String pre-/post-processing, also from almodhfer

  • tests and benchmarking utilities, allowing comparison with other implementations.

  • models-data directory to store models and embeddings in various formats

About the name

A Rababa is an antique string instrument.

In the same way that a Rababa produces melody from simple strings and pieces of wood, our library's diacritization gives a whole palette of colour and meaning to Arabic scripts.

Under development

We are working on the following improvements:

  • Enhancing architecture and encoding

  • Enhancing datasets to improve the models