SMILES Transformer

A SMILES-to-SMILES Transformer implementation

Input data

Any set of SMILES or pairs of SMILES strings in CSV-like format can be used.
A cleaned version of ChEMBL 30 is provided in data/chembl_30.
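
As a minimal illustration (assuming a single SMILES column named SMILES, matching the --src_smiles_col/--tgt_smiles_col flags used in the training example below), a toy input file could be created with pandas:

import pandas as pd

# Toy single-column input: one SMILES string per row.
# The column name is arbitrary and is passed to main.py via
# --src_smiles_col / --tgt_smiles_col.
df = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]})
df.to_csv("toy_smiles.csv.gz", index=False, compression="gzip")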

Generating the token alphabet

The token alphabet can be generated from the ChEMBL training set with:

cd SmilesTransformer/tokenizer
python tokenizer.py
# Tokens are saved in alphabet.dat

A precomputed set can be found here.
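
For illustration, here is a minimal sketch of regex-based SMILES tokenization using the Molecular Transformer pattern mentioned in the Credits; the exact alphabet-building logic in tokenizer.py may differ:

import re

# SMILES tokenization pattern from the Molecular Transformer (see Credits).
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str) -> list:
    # Split a SMILES string into its tokens
    return SMILES_REGEX.findall(smiles)

# The alphabet is then simply the set of unique tokens over the corpus.
alphabet = sorted({tok for smi in ["CCO", "c1ccccc1Br"] for tok in tokenize(smi)})
print(alphabet)  # ['1', 'Br', 'C', 'O', 'c']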

Model configuration

The transformer can be instantiated from a JSON configuration file (e.g. config.json):

{
  "n_src_vocab": 44,
  "n_tgt_vocab": 44,
  "len_max_seq": 100,
  "d_word_vec": 512,
  "d_model": 512,
  "d_inner": 2048,
  "n_layers": 6,
  "n_head": 8,
  "d_k": 64,
  "d_v": 64,
  "dropout": 0.1,
  "tgt_emb_prj_weight_sharing": true,
  "emb_src_tgt_weight_sharing": true
}

In the example provided, the vocabulary sizes (n_src_vocab and n_tgt_vocab) correspond to the number of tokens in the ChEMBL 30 training set alphabet.
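
As a sketch, the configuration could be loaded and unpacked directly into the model constructor; the import path below is hypothetical and should be adjusted to wherever the Transformer class lives in this repository:

import json

# Hypothetical import path; adjust to the actual module layout.
from SmilesTransformer.transformer.models import Transformer

with open("config.json") as fh:
    config = json.load(fh)

# The JSON keys mirror the constructor arguments of the vanilla Transformer
# implementation credited below, so they can be unpacked as keyword
# arguments (assumption).
model = Transformer(**config)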

Training the model

You can train the model on subsamples of the ChEMBL 30 training and validation sets (1000 and 50 molecules, respectively) with:

python SmilesTransformer/main.py \
    -c config.json \
    --train_path data/chembl_30/chembl_30_chemreps_proc_train.csv.gz \
    --val_path data/chembl_30/chembl_30_chemreps_proc_valid.csv.gz \
    --alphabet_path SmilesTransformer/tokenizer/alphabet.dat \
    --sample_train 1000 \
    --sample_val 50 \
    --train_batch_size 64 \
    --val_batch_size 64 \
    --src_smiles_col SMILES \
    --tgt_smiles_col SMILES \
    --num_epochs 10 --augment 1 --checkpoint_folder .

This training set size is far too small to achieve meaningful performance, but it demonstrates the basic functioning of the model.

In this case we are training the transformer to reconstruct the original SMILES strings, but this can be trivially adapted to predict different target SMILES strings by providing training and validation CSV files with pairs of molecules (see the sketch below).
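
For example, a toy pairs file could look like this (the column names here are arbitrary; pass them to main.py via --src_smiles_col and --tgt_smiles_col):

import pandas as pd

# Toy source/target pairs for a SMILES-to-SMILES task (illustrative values).
pairs = pd.DataFrame({
    "src_smiles": ["CCO", "c1ccccc1"],
    "tgt_smiles": ["CC(=O)O", "Oc1ccccc1"],
})
pairs.to_csv("pairs_train.csv.gz", index=False, compression="gzip")

# Then train with:
#   --src_smiles_col src_smiles --tgt_smiles_col tgt_smiles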

Credits

This repository uses the vanilla transformer implementation by siat-nlp.
The SMILES tokenization regex pattern is from the Molecular Transformer.

References

[1] Vaswani et al., Attention Is All You Need, NIPS 2017.

[2] A PyTorch implementation: attention-is-all-you-need-pytorch.
