Non-conditional De Novo molecular Generation with Transformer Encoders

Esbenbjerrum/ May 13, 2021/ Blog, Cheminformatics, Machine Learning, Machine Learning and Chemoinformatics, Molecular Generation, Neural Network, PyTorch, SMILES enumeration/ 1 comments

We’ve known since 2016 that LSTM networks can be used to generate novel and valid SMILES strings of novel molecules after being trained on a dataset of existing SMILES strings Se previous blog-post. The SMILES strings generated corresponds to molecules with similar properties as the ones the networks were trained on. A lot of techniques has since been made to further control the generation, transfer-learning, reinforcement learning, conditional generation etc.
However cool LSTMs are, they seem to be giving way to the transformer architecture in NLP. But transformers are not only about transforming things, the attention based encoder and decoder blocks have been used in auto-regressive models for text generation such as the GPT style transformers and in bidirectional encoders such as the BERT style transformers. In this blog-post I’ll showcase a simple example of an auto-regressive way the transformer encoder can be used to generate new SMILES strings.
First an import of some of the usual suspects.

%load_ext tensorboard
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys, os
    '3.7.10 (default, Feb 20 2021, 21:17:23) \n[GCC 7.5.0]'

Then we install RDKit via the kora package. It installs RDKit by downloading and unpacking a pre-compiled tar ball, so it’s super fast to do as I’m running on a fresh instance in Google colab.

!pip install kora -q
import kora.install.rdkit
    [K     |████████████████████████████████| 61kB 3.7MB/s 
    [K     |████████████████████████████████| 61kB 7.8MB/s 

Now that RDKit is installed we can import the needed RDKit modules.

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit.Chem import Descriptors

Import of torch and PyTorch lightning

!pip -q install pytorch-lightning
import torch
from import Dataset, DataLoader
from pytorch_lightning import LightningDataModule, LightningModule
    [K     |████████████████████████████████| 849kB 5.1MB/s 
    [K     |████████████████████████████████| 184kB 10.6MB/s 
    [K     |████████████████████████████████| 829kB 14.9MB/s 
    [K     |████████████████████████████████| 112kB 21.6MB/s 
    [K     |████████████████████████████████| 276kB 17.6MB/s 
    [K     |████████████████████████████████| 1.3MB 23.8MB/s 
    [K     |████████████████████████████████| 143kB 35.0MB/s 
    [K     |████████████████████████████████| 296kB 34.2MB/s 
    [?25h  Building wheel for future ( ... [?25l[?25hdone
      Building wheel for PyYAML ( ... [?25l[?25hdone

Creating the tokenizer

To convert the SMILES strings into tensors for the deep neural network, we’ll code a simple character based tokenizer. Other tokenization schemes are possible as well, such as treating atoms as single tokens (e.g. Cl and Br), but this works reasonable good. For a bigger project that includes on-the-fly SMILES augmentation, take a look at The molvecgen project is however targeted at Keras and could use an update to provide tensors in the PyTorch way.

# A simple tokenizer
class SimpleTokenizer():
  def __init__(self):
    self.start = "^"
    self.end = "$"
    self.unknown = "?"
    self.pad = " "
  def create_vocabulary(self, smiles):
    charset = set("".join(list(smiles)))
    #Important that pad gets value 0
    self.tokenlist = [self.pad, self.unknown, self.start, self.end] + list(charset)
  def tokenlist(self):
    return self._tokenlist
  def tokenlist(self, tokenlist):
    self._tokenlist = tokenlist
    #create the dictionaries      
    self.char_to_int = {c:i for i,c in enumerate(self._tokenlist)}
    self.int_to_char = {i:c for c,i in self.char_to_int.items()}
  def vectorize(self, smiles):
      return [self.char_to_int[self.start]] + [self.char_to_int.get(char, self.char_to_int[self.unknown]) for char in smiles] + [self.char_to_int[self.end]]
  def devectorize(self, tensor):
      return [self.int_to_char[i] for i in tensor]
  def n_tokens(self):
    return len(self.int_to_char)

The dataset just customized how to get the data from the dataframe.

# Build a dataset, that returns tensors
class MolDataset(Dataset):
  def __init__(self, data, tokenizer): = data
    self.tokenizer = tokenizer
  def __len__(self):
    return len(
  def __getitem__(self, idx):
    smiles =[idx]
    tensor = self.tokenizer.vectorize(smiles)
    return tensor

Bringing it together in a PyTorch-Lightning datamodule

For the model and training we’ll use PyTorch-lightning as this is a convenient way to organize the data and model into modules. First installation of the PyTorch-lightning package.
To have some SMILES molecules to play with, we will get a tranch from Zinc. Zinc is organized into smaller files in directories based on molecular weight and predicted logP. We just need one that is reasonable fast to download and of an adequate size. Last time I went through how a tokenizer and PyTorch datasets and data-loaders can be integrated into a reusable PyTorch lightning data-module, so I’ll just provide the code here for creating a PyTorch lightning SMILES data-module of a shard from Zinc.
We found that randomizing the atom-order of the molecules and using non-canonical SMILES increased the quality of the generated molecules. So the randomize_smiles_atom_order method has been added to the data-module. Randomizing the atom-order and creating new non-canonical SMILES works as data-augmentation and the dataset size can be increased orders of magnitude in size However, I want to keep this blog-post simple and fast to run through, so I just do it a single time. Even with a single pass the quality of the SMILES generation is improved: This is tested with LSTM networks, but I think it’s reasonable to expect it also to increase the quality when using transformers.

class ZincShardDataModule(LightningDataModule):
    def __init__(self):
        self.tokenizer = SimpleTokenizer()
        self.val_size = 200
        self.train_size = 200000
        self.batch_size = 128
        self.shard = "EEBD" #Shard must be > self.val_size + self.train_size
    def prepare_data(self):
        # called only on 1 GPU, no assignments of state!
        #e.g. use to download_dataset(), tokenize(), build_vocab()
        os.system(f"mkdir -pv {self.shard[0:2]} && wget -c{self.shard[0:2]}/{self.shard}.smi -O {self.shard[0:2]}/{self.shard}.smi")
    def randomize_smiles_atom_order(self, smiles):
        mol = Chem.MolFromSmiles(smiles)
        atom_idxs = list(range(mol.GetNumAtoms()))
        mol = Chem.RenumberAtoms(mol,atom_idxs)
        return Chem.MolToSmiles(mol, canonical=False)
    def custom_collate_and_pad(self, batch):
        #Batch is a list of vectorized smiles
        tensors = [torch.tensor(l) for l in batch]
        <h2>pad and transpose, pytorch RNNs  (and now transformers) expect (sequence, batch) dimensions</h2>
        tensors = torch.nn.utils.rnn.pad_sequence(tensors)
        return tensors
    def setup(self):
        #Load data = pd.read_csv(f"{self.shard[0:2]}/{self.shard}.smi", sep = " ", nrows = self.val_size + self.train_size)
        #Atom order randomize SMILES["smiles"] =["smiles"].apply(self.randomize_smiles_atom_order)
        #Initialize Tokenizer
        #Create splits for train/val
        idxs = np.array(range(len(
        val_idxs, train_idxs = idxs[:self.val_size], idxs[self.val_size:self.val_size+self.train_size]
        self.train_data =[train_idxs]
        self.val_data =[val_idxs]
    def train_dataloader(self):
        dataset = MolDataset(self.train_data, self.tokenizer)
        return DataLoader(dataset, batch_size=self.batch_size, pin_memory=True,collate_fn=self.custom_collate_and_pad)
    def val_dataloader(self):
        dataset = MolDataset(self.val_data, self.tokenizer)
        return DataLoader(dataset, batch_size=self.batch_size, pin_memory=True,collate_fn=self.custom_collate_and_pad, shuffle=False)

It should now be ready to download and prepare the data for training.

mydm = ZincShardDataModule()
smiles zinc_id
0 c1(Oc2ccc(OC)cc2)ccc([N+](=O)[O-])cc1C(NCC(=O)… ZINC000000132042
1 c12cc(NS(c3ccc([N+]([O-])=O)cc3)(=O)=O)ccc1COC2=O ZINC000000178081
2 N(n1c(=O)c2ccccc2nc1)C(c1cc([N+]([O-])=O)c(Cl)… ZINC000000183321
3 O=C(Nn1c(=O)c2ccccc2nc1)c1cc(Cl)ccc1[N+](=O)[O-] ZINC000000183346
4 C(=C/c1ccc([N+]([O-])=O)cc1)\C(=O)Nn1c(=O)c2cc… ZINC000000183354

A quick check that the dataloaders provide tensors.

dl = mydm.val_dataloader()
test_batch = next(iter(dl))
    tensor([[ 2,  2,  2,  ...,  2,  2,  2],
            [28,  6, 16,  ..., 28, 16, 28],
            [15, 15, 15,  ..., 15, 19, 15],
            [ 0,  0,  0,  ...,  0,  0,  0],
            [ 0,  0,  0,  ...,  0,  0,  0],
            [ 0,  0,  0,  ...,  0,  0,  0]])

Properties of the dataset

Let’s quickly test how the molecular weight distribution of the dataset looks like.

PandasTools.AddMoleculeColumnToFrame(,'smiles','ROMol')["MW"] =
_ = plt.hist(["MW"])

This is a relatively narrow distribution with clear cutoffs. It’ll be interesting to see how close the distribution of the generated molecules will be.

Coding the PyTorch-Lightning model

Next up is the creation of the transformer model. For some reason PyTorch has transformer models, transformer layers and such, but not the positional encoding that was used in the original paper. It is however in one of the examples, so we’ll grab the function from there. Basically it creates a positional encoding and adds it to the tensor, so that the transformer model can figure out the distance and order of the tokens.

!wget -c
from model import PositionalEncoding
--2021-05-01 10:51:49--
    Resolving (,,, ...
    Connecting to (||:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 6367 (6.2K)
Saving to: ‘’ 100%[===================>] 6.22K --.-KB/s in 0s 2021-05-01 10:51:49 (62.5 MB/s) - ‘’ saved [6367/6367]

Now we have all we need to create our SMILES generator model as a PyTorch lightning module. PyTorch lightning again expects some standard functions for the trainer to work, namely for getting and optimizer object, and to take a step (for both train and val). Setup of the layers , the forward function and the shared steps are just for our own convenience. The init function runs the save_hyperparameters() function, which saves all the arguments passed at init into a hparams object on the model. This way they will also latter get logged :-). The setup_layers functions creates the layers that we need for the model.
First the embedding layer that converts the tensors of integers into tensors of embedding vectors of the needed dimension (d_model). These are then passed on to the positional encoding object, that adds this extra information. A triangular mask is needed to prevent the model from looking “into the future” on the character that it is supposed to predict in an auto-regressive manner. The encoder is created from the encoder_layer and the layer normalization. Lastly is a fully connected linear layer that converts the vectors (d_model size) into the size of the vocabulary (here 36).
The step function is used by the PyTorch lightning training_ and validation_step functions. A key element in these functions here is the shift between the input and output. The input contains the start token, but not the last token, whereas the output skips the start token and has the last character. This way the first character of the SMILES will be the predicted output that is aligned with the start token. The second position will “see” the start token and the first character of the SMILES and will predict the second character of the SMILES and so forth due to triangular masking that is applied. Also note the self.log calls in the step functions. Here we control what information we want to get logged and when (per step or per epoch).

class SmilesEncoderModel(LightningModule):
    def __init__(
        max_length = 1000,
        max_lr = 1e-3,
        epochs = 50,
        self.save_hyperparameters() #Populates self.hparams from __init__ options.
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters())
        scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=self.hparams.max_lr, 
                       total_steps=None, epochs=self.hparams.epochs, steps_per_epoch=len(self.train_dataloader()),#We call train_dataloader, just to get the length, is this necessary?
                       pct_start=6/self.hparams.epochs, anneal_strategy='cos', cycle_momentum=True, 
                       base_momentum=0.85, max_momentum=0.95,
                       div_factor=1e3, final_div_factor=1e3, last_epoch=-1)
        scheduler = {"scheduler": scheduler, "interval" : "step" }
        return [optimizer], [scheduler]
    def setup_layers(self):
        self.embedding = torch.nn.Embedding(self.hparams.n_tokens, self.hparams.d_model)
        self.positional_encoder = PositionalEncoding(self.hparams.d_model, dropout=self.hparams.dropout)
        encoder_layer = torch.nn.TransformerEncoderLayer(self.hparams.d_model, self.hparams.nhead, self.hparams.dim_feedforward, self.hparams.dropout, self.hparams.activation)
        encoder_norm = torch.nn.LayerNorm(self.hparams.d_model)
        self.encoder = torch.nn.TransformerEncoder(encoder_layer, self.hparams.num_encoder_layers, encoder_norm)
        self.fc_out = torch.nn.Linear(self.hparams.d_model, self.hparams.n_tokens)
    def _generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask
    def forward(self, features):
        mask = self._generate_square_subsequent_mask(features.shape[0]).to(self.device)
        embedded = self.embedding(features)
        positional_encoded = self.positional_encoder(embedded)
        encoded = self.encoder(positional_encoded, mask=mask)
        out_2= self.fc_out(encoded)
        return out_2
    def step(self, batch):
        batch =
        prediction = self.forward(batch[:-1])#Skipping the last char
        loss = torch.nn.functional.cross_entropy(prediction.transpose(0,1).transpose(1,2), batch[1:].transpose(0,1))#Skipping the first char
        return loss
    def training_step(self, batch, batch_idx):
        loss = self.step(batch)
        self.log('loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss
    def validation_step(self, batch, batch_idx):
        loss = self.step(batch)
        self.log('val_loss', loss, on_step=False, on_epoch=True, prog_bar=True, logger=True)
        return loss

With that in place we can instantiate our PyTorch lightning model. I keep the dimensions low with two layers and d_model of 256 so that training is fast for demonstration purposes. Larger models (e.g. 4-layers) are more the default when it comes to transformers used for chemistry.

mydm.batch_size = 256
model = SmilesEncoderModel(n_tokens=mydm.tokenizer.n_tokens, num_encoder_layers=4, d_model=512, nhead=8, epochs=25, max_lr=8e-4)

Let’s just make a sanity check by passing our test_batch from before through the model.

out = model.forward(test_batch)
    torch.Size([79, 128, 37])

Training the model

Next is just to decide that we want also to log the learning rate and that we want to use the TensorBoard logger when we create the trainer object.

import pytorch_lightning
lr_logger =pytorch_lightning.callbacks.lr_monitor.LearningRateMonitor(logging_interval="epoch")
tb_logger = pytorch_lightning.loggers.TensorBoardLogger('tensorboard_logs/')
trainer = pytorch_lightning.Trainer(
    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores

The TensorBoard extension for Google colab is quite convenient to monitor the training interactively, and also for keeping track of different models performance and hyper parameters.

%tensorboard --logdir tensorboard_logs

Now we can start training using the trainer object. Enjoy. I always finds it facinating to watch the training loss drop as training progresses., mydm)

Saving a little bundle with the model, the hyperparameters and the tokenlist from the tokenizer can come in handy later.

import pickle
save_dict = {"state_dict": model.state_dict(), "tokenlist": mydm.tokenizer.tokenlist, "hparams":model.hparams}
pickle.dump(save_dict, open("drive/MyDrive/temp/deNovoFormer_25.pickle","wb"))

Sampling the model

Now that we have trained the model, we can see how the test_batch now looks. It looks a lot less random than before and we note that there’s some padding in the end (position 0) sampled with high probability, possibly after the end token.

out = model.forward(test_batch)

In order to sample a SMILES string, we need to write a small sample loop. The sampler creates an empty tensor and seeds this with the start token. Then this is passed into the model and the probability distribution used to sample the next character which are then added to the tensor, before it’s passes into the model again. Only the slice of the tensor needed for prediction of the next character is passes into the model to save computations. If the stop token is sampled, the iteration stops. This is different from the LSTM based sampling, where simply the hidden states are passed along from prediction to prediction. Here, a longer and longer tensor needs to be passed into the model, which contributes to making the sampling slower.

class Sampler():
  def __init__(self, model, tokenizer):
    self.model = model
    self.tokenizer = tokenizer
    self.max_len = 128
  def sample(self):
    sample_tensor = torch.zeros((self.max_len,1), dtype=torch.long).to(model.device)
    sample_tensor[0,0] = self.tokenizer.char_to_int[self.tokenizer.start]
    with torch.no_grad():
      for i in range(self.max_len-1):
        tensor = sample_tensor[:i+1]
        logits = self.model.forward(tensor)[-1]
        probabilities = torch.nn.functional.softmax(logits, dim=1).squeeze()
        sampled_char = torch.multinomial(probabilities,1)
        sample_tensor[i+1,0] = sampled_char
        if sampled_char == self.tokenizer.char_to_int[self.tokenizer.end]:
    smiles = "".join(self.tokenizer.devectorize(sample_tensor.squeeze().detach().cpu().numpy())).strip("^$ ")
    return smiles
  def sample_multi(self, n):
    smiles_list = []
    for i in range(n):
    return smiles_list

Moving the model to the GPU helps on inference speed"cuda")
    device(type='cuda', index=0)
sampler = Sampler(model, mydm.tokenizer)
smiles_list = sampler.sample_multi(100)

Sampling would be faster if we did it in batches, but to keep things simple, it’s done one-by-one in this example Sampler class.
We’ll parse the generated SMILES strings into RDKit molecules and compute the fraction of unparsable SMILES strings.

parsed_mols = np.array([Chem.MolFromSmiles(s) for s in smiles_list])
    RDKit ERROR: [16:31:59] Can't kekulize mol.  Unkekulized atoms: 16 17 21
    RDKit ERROR: 
    RDKit ERROR: [16:31:59] Can't kekulize mol.  Unkekulized atoms: 0 4 5 6 19 20 22 23 24
    RDKit ERROR: 
    RDKit ERROR: [16:31:59] Can't kekulize mol.  Unkekulized atoms: 0 2 18 21 23
    RDKit ERROR: 
print("Percentage invalid")
(parsed_mols == None).sum()/len(parsed_mols)
    Percentage invalid

The model has definitely learned something about creation SMILES strings character by character, but there may be room for improvement. SMILES augmentation, larger datasets, larger models and longer training would likely improve the results even more.
We can take a look at the molecules


They don’t look that weird to me.
How does the distribution of the molecular weights looks?

_ = plt.hist(list(map(Descriptors.MolWt, parsed_mols[parsed_mols != None])))
plt.axvline(x=min(["MW"]), c="r")
plt.axvline(x=max(["MW"]), c="r")

The molecular weight of the generated molecules are both smaller and larger than in the training set. The distribution of the de novo generated molecules are however centered on the previous distribution.
Are the molecules novel? We’ll create sets of InChi keys and see of there’s overlap.

denovo_inchi = set(map(Chem.inchi.MolToInchi, parsed_mols[parsed_mols != None]))
train_inchi = set(map(Chem.inchi.MolToInchi,
    [1;30;43mStreaming output truncated to the last 5000 lines.[0m
    RDKit WARNING: [16:35:19] WARNING: Charges were rearranged
    RDKit WARNING: [16:35:19] WARNING: Charges were rearranged

This produces a fair set of warning about charges. I think ZINC has charged molecules ready for docking, maybe this is what triggers the warnings. Or perhaps the charge separated representation for nitro groups.


Apparently some of the generated molecules belonged to the train set, so the network has started to learn the dataset by hearth. But still a lot of novel molecules in there.
So I hope this walk-through of using transformers models was instructive. They can certainly be used for de novo generation of molecules, but if they are better than LSTM networks for this particular task would require more careful comparison by keeping the datasets fixed and doing proper hyperparameter tuning for both model types. I think they are a bit more complicated to work with than LSTM networks. However, as we saw in a previous blog-post, LSTM networks fails quite miserably when it comes to molecular transformation tasks such as reaction prediction, and I hope to soon get the time to write a blog-post about implementing a transformer model for reaction prediction.

Share this Post

1 Comment

  1. Pingback: Cambridge Cheminformatics Newsletter, 20 May 2021 – DrugDiscovery.NET – AI in Drug Discovery

Leave a Comment

Your email address will not be published. Required fields are marked *