Building a simple QSAR model using a feed forward neural network in PyTorch

Esbenbjerrum/ May 1, 2020/ Blog, Cheminformatics, Machine Learning, Neural Network, RDkit/ 14 comments

In my previous blogposts I’ve entirely been using Keras for my neural networks. Keras as a stand-alone is now no longer active developed, but are instead now an integrated part of TensorFlow, which are developed and open-sourced by Google. Pytorch is another framework for building artificial neural network models, which came out of Facebooks AI research and was open-sourced. In this blog-post I show how to build a simple feed-forward neural network in PyTorch for making a quantitative structure-activity relationship model (QSAR).

Download data and prepare data

ExcapeDB ( is a project developed as part of the BigChem project. It’s a treasure trove of bioactivity data and a nice feature is the possibility to download bulk datasets from the interface after setting search critera. I downloaded the active molecules that had an associated pXC50 value of the SLC6A4 gene. I set the Gene Symbol to SLC6A4 and “Active” in results from the interface and download the dataset in CSV format. The downloaded file was renamed to SLC6A_active_excape_export.csv. SLC6A4 is a serotonine transporter that transport serotonine from the synaptic cleft back into the neurons. It’s a target of many antidepressants, which is likely why there’s collected a lot of measurements on the activity of different compounds.
I’ll import some libraries to use later and use pandas to load the CSV file and manage the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
data = pd.read_csv("SLC6A4_active_excape_export.csv")
Ambit_InchiKey Original_Entry_ID Entrez_ID Activity_Flag pXC50 DB Original_Assay_ID Tax_ID Gene_Symbol Ortholog_Group SMILES
0 DFPSUOZWGQDKPI-COPLHBTANA-N 45380180 6532 A 8.40012 pubchem 462683 9606 SLC6A4 4061 ClC=1C=C([C@]23[C@H]([C@@H]2CN(C)C)CNC3)C=CC1Cl

Loading the CSV file gives us a dataframe with the information. The interesting columns for building the QSAR model in PyTorch is the structure, which are encoded in the SMILES column, and the pXC50 which is the – log10 converted IC50 or EC50 values. Ki had been better, but this is what we have for working with.
I’ll use RDKit for conversion of the SMILES into molecular objects and for fingerprinting of the molecules. so we’ll start by converting the SMILES to RDKit molecular objects. The pandastools module contain convenient functions for creating a molecule column. The Molecule column is rendered when shown in the jupyter notebook, which is convenient when exploring the data.

from rdkit import Chem, DataStructs
from rdkit.Chem import PandasTools, AllChem
SMILES Molecule
0 ClC=1C=C([C@]23[C@H]([C@@H]2CN(C)C)CNC3)C=CC1Cl Mol

As far as I know the ExcapeDB project used the Ambit toolkit and some of the SMILES may not be sanitizable by RDKit. I’ll check if some of the SMILES could not be converted by looking for None values in the Molecule column


All the SMILES appear to be converted for this dataset. Good.

Vectorizing the molecules

The Morgan fingerprint is a standard for fingerprinting molecules, but others can be tried as well. Calculating the fingerprints with RDKit allows us to convert the molecules into vectors of numbers that can be used for modelling. It’s fast to compute with a little helper function and as an example I’ll convert the fingerprint of molecule number 1 into a little image.

def mol2fp(mol):
    fp = AllChem.GetHashedMorganFingerprint(mol, 2, nBits=4096)
    ar = np.zeros((1,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, ar)
    return ar
fp =mol2fp(Chem.MolFromSmiles(data.loc[1,"SMILES"]))
plt.matshow(fp.reshape((64,-1)) > 0)

Using the apply function of the pandas dataframe makes it easy to add a new column with the fingerprint vectors.

data["FPs"] = data.Molecule.apply(mol2fp)

Splitting and converting data to PyTorch tensors

PyTorch need to train on pytorch tensors, which are similar to Numpy arrays, but with some extra features such a the ability to be transferred to the GPU memory. But first we’ll devide the dataset into train, validation and test using scikit learn.

X = np.stack(data.FPs.values)
    (7228, 4096)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
y = data.pXC50.values.reshape((-1,1))
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size=0.10, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train,  test_size=0.05, random_state=42)
#Normalizing output using standard scaling
scaler = StandardScaler()
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
y_validation = scaler.transform(y_validation)

I’ll remove some the bits with the lowest variance. They can be almost always on, or very seldom used. The VarienceThreshold class from scikit-learn is handy for just that. Just as for the normalization, we’ll compute the variance using the train set only, and use it to remove the same features from all the data sets.

# We'll remove low variance features
from sklearn.feature_selection import VarianceThreshold
feature_select = VarianceThreshold(threshold=0.05)
X_train = feature_select.fit_transform(X_train)
X_validation = feature_select.transform(X_validation)
X_test = feature_select.transform(X_test)
    (6179, 219)

Seems like most of the bits were seldom used. The threshold is a hyper-parameter of the entire pipeline for the model, and could be tuned. We’l move the datasets to the GPU and convert them to pytorch tensors.

# Let's get those arrays transfered to the GPU memory as tensors
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# If you don't have a GPU, buy a graphics card. I have for a long time used a 1060 GTX, which is not that expensive anymore.
X_train = torch.tensor(X_train, device=device).float()
X_test = torch.tensor(X_test, device=device).float()
X_validation = torch.tensor(X_validation, device=device).float()
y_train = torch.tensor(y_train, device=device).float()
y_test = torch.tensor(y_test, device=device).float()
y_validation = torch.tensor(y_validation, device=device).float()
    tensor([[0., 0., 0.,  ..., 0., 0., 0.],
            [1., 0., 0.,  ..., 0., 0., 0.],
            [2., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 1., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 1.],
            [0., 0., 0.,  ..., 0., 1., 1.]], device='cuda:0')
    torch.Size([6179, 219])

Pytorch work with datasets and dataloaders to feed the minibatches to the model during training, so we convert the data. It’s easy to create a dataset from the already created tensors. The batch size is set to 256 and the trainloader will shuffle the data when an epoch has been used.

from import TensorDataset
train_dataset = TensorDataset(X_train, y_train)
validation_dataset = TensorDataset(X_validation, y_validation)
train_loader =,
validation_loader =,

Good, with the training data ready it’s time for the fun part: Defining the network. In PyTorch this can be done by subclassing the nn.Module. The __init__ function will set up all the layers needed in the network, and the forward method will define how they are used in a syntax very similar to the Keras functional api: tensor = layer()(tensor). There’s no activation function for the final output, as this is a regression model, the linear activation is just fine.

class Net(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_rate, out_size):
        super(Net, self).__init__()
        # Three layers and a output layer
        self.fc1 = nn.Linear(input_size, hidden_size)  # 1st Full-Connected Layer
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, hidden_size)
        self.fc_out = nn.Linear(hidden_size, out_size) # Output layer
        #Layer normalization for faster training
        self.ln1 = nn.LayerNorm(hidden_size)
        self.ln2 = nn.LayerNorm(hidden_size)
        self.ln3 = nn.LayerNorm(hidden_size)        
        #LeakyReLU will be used as the activation function
        self.activation = nn.LeakyReLU()
        #Dropout for regularization
        self.dropout = nn.Dropout(dropout_rate)
    def forward(self, x):# Forward pass: stacking each layer together
        # Fully connected => Layer Norm => LeakyReLU => Dropout times 3
        out = self.fc1(x)
        out = self.ln1(out)
        out = self.activation(out)
        out = self.dropout(out)
        out = self.fc2(out)
        out = self.ln2(out)
        out = self.activation(out)
        out = self.dropout(out)
        out = self.fc3(out)
        out = self.ln3(out)
        out = self.activation(out)
        out = self.dropout(out)
        #Final output layer
        out = self.fc_out(out)
        return out

With the network class defined we can set the hyperparameters and create the model.

#Defining the hyperparameters
input_size = X_train.size()[-1]     # The input size should fit our fingerprint size
hidden_size = 1024   # The size of the hidden layer
dropout_rate = 0.80    # The dropout rate
output_size = 1        # This is just a single task, so this will be one
learning_rate = 0.001  # The learning rate for the optimizer
model = Net(input_size, hidden_size, dropout_rate, output_size)

The model can be moved to the GPU with the .cuda() method inherited from the nn.Module class. I’ve yet to look up how the module discovers the layers that were set as properties.

      (fc1): Linear(in_features=219, out_features=1024, bias=True)
      (fc2): Linear(in_features=1024, out_features=1024, bias=True)
      (fc3): Linear(in_features=1024, out_features=1024, bias=True)
      (fc_out): Linear(in_features=1024, out_features=1, bias=True)
      (ln1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (ln3): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (activation): LeakyReLU(negative_slope=0.01)
      (dropout): Dropout(p=0.8, inplace=False)

For a loss function I’ll use the mean squared error. Adam is chosen for the optimizer.

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

Training is a bit more handheld than in Keras. This gives flexibility to train the network layers in different ways, but here we just train it with all the data. For each epoch, the running_loss is zeroed, and the minibatches fetched from the train data in the DataLoader. We start each mini-batch loop by zeroing the gradients. All forward computations done from this point on, will be recorded and used to calculate the backward pass needed for optimzing the weights of the network. At the end of each epoch, some statistics is printed so that it’s possible to follow the training.

model.train() #Ensure the network is in "train" mode with dropouts active
epochs = 200
for e in range(epochs):
    running_loss = 0
    for fps, labels in train_loader:
        # Training pass
        optimizer.zero_grad() # Initialize the gradients, which will be recorded during the forward pa
        output = model(fps) #Forward pass of the mini-batch
        loss = criterion(output, labels) #Computing the loss
        loss.backward() # calculate the backward pass
        optimizer.step() # Optimize the weights
        running_loss += loss.item()
        if e%10 == 0:
            validation_loss = torch.mean(( y_validation - model(X_validation) )**2).item()
            print("Epoch: %3i Training loss: %0.2F Validation loss: %0.2F"%(e,(running_loss/len(train_loader)), validation_loss))
    Epoch:   0 Training loss: 1.22 Validation loss: 0.94
    Epoch:  10 Training loss: 0.54 Validation loss: 0.58
    Epoch:  20 Training loss: 0.44 Validation loss: 0.50
    Epoch:  30 Training loss: 0.40 Validation loss: 0.49
    Epoch:  40 Training loss: 0.36 Validation loss: 0.47
    Epoch:  50 Training loss: 0.34 Validation loss: 0.48
    Epoch:  60 Training loss: 0.32 Validation loss: 0.47
    Epoch:  70 Training loss: 0.31 Validation loss: 0.48
    Epoch:  80 Training loss: 0.30 Validation loss: 0.46
    Epoch:  90 Training loss: 0.28 Validation loss: 0.44
    Epoch: 100 Training loss: 0.27 Validation loss: 0.45
    Epoch: 110 Training loss: 0.27 Validation loss: 0.42
    Epoch: 120 Training loss: 0.25 Validation loss: 0.41
    Epoch: 130 Training loss: 0.25 Validation loss: 0.44
    Epoch: 140 Training loss: 0.25 Validation loss: 0.47
    Epoch: 150 Training loss: 0.23 Validation loss: 0.42
    Epoch: 160 Training loss: 0.22 Validation loss: 0.45
    Epoch: 170 Training loss: 0.23 Validation loss: 0.40
    Epoch: 180 Training loss: 0.22 Validation loss: 0.41
    Epoch: 190 Training loss: 0.21 Validation loss: 0.41

Before using the model for prediction, it must be set to evalulation mode, where the dropout layers are no longer active. We can calculate the prediction on the training set, validation set and the external test set.

model.eval() #Swith to evaluation mode, where dropout is switched off
y_pred_train = model(X_train)
y_pred_validation = model(X_validation)
y_pred_test = model(X_test)

Calculating the root mean square error can be done using the pytorch operators. The .item() method converts single element tensors to Python scalers for printing.

torch.mean(( y_train - y_pred_train )**2).item()
torch.mean(( y_validation - y_pred_validation )**2).item()
torch.mean(( y_test - y_pred_test )**2).item()

The error on the held out test set is actually lower than the validation set, although both are much worse than the train set. This indicates overfitting, even though the dropout rate was set high at 0.8.
Finally, lets evaluate the model visually. Before that is possible there seem to be a need to transport the tensor to cpu, detach it from gradient before we can can convert it to numpy and flatten it.

def flatten(tensor):
    return tensor.cpu().detach().numpy().flatten()
plt.scatter(flatten(y_pred_test), flatten(y_test), alpha=0.5, label="Test")
plt.scatter(flatten(y_pred_train), flatten(y_train), alpha=0.1, label="Train")
plt.plot([-1.5, 1.5], [-1.5,1.5], c="b")

If we want to predict an arbitrary molecule, we have to remember to use the VarianceThreshold object on the fingerprints and also reverse scale the prediction from the feed forward neural network. The scikit_learn objects are already fit and can be used here.

def predict_smiles(smiles):
    fp =mol2fp(Chem.MolFromSmiles(smiles)).reshape(1,-1)
    fp_filtered = feature_select.transform(fp)
    fp_tensor = torch.tensor(fp_filtered, device=device).float()
    prediction = model(fp_tensor)
    #return prediction.cpu().detach().numpy()
    pXC50 = scaler.inverse_transform(prediction.cpu().detach().numpy())
    return pXC50[0][0]

If you read this far, I hope you found it useful. There’s plenty to play around with to improve the model. The hyperparameters could be tuned (such as learning-rate, batch size, size of the hidden layers, activation functions, dropout or regularization with l2 etc.) But experiments with other fingerprints could be also be tried, as well as the feature selection step could be varied.

Share this Post


  1. I have worked through your excellent example of using pytorch and got to the end and tried the prediction for the given smiles string, but got the following error:

    ValueError: X has a different shape than during fitting.

    The shape of the fingerprint after reshaping is (4096, 1), but the feature_select.transform(fp) is where the error is happening. Any thoughts as to why?

    1. Thanks for your interest and comment. Scikit-Learn usually have samples as dimension 0, and features as 1, so the array need the shape (1, 4096). Check that the variable dimension is second in your reshape call .reshape(1,-1).

  2. Pingback: A simple LSTM based QSAR model in PyTorch | Cheminformania

  3. how come rdkit working with python 3 !!

    1. Not sure I understand your question correctly, RDKit has been python 3 for some time now (think 2014 or so)?

  4. Thank you for your blog, I have found it very interesting and informative. I would like to know, does RDKit calculate logD as a feature? I have been searching for it but can’t seem to find it. If so, which module/function in RDKit calculates it? If not, what could I use to calculate it?

    1. Thanks for your question. Although this has nothing to do with the blog-post, I believe you may be looking for You could probably get a much more detailed answer by asking your question on the RDKit discuss mailing list:

  5. Thank you for such a nice blog. There is one part I don’t quite understand.
    As regards to validation_loader, I don’t how and where you use the output for calculating the prediction metrics.

    1. Ah, yes, I see why you may be confused. I don’t use the validation_loader in this example, as the dataset is already available as X_validation and y_validation and fits in memory. So instead of looping through mini-batches from validation loader, I simply make the predictions in one go. If the dataset would not fit in memory during prediction it would have to be done on in mini-batches, or you’d switch to a framework like pytorch-lightning, you would need to provide the validation_loader from the data module.

  6. Any suggestion on how i can tune it?the examples I find focus on image classification and I am having some trouble to transfer the procedure to this model-

    1. Hi, I’ve been busy so maybe you figured it out. But there’s a couple of packages out there that can help you with a bit more than simply random tuning or grid search. As example there’s Optuna and the derivate Qptuna more targeted for molecule models. There’s also hyperopt and gpyopt. I used the later in this blogpost

      1. Thanks!I will check it!

  7. Hi, I really enjoyed this tutorial, thanks! I am wondering if you have the code somewhere available to download?

    1. No, I don’t have a ready to run code for download. This is for inspiration for your own scripts.

Leave a Comment

Your email address will not be published.