# Wash that gold: Modelling solubility with Molecular fingerprints….

Last blog entry the conversion between molecule and fingerprint was briefly touched upon. Now the fingerprints will be used as the basis for a simple attempt to build a solubility model using some data from the web. The file with some 1000 molecules as smiles and their aqueous solubility was downloaded fromhttp://www.vcclab.org/lab/alogps/

To prepare it for import into python a header was added to the file and the comments in the end double hashed ##.

`name smiles solubility `

`60-35-5 CC(N)=O 1.58`

....

Then this can be done in Python

```
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn import linear_model
#First import data from the text file using NumPy's genfromtxt
data = np.genfromtxt("Solubility_data.smiles",delimiter=' ',names=True, dtype=None, comments='##')
#For each row, convert the Smiles to mols, mols to fingerprints and store in prepared python lists
molfps = []
solub = []
for row in data:
mol = (Chem.MolFromSmiles(row['smiles']))
fp = AllChem.GetMorganFingerprintAsBitVect(mol,2,nBits=1024) #2
bitstring = fp.ToBitString()
fp_vect = map(int, bitstring)
molfps.append(fp_vect)
solub.append(row['solubility'])
#Convert the python list into NumPy arrays for efficiency.
X = np.array(molfps)
y = np.array(solub)
print X.shape
```

This gives a matrix X, with all the molecules (samples) along dimension 1 and the morgan fingerprints along dimension 2 (features). The y is a vector with the solubility data.

Now deep learning neural networks will be used to… Joke aside, 95% of machine learning problems do not require deep learning ;-), so a simple solution will be tried first.

Multiple Linear regression (MLR) is simply an expansion of the well know linear formula **y=a*x + b.**

y = b1*x1 + b2*x2 + … + bi*xi + b0

The parameters b1 .. bi will be fit so the features of each sample (x1 to xi, the morgan fp bits) can predict the solubility (y). In python it can be done with scikit-learn in three lines:-)

```
#Load the LinearRegression from sklearn.linear_model.
estimator = linear_model.LinearRegression()
#Fit and score
estimator.fit(X,y)
print estimator.score(X,y)
```

This reports a correlation coefficient around 0.96. With 1024 features and 1311 molecules there’s a high risk of over fitting, which can be tested with cross-validation. Scikit-learn also contain modules for exactly that.

```
#Cross Validate
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)
estimator.fit(X_train,y_train)
print estimator.score(X_train,y_train)
#Out[95]: 0.98429997844791894
print estimator.score(X_test,y_test)
#Out[94]: -2.7528207967268459e+24
```

Obviously something is wrong, it should be between 0 and 1. Maybe some round off error??? A fast plot show the bad correlation:

```
import matplotlib.pyplot as plt
plt.scatter(estimator.predict(X_test),y_test)
plt.show()
```

Simple multiple linear regression using 1024 molecular bits needs way more than 1311 molecular samples to avoid over fitting. However, next time I’ll try a special trick….

Best Regards

Esben Jannik Bjerrum

Pingback: Molecular neural network models with RDKit and Keras in Python | Wildcard Pharmaceutical Consulting

This is a commom problem: datasets are small because experiments are expensive, and because results are not always shared; and there are hundreds of molecule descriptors.

We have had success with gradient boosted trees, and with SVMs. But both need care to avoid overfitting.

Nice to hear that you have had success with gradient boosted trees and SVMs. In the follow up blog I show what can be done about the problem with tuning of regularization: Safer fitting through regularization I have since I wrote these blog posts done a lot of experimenting with Neural Networks. Even though they are very complex the choice of architecture seem to protect somewhat against over-fitting. Even in the absence of regularization techniques like drop-out and weight decay. But data is the king, so my attention is currently focused on domain specific data augmentation techniques like https://arxiv.org/abs/1703.07076 and https://arxiv.org/abs/1710.01927

Very interesting blog!

Why do you use this line of code: fp_vect = map(int, bitstring)?

I read that the map function uses the same function on every item of an iterable object (a string in this case). This line of code returns a match object, but I do not really get the advantage of doing thuis.

Thanks in advance.

Hi, thanks for the feedback. Its been some time since I wrote the blog post, but the map seem to be used a kind of casting, where the string of bits is converted to a list of integers. A list of list of integers are easily converted to a numpy array later. There may be a lot of other ways to get from the fingerprint rdkit.DataStructs.cDataStructs.ExplicitBitVect to the numpy array, maybe a direct cast is actually easier e.g. fp_vect = list(fp)?

Thank you for the reply. I figured out why you converted the bitstring to a list of zeros and ones.

I guess you used python 2.x for the code? I use python 3.x and then the code does not work, but I found a simple solution:

fp_vect = [*map(int, bitstring)]