Peeking into the chemical space using free tools

Esben Jannik Bjerrum/ December 19, 2016/ Blog, Cheminformatics, RDkit/ 2 comments

As covered before, chemical space is huge. So it could be nice if this multidimensional molecular space could be reduced and visualized to get an idea about where how query molecules relate to one another. This small tutorial show a simple example of how this could be done with the PCA decomposition from scikit-learn and molecular fingerprints calculated with RDkit.
First the dataset is created from 50000 molecules using RDKit, the usual hashing to a set bit length is skipped, and instead only the bits where theres more than 0.001 of the bits set is kept.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import rdBase, DataStructs
import glob
#Dictionary Based Fingerprints
import sklearn.feature_extraction
files = glob.glob('/home/esben/Zinc_files/*sdf')
def getMorganAsDict(mol):
d = {}
return d
for sdf in files[0:1]:
print "Processing %s"%sdf
sup = Chem.SDMolSupplier(sdf)
for moli in range(0,50000):
mol = sup[moli]
d = getMorganAsDict(mol)
print "Error in Conversion"
v = sklearn.feature_extraction.DictVectorizer(sparse=True, dtype=float)
##For modelling
X = v.transform(fps)
#Remove most sparse
from sklearn.feature_selection import VarianceThreshold
cutoff = 0.999
sel = VarianceThreshold(threshold=(cutoff * (1 - cutoff)))
X2 = sel.fit_transform(X)
print X2.shape
#Pickle v, and X.
import pickle

This gives a dataset with the most populated bits of the morgan fingerprints.

#Make a PCA and plot:
import pickle
Xo = pickle.load(file('trainingset.pickle','r'))
#Convert to dense array
X = Xo.toarray()
del Xo
print X.shape
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(copy=False)
from sklearn.decomposition import RandomizedPCA
pca = RandomizedPCA(n_components=3) #Dense data required
print pca.explained_variance_ratio_
#RandomizedPCA [ 0.00329575  0.00263305  0.00252065]
Xred =pca.transform(X)

Now the (50000, 4028) matrix has been reduced to a 50000×3 matrix, while trying to preserve most of the variance of the data. Lets try and plot it together with some other datasets which where made to fit.

#Read DHFR test set
Xdhfr = scaler.transform(pickle.load(file('dhfr_testset.pickle','r')).toarray())
ydhfr = pickle.load(file('dhfr_testset_y.pickle','r'))
thr = 1e-2
enc_dhfr0 = pca.transform(Xdhfr[ydhfr < thr])
enc_dhfr1 = pca.transform(Xdhfr[ydhfr > thr])
from matplotlib import pyplot as plt
def star_scatter(X,Y,size,color):
plt.scatter(X, Y, c=color, s=size*2,linewidths=0)
plt.scatter(X, Y, c='w', s=size,  linewidths=0)
plt.scatter(X, Y, c=color, s=max((size*4,20)),linewidths=0, alpha=0.2)
star_scatter(Xred[:,0], Xred[:,1],1,'w')
star_scatter(enc_dhfr1[:,0], enc_dhfr1[:,1],5,'g')
star_scatter(enc_dhfr0[:,0], enc_dhfr0[:,1],10,'r')
##Change Color of Canvas
ax = plt.gca()
ax.set_axis_bgcolor((0.05, 0, 0.15))
plt.xlabel('PC 1')
plt.ylabel('PC 2')

I made it look a bit more “space like” on purpose. The green “stars” are the dihydrofolate ligands, with the most active highlighted in red. There seem to be preference for a certain area of the plot, with the high actives preferring a slightly other spot than the bulk of the ligands. However, the explained variance is quite low for the components as the first two components only explain less than 1% of the total variance.


Share this Post


  1. how can we get the file “dhfr_testset_y.pickle”

    1. Hi Dalong,
      Thank you for the interest in the Blog post. The DHFR test set is derived from Sutherlands QSAR datasets,, which I had used long time ago. The fingerprints were calculated from the molecules as for the PCA training set, and the ydhfr is a numpy array or vector with the IC50 values. I hope this clarifies the issue, otherwise I can send you the pickle files.
      Best Regards

Leave a Comment

Your email address will not be published.