Fetch the Sheriff! I found a fingerprint!

Esben Jannik Bjerrum / November 23, 2015 / Blog, Cheminformatics, RDKit

The headline is a bit misleading, as this is not about the fingerprints of criminals but about chemical fingerprints. Chemical fingerprinting is a way of converting drawn molecules into streams of bits, 0's and 1's. An old fingerprint type is the MACCS keys, which were developed by the former MDL as a fast way of doing substructure screening in molecular databases. The public version comprises 166 keys, that is 166 0's and 1's, where each key corresponds to a particular molecular feature, such as the existence of a carbonyl group (key 154:('[#6]=[#8]',0), # C=O in the RDKit implementation). Another type of fingerprint available in RDKit is the Morgan fingerprint, which is a circular fingerprint. Each atom's environment and connectivity are analysed up to a given radius and each possibility is encoded. The very many possibilities are usually compressed to a predefined length, such as 1024 bits, via a hashing algorithm. Circular fingerprints are thus systematic explorations of the atom types and connectivity of the molecule, whereas the MACCS keys depend on a predefined set of molecular features to be matched.
Generating fingerprints with RDKit in Python is quite easy, as illustrated by the short demo below.

# First a couple of imports
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import MACCSkeys  # MACCSkeys lives in rdkit.Chem, not in the top-level package
from rdkit import DataStructs


# Make two test molecules
mol = Chem.MolFromSmiles('CCCN')
mol2 = Chem.MolFromSmiles('CCCO')


# Make a MACCS fingerprint and print the bits
fp1 = MACCSkeys.GenMACCSKeys(mol)
print(fp1.ToBitString())

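This gives a string of 167 0's and 1's: the RDKit bit vector has 167 positions, with position 0 unused so that positions 1 to 166 line up with the key numbers.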


We can do the same with the Morgan type. Morgan fingerprints are in the AllChem module.

fp1_morgan = AllChem.GetMorganFingerprint(mol,2)

The length is very large, as there is a huge number of possible combinations of atoms and bonding, even when going just 2 bonds out!
print(fp1_morgan.GetLength())
Out[15]: 4294967295
But it can be hashed to a defined length.

fp1_morgan_hashed = AllChem.GetMorganFingerprintAsBitVect(mol,2,nBits=1024)
fp1_morgan_hashed.ToBitString()
Out[22]: '0000000000000000000000100000000001000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000'
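The idea of compressing a very large identifier space down to a fixed length can be sketched in plain Python. This is only an illustration and not RDKit's actual hashing scheme: it assumes simple modulo folding, and the function name fold_to_bits and the integer identifiers are made up for the example.

```python
def fold_to_bits(feature_ids, n_bits=1024):
    """Fold arbitrary integer feature identifiers into a fixed-length bit list.

    Assumption for illustration: plain modulo folding, not RDKit's own scheme.
    """
    bits = [0] * n_bits
    for feature_id in feature_ids:
        # Distinct identifiers can land on the same position (a collision)
        bits[feature_id % n_bits] = 1
    return bits


# Hypothetical identifiers standing in for encoded atom environments
features = [5, 1029, 2048, 70000]
bits = fold_to_bits(features, n_bits=1024)
on_positions = [i for i, bit in enumerate(bits) if bit]
print(on_positions)
```

Here 5 and 1029 fold onto the same position, so four features end up as only three on bits. Collisions like this are the price of a short, fixed-length fingerprint.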

Fingerprints are mostly used to evaluate the similarity between compounds by comparing which bits are on and off. A much used similarity measure is the Tanimoto coefficient: the number of bits that are on in both fingerprints divided by the number of bits that are on in at least one of them. The fingerprints for molecule 2 (fp2 and fp2_morgan_hashed) were generated from mol2 in the same way as for molecule 1. RDKit has a built-in function to calculate fingerprint similarity, DataStructs.FingerprintSimilarity, and its default metric is the Tanimoto coefficient.

First calculated with the MACCS keys.

In [27]: DataStructs.FingerprintSimilarity(fp1,fp2)
Out[27]: 0.45

Then calculated using the Morgan fingerprints.

In [29]: DataStructs.FingerprintSimilarity(fp1_morgan_hashed,fp2_morgan_hashed)
Out[29]: 0.3333333333333333

The two values are not the same, but both are low: these are small molecules, so changing one heavy atom out of four shows up as a low similarity.
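To make the measure itself concrete, here is a minimal sketch of the Tanimoto coefficient computed by hand on two bit strings. The function name tanimoto and the two short example strings are made up for illustration, and returning 0.0 for two empty fingerprints is just a convention chosen for this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two equal-length bit strings such as '0110'."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    on_a = bits_a.count("1")
    on_b = bits_b.count("1")
    # Bits that are on in both fingerprints
    shared = sum(1 for x, y in zip(bits_a, bits_b) if x == "1" and y == "1")
    # Bits that are on in at least one fingerprint
    union = on_a + on_b - shared
    # Convention chosen for this sketch: two empty fingerprints give 0.0
    return float(shared) / union if union else 0.0


# Two made-up 8-bit fingerprints: 3 shared on bits, 5 on bits in total
example_a = "11010010"
example_b = "10011010"
print(tanimoto(example_a, example_b))
```

The strings returned by ToBitString() have the same '0'/'1' form, so they could be passed to a function like this as a cross-check against the RDKit values.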
So why is this useful? Being able to treat molecules as bit strings in a structured way opens up possibilities for a lot of cheminformatics workflows and applications, such as QSAR analysis and fast searching of databases. Because each element of the structure is mapped to the fingerprint, it can form the basis for building QSAR models, which often use similarity, for example in the kernel functions employed in an SVM model. However, the conversion is (almost) a one-way road: it's not always possible to go the other way round, from a fingerprint back to the originating molecule. RDKit contains a couple of different molecular fingerprints and a collection of descriptors. They describe different aspects of the molecule and can perform differently depending on the application. This is, however, beyond the scope of this short introduction.
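As a rough sketch of the kernel idea, the pairwise Tanimoto similarities of a set of fingerprints can be collected in a square matrix. The fingerprints here are hypothetical sets of on-bit positions, and the names tanimoto_sets and tanimoto_kernel are made up for the example.

```python
def tanimoto_sets(on_a, on_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(on_a | on_b)
    # Convention chosen for this sketch: two empty fingerprints give 0.0
    return float(len(on_a & on_b)) / union if union else 0.0


def tanimoto_kernel(fingerprints):
    """Square matrix of pairwise Tanimoto similarities (list of lists)."""
    return [[tanimoto_sets(a, b) for b in fingerprints] for a in fingerprints]


# Hypothetical fingerprints, each a set of on-bit positions
fps = [{1, 5, 9}, {1, 5, 7}, {2, 4}]
gram = tanimoto_kernel(fps)
for row in gram:
    print(row)
```

A matrix like this is the kind of input that SVM implementations accepting a precomputed kernel expect, for instance scikit-learn's SVC with kernel="precomputed".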

Best Regards
Esben Jannik Bjerrum