Master your molecule generator 2. Direct steering of conditional recurrent neural networks (cRNNs)

Esben Bjerrum / November 12, 2019 / Blog, Cheminformatics, Neural Network, RDKit, Science / 27 comments

A long time ago in a GPU far, far away, the deep learning rebels are happy. They have created new ways of working with chemistry using deep learning technology and condemned "the old ways" to be a mere provider of SMILES strings. But then something appears in deep space …
In a previous blog post I covered how to create a molecular encoder and use the generated latent space to control the generation of novel molecules. I've also covered the unbiased version of the generator in what now seems like a long time ago, so that blog post may look a bit technically outdated. Together with Panagiotis-Christos Kotsias, a bright, young graduate scientist, and a couple of other researchers, we've investigated how we can use pre-calculated descriptors and activity predictions to get closer to the holy grail of direct inverse QSAR. The details were recently published as a ChemRxiv preprint, and the code and two pre-trained models were made available on GitHub. The architecture is in a way related to existing auto- and heteroencoder architectures, but the encoder part is not created via deep learning. Instead, the molecular encoding is done with existing methods from the cheminformatics toolkit RDKit, as illustrated below. I like the way we can combine existing engineered methods with new deep learning technology to create a hybrid method, and I hope that this combination will in the future bring along a new balance of the fo^D^D source. (My teenage sons would scorn me for misuse of memes, but I couldn't resist….)
The network adjusts its output to match these inputs: for example, it will probably not put a lot of hydroxy groups (O) in a molecule we just told it has a large logP, but rather put a higher probability on carbons and thereby attain a lower overall loss. If the amount of information is high, as when the conditional RNN is fed a structural fingerprint, the number of possible molecules will be low, and the network will be close to a perfect 1:1 autoencoder molecular mapping. But if the amount of chemical information is lower, like the six descriptors plus predicted activity used in the tutorial below, the network will output orders of magnitude more molecules and be intermediate in performance between autoencoders and unbiased RNN molecular generators. It can also be trained on predictions from a QSAR model and thus solve the inverse QSAR task directly. The code in the GitHub repository contains pre-trained models, and in this blog post I'll show how to use one of them to generate novel molecules.
Illustration of how to change an autoencoder architecture into a conditional RNN
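
To make the idea concrete, here is a minimal, untrained sketch of such a conditional decoder written with tf.keras. The sizes (a 7-value condition vector, SMILES of up to 142 characters over a 35-character vocabulary, 256 LSTM units) are taken from the model summary shown further down, but the single LSTM layer and the rest of the layout are simplifying assumptions, not the exact published architecture, which stacks several recurrent layers.

from tensorflow.keras import layers, Model

n_conditions = 7           # e.g. logP, TPSA, MW, QED, HBA, HBD, predicted activity
maxlen, n_chars = 142, 35  # max SMILES length and character vocabulary size

# The conditions replace the learned latent vector of an autoencoder
cond_in = layers.Input(shape=(n_conditions,), name="conditions")
h0 = layers.Dense(256, activation="relu")(cond_in)  # initial hidden state
c0 = layers.Dense(256, activation="relu")(cond_in)  # initial cell state

# Teacher-forced decoder: previous characters in, next-character probabilities out
smiles_in = layers.Input(shape=(maxlen, n_chars), name="smiles_one_hot")
x = layers.LSTM(256, return_sequences=True)(smiles_in, initial_state=[h0, c0])
char_probs = layers.TimeDistributed(layers.Dense(n_chars, activation="softmax"))(x)

sketch = Model([cond_in, smiles_in], char_probs)
sketch.compile(optimizer="adam", loss="categorical_crossentropy")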

Installation of the Deep Drug Coder

Follow the instructions in the GitHub repository for how to install the module. In short: git clone https://github.com/pcko1/Deep-Drug-Coder.git, create a conda environment from the provided .yml file, activate your conda or virtualenv environment, cd Deep-Drug-Coder, python setup.py install. It should then be possible to load the ddc_v3 class.
The code seems to break with the recent upgrade of NumPy to 1.17, but it has been tested to work with 1.16.

import numpy as np
import pickle
np.__version__
    '1.16.5'
from ddc_pub import ddc_v3 as ddc
    Using TensorFlow backend.

Using the pre-trained models

The git project should come with two pre-trained models. The "phys-chem" pcb model is steered by a basic selection of molecular descriptors as well as a feature linked to the probability of being active in a DRD2 QSAR model. The new version of DDC includes the vectorizer from molvecgen directly, but in order to use some of the previously trained models, the molvecgen package from https://github.com/EBjerrum/molvecgen also needs to be installed in the environment.

import molvecgen

Loading the model from the zip file (the path is given without the extension!) will load the needed files, build the decoder model and show a summary of the loaded model.

model_name = "Deep-Drug-Coder/models/pcb_model"
model = ddc.DDC(model_name=model_name)
    Initializing model in test mode.
    Loading model.
    'mol_to_latent_model' not found, setting to None.
    /projects/mai/kfxl284/envs/ddc_env/lib/python3.6/site-packages/scikit_learn-0.21.3-py3.6-linux-x86_64.egg/sklearn/base.py:306: UserWarning: Trying to unpickle estimator StandardScaler from version 0.20.2 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
      UserWarning)
    /projects/mai/kfxl284/envs/ddc_env/lib/python3.6/site-packages/Keras-2.2.5-py3.6.egg/keras/engine/saving.py:310: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
      warnings.warn('No training configuration found in save file: '
    Loading finished in 8 seconds.
    Model: "model_1"
    __________________________________________________________________________________________________
    Layer (type)                    Output Shape         Param #     Connected to                     
    ==================================================================================================
    Latent_Input (InputLayer)       (None, 7)            0                                            
    __________________________________________________________________________________________________
    Decoder_Inputs (InputLayer)     (None, 142, 35)      0                                            
    __________________________________________________________________________________________________
    latent_to_states_model (Model)  [(None, 256), (None, 18432       Latent_Input[0][0]               
    __________________________________________________________________________________________________
    batch_model (Model)             (None, 142, 35)      1364771     Decoder_Inputs[0][0]             
                                                                     latent_to_states_model[1][0]     
                                                                     latent_to_states_model[1][1]     
                                                                     latent_to_states_model[1][2]     
                                                                     latent_to_states_model[1][3]     
                                                                     latent_to_states_model[1][4]     
                                                                     latent_to_states_model[1][5]     
    ==================================================================================================
    Total params: 1,383,203
    Trainable params: 1,378,595
    Non-trainable params: 4,608
    __________________________________________________________________________________________________
    None

Calculating conditions from an existing molecule

Before we can generate molecules we need to feed in the target conditions. It is a good idea to initially take the calculated properties from an existing molecule to get values that are realistic and in congruence, as unrealistic and contradictory settings will lead to problems for the generation of molecules: the molecules will not meet the targets and the validity percentage of the generated molecules will eventually collapse. So we write a small function that returns a list with the conditions that the loaded model was trained on. Let's use the classical aspirin molecule, but try to find compounds that are DRD2 active with a high probability. This is entirely for illustrative purposes; it would probably be better to start with a compound closer to the actives in the intended activity model, but let's see how it goes. Six of the properties for the model are calculated with standard RDKit functions, whereas the last one is the active probability from a DRD2 QSAR model.

qsar_model_name = "Deep-Drug-Coder/models/qsar_model.pickle"
with open(qsar_model_name, "rb") as file:
    qsar_model = pickle.load(file)["classifier_sv"]
    /projects/mai/kfxl284/envs/ddc_env/lib/python3.6/site-packages/scikit_learn-0.21.3-py3.6-linux-x86_64.egg/sklearn/base.py:306: UserWarning: Trying to unpickle estimator SVC from version 0.20.2 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
      UserWarning)
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, rdMolDescriptors, AllChem, QED
from rdkit.Chem.Draw import IPythonConsole
# Suppress RDKit logging from invalid SMILES and sanitization
from rdkit import rdBase
rdBase.DisableLog('rdApp.*')

def get_descriptors(mol):
    # Physico-chemical descriptors used as conditions by the pcb model
    logp  = Descriptors.MolLogP(mol)
    tpsa  = Descriptors.TPSA(mol)
    molwt = Descriptors.ExactMolWt(mol)
    hba   = rdMolDescriptors.CalcNumHBA(mol)
    hbd   = rdMolDescriptors.CalcNumHBD(mol)
    qed   = QED.qed(mol)

    # Calculate the Morgan fingerprint (ECFP4-like) for the QSAR model
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    ecfp4 = np.zeros((2048,))
    DataStructs.ConvertToNumpyArray(fp, ecfp4)
    # Predict activity and pick only the probability of the active class
    active_probability = qsar_model.predict_proba([ecfp4])[0][1]
    return [logp, tpsa, molwt, qed, hba, hbd, active_probability]
mol = Chem.MolFromSmiles("c1cccc(OC(=O)C)c1C(=O)O")
display(mol)
conditions = get_descriptors(mol)
conditions


[1.3100999999999998,
63.60000000000001,
180.042258736,
0.5501217966938848,
3,
1,
9.431141115603303e-06]
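
For readability, the conditions can be paired with their names; a small convenience snippet (the order matches the list returned by get_descriptors above):

condition_names = ["logp", "tpsa", "molwt", "qed", "hba", "hbd", "active_probability"]
print(dict(zip(condition_names, conditions)))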

Sampling molecules with similar properties

We'll reset the DRD2 activity probability to 0.9 to search for active ligands with properties otherwise similar to aspirin.

conditions[-1] = 0.9

We'll sample the Deep Drug Coder multiple times with the same conditions and examine the diversity of the output. Can we really match the input conditions, and will our QSAR model for DRD2 predict the compounds as being active? The molecular weight from the seed compound may be a bit low for the DDC to solve the task, but let's see what happens.

target = np.array(conditions)
smiles_out, _ = model.predict(latent=target, temp=1)
Chem.MolFromSmiles(smiles_out)


Running it a few times shows some "interesting" molecules, but also sometimes gives validation errors. We'll use the batch mode to generate a set of molecules and check the properties. The model.batch_input_length property specifies the number of predicted SMILES in batch mode, but resetting it triggers a rebuild of the model. Its default is 256, and setting it to large values puts strain on the GPU memory and sometimes leads to delays, so it may be faster to predict a few batches of 256 instead.

#model.batch_input_length = 256
smiles_out = []
for i in range(4):
    smiles, _ = model.predict_batch(latent=target.reshape(1,-1), temp=1.0)
    smiles_out.append(smiles)
smiles_out = np.concatenate(smiles_out)
smiles_out.shape
    (1024,)
mols = [Chem.MolFromSmiles(smi) for smi in smiles_out]

We'll just check the molecules with sanitization to ensure they are well accepted by RDKit.

def sanitize(mol):
    try:
        Chem.SanitizeMol(mol)
        return mol
    except Exception as e:
        print(e)
        
sani_mols = [sanitize(mol) for mol in mols if mol != None]
    Sanitization error: Explicit valence for atom # 12 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 13 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 11 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 13 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 12 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 13 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 6 N, 5, is greater than permitted
len(sani_mols)/len(smiles_out)
    0.80859375

Roughly 80% validity; this may indicate conditions that are slightly self-contradictory. Good, self-consistent conditions usually give above 85%. Let's compute the properties and see if they match our targets.

properties = [get_descriptors(mol) for mol in sani_mols if mol != None]
import pandas as pd
target_names = ["logp", "tpsa", "molwt", "qed", "hba", "hbd", "active_probability"]
data = pd.DataFrame(properties, columns=target_names)
data.head()
logp tpsa molwt qed hba hbd active_probability
0 1.35750 78.02 197.164046 0.301813 3 1 0.667666
1 0.91848 64.33 188.058577 0.665338 4 1 0.000024
2 0.19990 67.48 181.121512 0.595266 3 2 0.004157
3 2.00240 69.94 194.026232 0.444074 4 1 0.000183
4 1.26688 60.07 176.093072 0.655230 3 1 0.020928
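
Before plotting, a quick numeric comparison of the sampled property means against the targets can be handy; a small convenience check using what is already defined above:

# Mean of the sampled properties next to the seeded target conditions
comparison = pd.DataFrame({"target": target, "sampled_mean": data.mean()}, index=target_names)
print(comparison)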

Plotting the distributions of the sampled molecules

target_dict = {t[0]:t[1] for t in zip(target_names, target)}
axes = data.hist(bins=25,figsize=(8,6))
for ax in axes.flatten():
    title = ax.get_title()
    if title:
        ax.axvline(x=target_dict[title], color='r', linestyle='dashed', linewidth=2)


The properties seem to follow distributions around the targets (red lines), except for QED and active_probability. The jump from aspirin's properties to the properties of a DRD2 active molecule may be too large. We'll try with a DRD2 active molecule as the seed. Here we'll choose the well-known anti-psychotic chlorpromazine, which acts as an antagonist on the DRD2 receptor.

Trying another molecular seed

mol2 = Chem.MolFromSmiles("CN(C)CCCN1C2=CC=CC=C2SC3=C1C=C(C=C3)Cl")
display(mol2)
conditions = get_descriptors(mol2)
print(conditions)
target = np.array(conditions)


[4.894400000000004, 6.48, 318.095747288, 0.7918053413291345, 3, 0, 0.9939953772318398]
Not surprisingly, it is predicted as active by the QSAR model. There's a good chance that it was already part of the training set. Next we create the SMILES strings, convert them to molecules, sanitize them and compute the percentage validity.

smiles_out = []
for i in range(4):
    smiles, _ = model.predict_batch(latent=target.reshape(1,-1), temp=1)
    smiles_out.append(smiles)
smiles_out = np.concatenate(smiles_out)
  
mols = [Chem.MolFromSmiles(smi) for smi in smiles_out]
sani_mols = [sanitize(mol) for mol in mols if mol != None]
len(sani_mols)/len(smiles_out)
    0.900390625

This is a much better validity range, but we also used exact seeding conditions from an existing molecule. Let's quickly look at the property distributions of the resulting molecules.

properties = [get_descriptors(mol) for mol in sani_mols if mol != None]
data = pd.DataFrame(properties, columns=target_names)
target_dict = {t[0]:t[1] for t in zip(target_names, target)}
axes = data.hist(bins=25,figsize=(8,6))
for ax in axes.flatten():
    title = ax.get_title()
    if title:
        ax.axvline(x=target_dict[title], color='r', linestyle='dashed', linewidth=2)


Again we see more or less narrow distributions around the target properties, and there's a much larger fraction of predicted actives. We can browse the molecules with a small interactive ipywidget. WordPress doesn't support those, so it's just an image in this blog post, but it's quite useful in a Jupyter notebook.

from ipywidgets import interact
def showmol(i):
    display(sani_mols[i])
    
_ = interact(showmol, i = range(len(sani_mols)))
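
If an interactive widget is not an option, a static grid image is a simple alternative; a small sketch using RDKit's drawing helpers (the number of molecules shown is arbitrary):

from rdkit.Chem import Draw
# Render the first 12 sanitized molecules as a static grid
Draw.MolsToGridImage(sani_mols[:12], molsPerRow=4, subImgSize=(200, 200))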

Adjusting one of the properties independently

The logP of the seed compound was close to 5, but let's assume the scenario that we want to lower the logP value. We take the seed conditions from before but lower the target logP to 3.

conditions[0] = 3
target = np.array(conditions)
smiles_out = []
for i in range(4):
    smiles, _ = model.predict_batch(latent=target.reshape(1,-1), temp=1)
    smiles_out.append(smiles)
smiles_out = np.concatenate(smiles_out)
mols = [Chem.MolFromSmiles(smi) for smi in smiles_out]
sani_mols = [sanitize(mol) for mol in mols if mol != None]
len(sani_mols)/len(smiles_out)
    0.9599609375

The validity rate seems to be even higher. Good, and what do the property distributions and molecules look like?

properties = [get_descriptors(mol) for mol in sani_mols if mol != None]
data = pd.DataFrame(properties, columns=target_names)
target_dict = {t[0]:t[1] for t in zip(target_names, target)}
axes = data.hist(bins=25,figsize=(8,6))
for ax in axes.flatten():
    title = ax.get_title()
    if title:
        ax.axvline(x=target_dict[title], color='r', linestyle='dashed', linewidth=2)


This is interesting. We lowered the logP to 3 and suddenly we get better validity and a higher proportion of predicted actives. This could be because the distribution of the actives used to train the QSAR model was centered around a logP of 3, so the model has learned from a lot more molecules close to this logP value (the distribution is shown in Figure 2 of the ChemRxiv preprint). We can take a brief look at the molecules with an interaction widget as before.

_ = interact(showmol, i = range(len(sani_mols)))
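
To put a number on the impression of more actives, we can count how many of the sampled molecules the QSAR model predicts as active; the 0.5 cutoff is an arbitrary assumption:

# Fraction of sampled molecules above an arbitrary 0.5 activity-probability cutoff
frac_active = (data["active_probability"] > 0.5).mean()
print(f"Predicted active fraction: {frac_active:.2f}, mean logP: {data['logp'].mean():.2f}")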

Summary

In this blog post it was demonstrated how to use a pre-trained Deep Drug Coder model to generate molecules with specified properties. We saw how it can be important to select the right properties, as some tasks can't be solved completely when the selected conditions contradict each other. Selecting an existing active compound to derive the seed conditions can give a larger proportion of predicted actives, while still allowing adjustment of some of the properties, although in our example this may have been easier because the property was changed towards a region that was highly populated in the training data. This methodology for steering molecular generation with pre-specified properties such as logP and molecular weight makes it much easier for a human-in-the-loop to select which regions could be interesting for generation of datasets in a given drug discovery project context. The method can thus be used to generate huge datasets of novel molecules that can then be processed through further filters and methods (e.g. in-silico filters, ADME, tox, docking, synthesizability assessment).
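
As a tiny illustration of such downstream processing, here is a sketch that deduplicates the generated set by canonical SMILES and applies a simple property filter; the QED cutoff is an arbitrary stand-in for real project filters:

# Deduplicate by canonical SMILES and keep only reasonably drug-like molecules
unique = {Chem.MolToSmiles(mol): mol for mol in sani_mols if mol is not None}
filtered = [mol for mol in unique.values() if QED.qed(mol) > 0.6]
print(f"{len(unique)} unique molecules, {len(filtered)} pass the QED > 0.6 filter")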

Happy Deep Learning

Panagiotis-Christos and Esben


27 Comments

  1. Thank you for illustrating this work step-by-step. I recently read your paper and find it fascinating!

    I'm trying to reproduce this tutorial. Everything looks fine until "sani_mols = [sanitize(mol) for mol in mols if mol != None]", where the output doesn't look like your output.

    Here, RDKit complains about many SMILES errors, and eventually there is a segmentation fault. Do you have any suggestions to avoid it? Thank you in advance for looking into this.

    The following are the error messages:
    RDKit ERROR: [22:30:41] Can't kekulize mol. Unkekulized atoms: 0 1 2 3 6 12 13
    RDKit ERROR:
    [22:30:41] SMILES Parse Error: syntax error while parsing: c1[nH]nc(C2CCCN(C)C2)o1[N=N
    RDKit ERROR: [22:30:41] SMILES Parse Error: syntax error while parsing: c1[nH]nc(C2CCCN(C)C2)o1[N=N
    [22:30:41] SMILES Parse Error: Failed parsing SMILES 'c1[nH]nc(C2CCCN(C)C2)o1[N=N' for input: 'c1[nH]nc(C2CCCN(C)C2)o1[N=N'
    RDKit ERROR: [22:30:41] SMILES Parse Error: Failed parsing SMILES 'c1[nH]nc(C2CCCN(C)C2)o1[N=N' for input: 'c1[nH]nc(C2CCCN(C)C2)o1[N=N'
    [22:30:41] Can't kekulize mol. Unkekulized atoms: 0 1 2 3 7 8 9 10 11

    RDKit ERROR: [22:30:41] Can't kekulize mol. Unkekulized atoms: 0 1 2 3 7 8 9 10 11
    RDKit ERROR:
    [22:30:41] Can't kekulize mol. Unkekulized atoms: 0 1 2 3 4 5 10 12 13

    RDKit ERROR: [22:30:41] Can't kekulize mol. Unkekulized atoms: 0 1 2 3 4 5 10 12 13
    RDKit ERROR:
    [22:30:41] Can't kekulize mol. Unkekulized atoms: 1 2 3 4 6 7 8 9 11

    RDKit ERROR: [22:30:41] Can't kekulize mol. Unkekulized atoms: 1 2 3 4 6 7 8 9 11
    RDKit ERROR:
    [22:30:41] SMILES Parse Error: extra close parentheses while parsing: c1cc(CCN2CCN=C2)[nH]o1)=O
    RDKit ERROR: [22:30:41] SMILES Parse Error: extra close parentheses while parsing: c1cc(CCN2CCN=C2)[nH]o1)=O
    Segmentation fault (core dumped)

    1. RDKit seemingly chokes on some of the SMILES; you may want to be more careful by putting the SMILES conversion into a try/except statement rather than the simple list comprehension I use.

  2. Pingback: GraphINVENT for molecular graph generation | Cheminformania

  3. First of all, great manuscript.
    Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks.
    I have been a fan of cheminformania.com for some time!

    I have managed to get your code to work pretty seamlessly and also to combine aspects with some of your other blog posts.

    However, I was wondering if it is possible to train your pcb model against a different set of descriptors?
    apologies if this is something simple and covered elsewhere.

    Thanks,
    Charles

    1. Hi Charles,
      Great to hear that you could get the code examples to work :-), and yes, you can train your pcb model against a different set of descriptors. Of course your mileage may vary, depending on how many you choose and their relationship with structural features.

      Best Regards
      Esben

  4. Dear Professor
    I'm a student from China at Nanjing University. I find this work of yours very helpful for my field as well, so I am trying your code. But I am having problems: my code is not running and I get low-memory errors. So I would like to ask what kind of computer configuration is required to run your code?

    1. Hi Zhang,
      Thanks for your interest in our work. I'm sorry to hear that you have memory issues, but you are not telling me if it's GPU or CPU memory. As far as I remember we worked with 32GB RAM and 8GB GPU RAM. You can try to set the batch size down; that will lower the memory consumption.

      Best Regards
      Esben

      1. Dear Professor

        Thank you so much for writing back! But I tested your GitHub code the way you did, and I found that the last step, model.fit, still fails, even with batch_size set to 8. My CPU has 16GB of RAM and my GPU is an RTX 3060 with 6GB. One more question: could the CUDA and cuDNN versions you require be incompatible with my graphics card? What do I have to do to run your code?

        1. Sorry to hear that you still have issues. I don't know exactly what versions of cuDNN and CUDA are compatible with your graphics card. Isn't the information available on Nvidia's homepage?

  5. Dear Professor
    I feel like I have really hit a bottleneck in reproducing your code. No matter how I change the version of tensorflow-gpu, when I test the PCB model it reports an error like "TypeError: Cannot convert 0.5 to EagerTensor of dtype int64"; when testing the FPB model it gives an error like "AttributeError: 'list' object has no attribute 'shape'". I feel really stuck; if possible, I would really like the first author of this work to communicate with me by email, I really need your help! My email is zl16035056@163.com.

    Best Regards
    zhang

    1. I'll try to let Panagiotis know that you have questions. But you must promise to come back here and comment if you find a solution, so that other readers/users can learn from it. It has been some years since the code was released (3 years, but that's ancient in the ML/AI world), and there may have been several updates to TensorFlow that could make it problematic to run due to API changes or different defaults. What TensorFlow versions have you tried? The model is not overly complicated and is written using TensorFlow Keras, and it could be that the code needs updating. I.e. you get an error that 0.5 (a float) can't be converted to a tensor of dtype int64 (I don't recall that we used eager execution?); maybe the default tensor type has changed and it now needs an explicit dtype declaration to be of the correct type? The second error looks like a list that is not properly cast into a NumPy array or TensorFlow tensor, which can easily be done in the code with np.array(variable_name). Do you get different errors when you use different versions of TensorFlow?

      1. Dear Professor

        I appreciate you writing back! If I solve the problem, I’ll definitely post a comment here for other users to follow. I have tried several tensorflow versions (2.0.0, 2.2.0, 2.4.0, 2.6.0, 2.7.0) but they all report errors like “Cannot convert 0.5 to EagerTensor of dtype int64”. I also used some common forced type conversion methods like np.array, but they still give me this error. I’ve tried some common methods, but I still can’t solve this error well, I sincerely hope to get your help!

        Best Regards
        zhang

        1. Ah, OK, the code development predates September 2019 and is thus TensorFlow 1. There's a comment in ddc_env.yml about what version was used for the publication (1.12), and it will more likely work with other versions of TF 1 rather than TF 2. I suggest you revert all your changes (or check out a fresh branch), install TF 1.12 in your conda environment and see how it goes.

          1. Thank you so much for writing back! But the 1.12.0 version of tensorflow still can’t run the code and will have more errors.

          2. I’m sorry to hear that isn’t working either. What kind of errors? What did you do? How did you install it? How do you run it? What kind of data are you using? It’s hard to understand the problem and help you when you are not giving details.

  6. Dear professor:
    I’ve solved my problem myself and I’m posting here the reason I found it. In your code you used CudnnLSTM, I found that this function gives some errors when used in different tensorflow-gpu versions, so I changed CudnnLSTM to LSTM so that I can run the code. This is just a very simple bug, I hope it will be helpful.

    1. Very nice to hear, great find. Yes, now that you remind me, I remember that we used the CudnnLSTM network as it was accelerated with cuDNN. But the downside, as you saw, is that it is probably more picky with regard to the GPU, Nvidia driver, CUDA, cuDNN, TensorFlow, Keras, Python stack. Thanks for coming back and commenting, others may find this useful.

  7. Dear professor:
    I have solved my problem and was able to duplicate your results very well. This work of yours is well worth my time to study. But I have a few more questions about this work: while debugging your vectorizers.py code, I observed that each SMILES can be converted to a one-hot matrix, and you convert it based on each character in the SMILES. But here I have a question: for example, in a string like 'c1c(-c2c(Br)sc(Br)c2)c(Br)sc1Br', the atom Br will be split into B and r during the conversion, so the chemical information of this atom is destroyed. How do you see this problem?

    1. Hi Zhang,
      I'm not (yet?) a professor, but thanks for the politeness. With regard to tokenization and encoding schemes, there are a few different options. For this application we chose a simple character-based one-hot encoding, which works fairly well. Other projects have used atom-based encoding, so that e.g. Br gets encoded as 'Br' and not 'B', 'r', where the first 'B' would be indistinguishable from e.g. encoding boron as 'B'. In practice the difference is very small for well-trained networks; it's child's play for an LSTM network to learn that there is a difference between a 'B' followed by an 'r' and the boron case 'B'. Other researchers have gone even further and used SMILES-aware byte-pair encoding schemes, e.g. https://pubs.acs.org/doi/10.1021/acs.jcim.0c01127, where they find common stretches of characters to encode as tokens. A master's project I supervised showed little benefit of these schemes when it comes to accuracy: https://odr.chalmers.se/server/api/core/bitstreams/22f1d0c7-b2c8-4300-b464-9b4227fddd7e/content, although they did increase training and inference speeds. Today, I would probably use the SmilesAtomTokenizer from PySMILESUtils and an embedding layer rather than one-hot encoding of a character-based one.

  8. Hi, thank you for your great work. I find it very helpful for my project. I have managed to make it work, even though there was a TF-CUDA-cuDNN version problem on my 3080.
    One small question. I am trying to apply the encoder and decoder parts in my work. I have tried the pre-trained ChEMBL model provided in the repo, but it seems that 7 out of the 25 test SMILES from my data were decoded differently. Is this a reasonable decoding result?
    And I have not figured out how to train my own pre-trained model. This might be my mistake… It seems that I can train the model by the code part "To *train* a blank model with encoder (autoencoder)". But I can't figure out what the param y means. What I have is only a list of SMILES. Is there any instruction that can help me train my own pre-trained model to do the encoding and decoding? Thanks a lot!

    1. Hi Anna,
      Thanks for your interest in our work, I'm glad that it is helpful for you. I'm not sure in what setup you tested the ChEMBL model, regeneration from fingerprints? 7/25 sounds a bit lower than the reconstruction percentage we found as far as I remember, but do check the original preprint or publication. If it's lower, it could be that your molecules are not "ChEMBL-like", e.g. bigger, with unusual combinations of substructures etc., and you would need to train or fine-tune a specific model for your task.
      I'm not sure what code snippet you are talking about, but y usually refers to the output of the model, so I would guess it's your list of SMILES. Your input X would then be the fingerprints or calculated descriptors of the molecules, as we train our model to go from descriptors or FP to SMILES.

  9. Dear Esben Jannik Bjerrum
    The code in your github repository as well as the code in your blog post are both generating this error.

    InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNV2' used by {{node Decoder_LSTM_0_7/CudnnRNNV2}} with these attrs: [dropout=0, seed=0, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", is_training=true, seed2=0]
    Registered devices: [CPU, XLA_CPU]
    Registered kernels:
    device='GPU'; T in [DT_DOUBLE]
    device='GPU'; T in [DT_FLOAT]
    device='GPU'; T in [DT_HALF]

    [[Decoder_LSTM_0_7/CudnnRNNV2]] [Op:__inference_keras_scratch_graph_15103]

    It would be greatly appreciated if you could assist me in resolving this issue.

    1. Hmm, yes. The code base is from 2019 and has started to show its age as Keras and TF move on. There has been trouble with the CudnnLSTM cells used; try switching those to regular LSTM cells.

  10. The blog post code works fine when I turn off CUDA and the GPU. Despite these efforts, the heteroencoder model has not been successful.

    1. I’m sorry to hear that it hasn’t been successful for you.

  11. Dear professor:
    I'm sorry to disturb you. When I run the DDC object in your code, self.mode==test, and when it later calls self.__build_model(), self.mol_to_latent_model is None and raises an error. I checked the code and it seems that mol_to_latent_model is not initialized when self.mode==test. I don't know how to solve this problem, and I really need your help!

    The following is the error message:
    Loading heteroencoder model from: /root/gys/code/TargetGAN/autoencoder/chembl_pretrained
    Initializing model in test mode.
    Loading model.
    'mol_to_latent_model' not found, setting to None.
    WARNING:tensorflow:No training configuration found in the save file, so the model was *not* compiled. Compile it manually.
    WARNING:tensorflow:No training configuration found in the save file, so the model was *not* compiled. Compile it manually.
    Loading finished in 3 seconds.
    Traceback (most recent call last):
    File "run.py", line 2, in
    from encode import encode
    File "/root/gys/code/TargetGAN/encode.py", line 4, in
    from autoencoder import autoencoder
    File "/root/gys/code/TargetGAN/autoencoder/autoencoder.py", line 23, in
    model = load_model()
    File "/root/gys/code/TargetGAN/autoencoder/autoencoder.py", line 19, in load_model
    model = ddc.DDC(model_name=path)
    File "/root/gys/code/TargetGAN/ddc_pub/ddc_v3.py", line 261, in __init__
    self.__build_model()
    File "/root/gys/code/TargetGAN/ddc_pub/ddc_v3.py", line 705, in __build_model
    x = self.mol_to_latent_model(encoder_inputs)
    TypeError: 'NoneType' object is not callable

    1. Something seems to not go as intended in the model loading stage. This code is by now fairly old, so there may have been changes to the Keras API since it was written. What Keras version are you using? You would likely need to examine the model loading routine in the DDC code and check it manually, line by line, in an interactive Python session to figure out why you get a None object rather than a Keras model when the code tries to load the model.
      BTW, I’m not a professor as no university wants to have me.

      Best Regards
      Esben
