• Chapter 9: De Novo Molecular Design with Chemical Language Models


    Reading notes on "Artificial Intelligence in Drug Design"


    1.Introduction


    • These molecular representations are human-made models designed to capture certain properties. Like natural language, they possess syntactic properties (rules that determine which strings are valid) and semantic properties (the chemical meaning the symbols convey).
    • Several factors contribute to the popularity of SMILES in the context of deep learning:
      • SMILES are strings, which renders them suitable as inputs to sequence modeling algorithms.
      • Compared to other string-based molecular representations such as InChI, SMILES strings have a straightforward syntax. This permissive syntax allows for a certain “flexibility of expression”.
      • SMILES are easily legible and interpretable by humans.
    • The tool example demonstrates how deep learning methods can be employed to generate sets of new SMILES strings, inspired by the structures of four known retinoid X receptor (RXR) modulators, using a recently developed method, bidirectional molecule generation with alternate learning (BIMODAL). The program code is freely available at https://github.com/ETHmodlab/de_novo_design_RNN (see Subheading 2.1).

    2.Materials

    2.1.Computational Methods

    • All calculations were performed using Python 3.7.4 in Jupyter Notebooks. The models rely on PyTorch and RDKit.
    • After installing Anaconda and Git, we can run the code below:
    git clone https://github.com/ETHmodlab/de_novo_design_RNN.git
    cd de_novo_design_RNN
    conda env create -f environment.yml
    conda activate de_novo
    cd example
    jupyter notebook
    

    2.2.Data

    • To emulate a realistic scenario, we provide a tool molecule library containing four RXR modulators (Fig. 2). Molecule 1 is bexarotene, a pharmacological RXR agonist. Molecules 2–4 were obtained from ChEMBL and have a potency on RXR (expressed as EC50, IC50, Ki, or Kd) below 0.8 μM. This set of bioactive compounds (available in the repository, under “/example/fine_tuning.csv”) will be used to generate a focused library of de novo designs.

    3.Methods

    3.1.SMILES Notation


    • Stereochemical information is not mandatory but can be specified. The configuration of double bonds is specified using the characters “/” and “\” to indicate directional single bonds adjacent to a double bond; the configuration of tetrahedral carbons is specified by “@” or “@@”.
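    • To see these tokens in practice, one can round-trip stereochemistry-bearing SMILES through RDKit (a minimal sketch; the example molecules are ours, not from the chapter):
    from rdkit import Chem

    # trans-1,2-dichloroethene: "/" encodes the double-bond configuration
    mol = Chem.MolFromSmiles("Cl/C=C/Cl")
    print(Chem.MolToSmiles(mol))  # canonical SMILES, stereochemistry retained

    # L-alanine: "@@" marks the configuration of the tetrahedral carbon
    mol = Chem.MolFromSmiles("N[C@@H](C)C(=O)O")
    print(Chem.MolToSmiles(mol))

    # stereochemistry can also be stripped, as done in data preparation (3.4.1)
    Chem.RemoveStereochemistry(mol)
    print(Chem.MolToSmiles(mol))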

    3.2.Recurrent Neural Networks

    3.2.1.Recurrent Neural Networks


    • To generate new SMILES strings, the start token G can be used as the first input character; the RNN model will then extend this one-token string by attaching valid SMILES characters sequentially, until the end token is output.
    • While “vanilla” RNNs can, in principle, handle sequences of any length, in practice they struggle with long-term dependencies, which can lead to vanishing gradients during network training. To overcome this limitation, alternative architectures have been proposed, the most popular of which is the long short-term memory (LSTM) network.
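    • The token-by-token generation loop described above can be sketched as follows (a minimal illustration, assuming a trained PyTorch model `rnn` that maps a token index and a hidden state to next-token logits; this interface is our assumption, not the repository’s API):
    import torch

    def sample_smiles(rnn, token2idx, idx2token, max_len=100):
        # grow a SMILES string token by token, starting from the start token "G"
        token = token2idx["G"]
        hidden = None  # let the model initialize its own hidden state
        out = []
        for _ in range(max_len):
            logits, hidden = rnn(torch.tensor([[token]]), hidden)
            probs = torch.softmax(logits.squeeze(), dim=-1)
            token = torch.multinomial(probs, 1).item()  # sample the next token
            if idx2token[token] == "E":  # end token: stop extending
                break
            out.append(idx2token[token])
        return "".join(out)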

    3.2.2.BIMODAL

    • The BIMODAL approach is an RNN-based method that was specifically designed for SMILES string generation.
    • SMILES strings are non-univocal (the same molecule can be written as many different strings) and are read in a fixed direction, although molecules themselves have no inherent directionality. This non-univocity and non-directionality motivated the development of the BIMODAL sequence generation method, which reads and generates SMILES strings in both the forward and backward directions.


    • Similar to bidirectional RNNs for supervised learning, BIMODAL consists of two RNNs, each for reading the sequence in one direction. The information captured by each RNN is then combined to provide a joint prediction.
    • The effect of “G” token positioning on SMILES string generation is discussed in Subheading 3.3.

    3.2.3.One-Hot Encoding

    • One-hot encoding converts each token of a SMILES string into a binary vector whose length equals the size of the token dictionary, with a single nonzero entry at the index of that token.
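    • As an illustration, a minimal sketch of one-hot encoding a tokenized SMILES string (the token dictionary here is a toy fragment, not the full BIMODAL dictionary):
    import numpy as np

    tokens = ["G", "E", "C", "c", "N", "O", "(", ")", "=", "1"]  # toy dictionary
    token2idx = {t: i for i, t in enumerate(tokens)}

    def one_hot(token_seq, token2idx):
        # each token becomes a row with a single 1 at its dictionary index
        x = np.zeros((len(token_seq), len(token2idx)), dtype=np.float32)
        for pos, tok in enumerate(token_seq):
            x[pos, token2idx[tok]] = 1.0
        return x

    print(one_hot(["G", "C", "(", "C", ")", "O", "E"], token2idx).shape)  # (7, 10)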

    3.2.4.Transfer Learning

    • Transfer learning: a network is first pretrained on a large, general corpus of SMILES strings and subsequently fine-tuned on a small set of molecules with the desired bioactivity (see Subheadings 3.4.2 and 3.4.3).

    3.3.Training and Sampling Settings


    3.3.1.Model Type

    • The code repository contains two types of models, namely the BIMODAL and the classical “forward” RNNs. The BIMODAL method was used for the worked example. Users can adapt the computational pipeline to forward RNNs, as explained in the accompanying Jupyter notebook.

    3.3.2.Network Architecture and Size

    • The published BIMODAL architecture is based on two levels of information processing. Each processing level is characterized by two LSTM layers (one for forward and one for backward processing), whose information is combined to predict the next token (Subheading 3.3.3).
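    • This two-level, two-direction layout can be sketched structurally in PyTorch as follows (our simplified sketch, not the authors’ implementation; layer sizes are placeholders, and the exact way BIMODAL alternates directions differs in detail):
    import torch
    import torch.nn as nn

    class BidirectionalGenerator(nn.Module):
        # two processing levels; each level reads the sequence in both
        # directions with its own LSTM, and the two directions are merged
        # before predicting the next token
        def __init__(self, n_tokens, hidden=256):
            super().__init__()
            self.fwd1 = nn.LSTM(n_tokens, hidden, batch_first=True)
            self.bwd1 = nn.LSTM(n_tokens, hidden, batch_first=True)
            self.fwd2 = nn.LSTM(2 * hidden, hidden, batch_first=True)
            self.bwd2 = nn.LSTM(2 * hidden, hidden, batch_first=True)
            self.out = nn.Linear(2 * hidden, n_tokens)

        def forward(self, x_fwd, x_bwd):
            # x_fwd / x_bwd: one-hot sequences read in opposite directions
            h_f, _ = self.fwd1(x_fwd)
            h_b, _ = self.bwd1(x_bwd)
            h = torch.cat([h_f, h_b], dim=-1)  # combine both directions
            h_f, _ = self.fwd2(h)
            h_b, _ = self.bwd2(h)
            return self.out(torch.cat([h_f, h_b], dim=-1))  # next-token logits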

    3.3.3.Starting Point Positioning

    • With BIMODAL, the start token “G” is not restricted to the beginning of the string: it can be placed at a fixed position or at a random position within the SMILES string, and the sequence is then grown in both directions from that starting point.

    3.3.4.Augmentation

    • The possibility of placing the start token at any arbitrary position of the string (i.e., random starting position) enables a novel type of data augmentation that was introduced for BIMODAL. For each training molecule, one can generate n repetitions of the same SMILES string, in which the start token is placed at a different random position.
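    • A minimal sketch of this augmentation at the token level (the function name is ours; the real implementation in the repository may differ):
    import random

    def augment_start_positions(smiles_tokens, n):
        # return n copies of the token list, each with the start token "G"
        # inserted at a different random position
        copies = []
        for _ in range(n):
            pos = random.randint(0, len(smiles_tokens))
            copies.append(smiles_tokens[:pos] + ["G"] + smiles_tokens[pos:])
        return copies

    print(augment_start_positions(list("c1ccccc1"), 3))  # benzene, 3 variants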

    3.3.5.Number of Fine-Tuning Epochs

    • The number of fine-tuning epochs is usually chosen on a case-by-case basis by considering (a) the total training time, which is the product of the training time per epoch and the number of epochs, (b) the desired structural diversity of the de novo designs, which generally decreases with an increasing number of fine-tuning epochs, and (c) the desired similarity between the de novo designs and the fine-tuning set in terms of physicochemical properties, which generally increases with longer fine-tuning.

    3.3.6.Sampling Temperature

    • Trained language models can be used as generative methods for sampling novel SMILES strings. One possible approach is temperature sampling. By setting the temperature (T), one can govern the randomness of the generated sequences: $q_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$, where z_i is the RNN prediction (logit) for the ith token, j runs over all tokens in the dictionary, and q_i is the probability of sampling the ith token.
    • In other words, SMILES are sampled using a Softmax function that is controlled by parameter T. For low values of T, the most likely token according to the estimated probability distribution is selected. With increasing values of T, the probability of selecting the most likely token decreases, and the model generates more diverse sequences (Fig. 10).
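    • In code, temperature sampling amounts to a Softmax over the logits divided by T (a minimal NumPy sketch):
    import numpy as np

    def temperature_sample(logits, T=1.0, rng=np.random.default_rng()):
        # sample a token index from softmax(logits / T);
        # small T approaches greedy selection, large T flattens the distribution
        z = np.asarray(logits) / T
        z -= z.max()  # for numerical stability
        q = np.exp(z) / np.exp(z).sum()
        return rng.choice(len(q), p=q)

    logits = [2.0, 1.0, 0.1]
    print([temperature_sample(logits, T=0.2) for _ in range(5)])  # mostly token 0
    print([temperature_sample(logits, T=2.0) for _ in range(5)])  # more diverse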

    3.3.7.Number of SMILES to Sample

    • The more SMILES strings are generated by temperature sampling, the more extensively the chemical space can be explored. Here, we sampled 1000 SMILES strings for each fine-tuning epoch, as described below.

    3.4.Generating Focused Molecule Libraries

    3.4.1.Molecule Preparation

    • The pretraining data and fine-tuning set were prepared according to the following procedure (a code sketch of these steps follows the list):
      • Removal of invalid SMILES strings, duplicates, salts, and stereochemical information. Note that the SMILES token denoting disconnected structures (“.”) is not present in the BIMODAL token dictionary. Thus, molecules containing this symbol, e.g., SMILES strings representing salt forms, cannot be used for training.
      • SMILES string canonicalization. In this work, canonical SMILES were used for two main reasons: (a) consistency with the original study, and (b) the availability of the BIMODAL augmentation strategy for the “G” token position, which makes it possible to generate a sufficient data volume without the need for additional data augmentation.
      • Removal of SMILES strings of out-of-range length. In our data pretreatment pipeline, only SMILES strings comprising 34–74 tokens were retained.
      • Addition of start and end token and data augmentation.
      • SMILES string padding.

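    • A hedged RDKit sketch of these preparation steps (character counting stands in for the real tokenizer, padding is omitted, and function and variable names are ours):
    from rdkit import Chem

    def prepare(smiles_list, min_len=34, max_len=74):
        # filter, canonicalize, and length-check SMILES before training
        kept = set()
        for smi in smiles_list:
            if "." in smi:  # disconnected structures, e.g., salts
                continue
            mol = Chem.MolFromSmiles(smi)
            if mol is None:  # invalid SMILES
                continue
            Chem.RemoveStereochemistry(mol)  # drop stereochemical information
            can = Chem.MolToSmiles(mol)  # canonical form
            if min_len <= len(can) <= max_len:  # crude proxy for token count
                kept.add(can)  # a set removes duplicates
        return sorted(kept)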

    3.4.2.Model Pretraining

    Table 4 contains the results of an evaluation of each of the eight models for the following criteria:

    • Percentage of valid and novel molecules.
    • Fréchet ChemNet Distance (FCD). Generative models should be able to sample molecules with the desired chemical and biological properties. This aspect was evaluated by computing the FCD. FCD values are based on the activations of the penultimate layer of an LSTM model (“ChemNet”) trained to predict bioactivity. The lower the FCD between two sets of molecules, the closer they are in terms of their structural and predicted biological properties.
    • Scaffold diversity and scaffold novelty.
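    • The FCD itself is the Fréchet distance between two multivariate Gaussians fitted to the ChemNet activations of the two molecule sets. A sketch assuming the activation matrices are already available (in practice, dedicated packages wrap the whole computation, including the pretrained ChemNet):
    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(act1, act2):
        # act1, act2: activation matrices (rows = molecules, columns = features)
        mu1, mu2 = act1.mean(axis=0), act2.mean(axis=0)
        c1 = np.cov(act1, rowvar=False)
        c2 = np.cov(act2, rowvar=False)
        covmean = sqrtm(c1 @ c2)
        if np.iscomplexobj(covmean):  # discard tiny numerical imaginary parts
            covmean = covmean.real
        diff = mu1 - mu2
        return diff @ diff + np.trace(c1 + c2 - 2 * covmean)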

    3.4.3.Fine-Tuning and Sampling

    • Transfer learning is performed by updating the pretrained network with the fine-tuning data.
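    • Schematically, fine-tuning reuses the pretrained weights and simply continues training on the small focused set (a minimal PyTorch sketch; the model interface and data pipeline are assumptions, not the repository’s API):
    import torch

    def fine_tune(model, batches, epochs=10, lr=1e-3):
        # model: pretrained network mapping input sequences to next-token logits
        # batches: iterable of (input, target) tensors built from the prepared
        #          fine-tuning set (e.g., /example/fine_tuning.csv)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, target in batches:
                logits = model(x)  # (batch, seq_len, n_tokens)
                loss = loss_fn(logits.flatten(0, 1), target.flatten())
                opt.zero_grad()
                loss.backward()
                opt.step()
        return model  # sample e.g. 1000 SMILES after each epoch (see 3.3.7)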
  • Original post: https://blog.csdn.net/weixin_52812620/article/details/126259966