These molecular representations are human-made models designed to capture certain properties. Like natural language, these representations possess both syntactic properties (the rules that define valid strings) and semantic properties (the chemical information the strings convey).
Several factors contribute to the popularity of SMILES in the context of deep learning:
SMILES are strings, which renders them suitable as inputs to sequence modeling algorithms.
Compared to other string-based molecular representations such as InChI, SMILES strings have a straightforward syntax. This permissive syntax allows for a certain “flexibility of expression”.
SMILES are easily legible and interpretable by humans.
The tool example demonstrates how deep learning methods can be employed to generate sets of new SMILES strings, inspired by the structures of four known retinoid X receptor (RXR) modulators, using a recently developed method, bidirectional molecule generation with alternate learning (BIMODAL). The program code is freely available at https://github.com/ETHmodlab/de_novo_design_RNN.
2. Materials
2.1. Computational Methods
All calculations were performed using Python 3.7.4 in Jupyter Notebooks. The models rely on PyTorch and RDKit.
After installing Anaconda and Git, we can run the code below:
# clone the repository and move into it
git clone https://github.com/ETHmodlab/de_novo_design_RNN.git
cd <path/to/folder>
# create and activate the conda environment
conda env create -f environment.yml
conda activate de_novo
# open the example notebook
cd example
jupyter notebook
2.2. Data
To emulate a realistic scenario, we provide a tool molecule library containing four RXR modulators (Fig. 2). Molecule 1 is bexarotene, a pharmacological RXR agonist. Molecules 2–4 were obtained from ChEMBL and have a potency on RXR (expressed as EC50, IC50, Ki, or Kd) below 0.8 μM. This set of bioactive compounds (available in the repository, under “/example/fine_tuning.csv”) will be used to generate a focused library of de novo designs.
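To inspect this set before training, the file can be loaded with pandas. This is a minimal sketch; the file path follows the repository layout quoted above, and no column names are assumed:

import pandas as pd

# Load the fine-tuning set shipped with the repository ("/example/fine_tuning.csv")
# and print its dimensions and first rows to inspect the column layout.
df = pd.read_csv("example/fine_tuning.csv")
print(df.shape)
print(df.head())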
3. Methods
3.1. SMILES Notation
Stereochemical information is not mandatory but can be specified. The configuration of double bonds is specified using the characters “/” and “\” to indicate directional single bonds adjacent to a double bond; the configuration of tetrahedral carbons is specified by “@” or “@@”.
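For illustration, the RDKit snippet below parses two SMILES strings that carry stereochemical information; this is a minimal sketch, and any SMILES-capable toolkit could be used instead:

from rdkit import Chem

# trans-2-butene: "/" marks the directional single bonds around the double bond
mol = Chem.MolFromSmiles("C/C=C/C")
print(Chem.MolToSmiles(mol))

# L-alanine: "@@" specifies the configuration of the tetrahedral alpha carbon
mol = Chem.MolFromSmiles("N[C@@H](C)C(=O)O")
print(Chem.MolToSmiles(mol))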
3.2. Recurrent Neural Networks
3.2.1. Recurrent Neural Networks
To generate new SMILES strings, the start token G can be used as the first input character; the RNN model will then extend this one-token string by attaching valid SMILES characters sequentially, until the end token is output.
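A minimal sketch of this generation loop is given below. It assumes a hypothetical trained PyTorch model rnn that returns next-token logits together with an updated hidden state, and it uses "G" and "E" as assumed start and end tokens; the actual BIMODAL implementation differs in its details:

import torch

def sample_smiles(rnn, token2idx, idx2token, max_len=100):
    """Generate one SMILES string token by token, starting from the start token."""
    tokens = []
    hidden = None
    inp = torch.tensor([token2idx["G"]])
    for _ in range(max_len):
        logits, hidden = rnn(inp, hidden)  # hypothetical model signature
        probs = torch.softmax(logits.squeeze(), dim=-1)
        idx = torch.multinomial(probs, num_samples=1).item()
        if idx2token[idx] == "E":          # assumed end token
            break
        tokens.append(idx2token[idx])
        inp = torch.tensor([idx])
    return "".join(tokens)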
While “vanilla” RNNs can, in principle, handle sequences of any length, in practice they struggle with long-term dependencies, which can lead to vanishing gradients during network training. To overcome this limitation, alternative architectures have been proposed, the most popular of which is the long short-term memory (LSTM) network.
3.2.2. BIMODAL
The BIMODAL approach is an RNN-based method that was specifically designed for SMILES string generation.
This non-univocity (the same molecule can be written as several valid SMILES strings) and non-directionality motivated the development of the BIMODAL sequence generation method, which reads and generates SMILES strings in both the forward and backward directions.
Similar to bidirectional RNNs for supervised learning, BIMODAL consists of two RNNs, each reading the sequence in one direction. The information captured by the two RNNs is then combined to provide a joint prediction.
The effect of “G” token positioning on SMILES string generation is discussed in Subheading 3.3.
3.2.3. One-Hot Encoding
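Each SMILES token is converted into a binary vector with a single nonzero entry before being fed to the network. A minimal sketch, assuming a small hypothetical token dictionary (the actual BIMODAL dictionary is larger):

import numpy as np

# Hypothetical token dictionary; the actual dictionary used by BIMODAL differs.
tokens = ["G", "E", "C", "c", "O", "N", "(", ")", "1", "=", "/", "\\", "@"]
token2idx = {t: i for i, t in enumerate(tokens)}

def one_hot(sequence):
    """Encode a token sequence as a (sequence length x dictionary size) matrix."""
    x = np.zeros((len(sequence), len(tokens)), dtype=np.float32)
    for pos, tok in enumerate(sequence):
        x[pos, token2idx[tok]] = 1.0
    return x

print(one_hot(["G", "C", "C", "O", "E"]))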
3.2.4. Transfer Learning
3.3. Training and Sampling Settings
3.3.1. Model Type
The code repository contains two types of models, namely the BIMODAL and the classical “forward” RNNs. The BIMODAL method was used for the worked example. Users can adapt the computational pipeline to forward RNNs, as explained in the accompanying Jupyter notebook.
3.3.2. Network Architecture and Size
The published BIMODAL architecture is based on two levels of information processing. Each processing level is characterized by two LSTM layers (one for forward and one for backward processing), whose information is combined to predict the next token (Subheading 3.3.3).
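The PyTorch sketch below illustrates this idea with two stacked LSTM layers per reading direction, whose outputs are concatenated before the prediction layer. It is a simplified illustration under these assumptions, not the published BIMODAL code:

import torch
import torch.nn as nn

class BimodalSketch(nn.Module):
    """Two stacked LSTM layers per reading direction; the two directions
    are concatenated to predict the next token. Simplified illustration."""
    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.fwd = nn.LSTM(hidden_size, hidden_size, num_layers=2, batch_first=True)
        self.bwd = nn.LSTM(hidden_size, hidden_size, num_layers=2, batch_first=True)
        self.out = nn.Linear(2 * hidden_size, vocab_size)

    def forward(self, x):
        e = self.embed(x)                      # (batch, seq, hidden)
        h_f, _ = self.fwd(e)                   # forward reading
        h_b, _ = self.bwd(torch.flip(e, [1]))  # backward reading
        h_b = torch.flip(h_b, [1])             # realign with forward positions
        return self.out(torch.cat([h_f, h_b], dim=-1))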
3.3.3. Starting Point Positioning
3.3.4. Augmentation
The possibility of placing the start token at any arbitrary position of the string (i.e., a random starting position) enables a novel type of data augmentation that was introduced for BIMODAL. For each training molecule, one can generate n repetitions of the same SMILES string, each with the start token placed at a different random position.
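The sketch below implements this augmentation; for simplicity it operates on single characters rather than on multi-character tokens such as "Cl" or "Br":

import random

def augment(smiles, n):
    """Create n copies of a SMILES string, each with the start token "G"
    inserted at a different random position."""
    copies = []
    for _ in range(n):
        pos = random.randint(0, len(smiles))
        copies.append(smiles[:pos] + "G" + smiles[pos:])
    return copies

print(augment("CCO", 3))  # e.g., ['CGCO', 'GCCO', 'CCOG']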
Trained language models can be used as generative methods for sampling novel SMILES strings. One possible approach is temperature sampling. By setting the temperature (T), one can govern the randomness of the generated sequences:
q_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}
where z_i is the RNN prediction (logit) for the ith token, j runs over all the tokens in the dictionary, and q_i is the probability of sampling the ith token.
In other words, SMILES are sampled using a Softmax function that is controlled by parameter T. For low values of T, the most likely token according to the estimated probability distribution is selected. With increasing values of T, the probability of selecting the most likely token decreases, and the model generates more diverse sequences (Fig. 10).
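The snippet below sketches temperature sampling from a vector of token logits z, mirroring the equation above:

import numpy as np

def temperature_sample(logits, T):
    """Sample one token index from logits z using temperature T."""
    z = np.array(logits) / T
    q = np.exp(z - z.max())   # subtract the maximum for numerical stability
    q = q / q.sum()           # q_i as defined above
    return np.random.choice(len(q), p=q)

logits = [2.0, 1.0, 0.1]
print(temperature_sample(logits, T=0.2))  # almost always the most likely token
print(temperature_sample(logits, T=2.0))  # more diverse choices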
3.3.7. Number of SMILES to Sample
The more SMILES strings are generated via temperature sampling, the more extensively the chemical space can be explored. Here, we sampled 1000 SMILES strings for each fine-tuning epoch, as described below.
3.4. Generating Focused Molecule Libraries
3.4.1. Molecule Preparation
The pretraining data and fine-tuning set were prepared according to the following procedure (a code sketch follows the list):
Removal of invalid SMILES strings, duplicates, salts, and stereochemical information. Note that the SMILES token denoting disconnected structures (“.”) is not present in the BIMODAL token dictionary. Thus, molecules containing this symbol, e.g., SMILES strings representing salt forms, cannot be used for training.
SMILES string canonicalization. In this work, canonical SMILES were used for two main reasons: (a) consistency with the original study, and (b) availability of the BIMODAL augmentation strategy for the “G” token position, which makes it possible to generate a sufficient data volume without additional data augmentation.
Removal of SMILES strings with out-of-bound dimensions. In our data pretreatment pipeline, only SMILES strings encompassing 34–74 tokens were retained.
Addition of start and end tokens, followed by data augmentation.
SMILES string padding.
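The RDKit sketch below outlines these preparation steps; the token-length filter is approximated by character count, and padding is omitted:

from rdkit import Chem

def prepare(smiles_list, min_len=34, max_len=74):
    """Sketch of the preparation pipeline: validity and salt filtering,
    stereochemistry removal, canonicalization, deduplication, length
    filtering, and addition of start ("G") and end ("E") tokens."""
    prepared = set()
    for smi in smiles_list:
        if "." in smi:                    # disconnected structures, e.g., salts
            continue
        mol = Chem.MolFromSmiles(smi)
        if mol is None:                   # invalid SMILES
            continue
        Chem.RemoveStereochemistry(mol)   # strip stereochemical information
        can = Chem.MolToSmiles(mol)       # canonical SMILES
        if min_len <= len(can) <= max_len:
            prepared.add(can)             # the set removes duplicates
    return ["G" + smi + "E" for smi in sorted(prepared)]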
3.4.2. Model Pretraining
Table 4 contains the results of an evaluation of each of the eight models for the following criteria:
Percentage of valid and novel molecules.
Fréchet ChemNet Distance (FCD). Generative models should be able to sample molecules with the desired chemical and biological properties. This aspect was evaluated by computing the FCD, whose values are based on the activations of the penultimate layer of an LSTM model trained to predict bioactivity. The lower the FCD between two sets of molecules, the closer they are in terms of their structural and predicted biological properties.
Scaffold diversity and scaffold novelty.
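The validity, novelty, and scaffold criteria can be sketched with RDKit as follows; Murcko scaffolds are used here as one possible scaffold definition, and the FCD itself can be computed with the Python package released with the original FCD publication:

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def evaluate(sampled, training):
    """Compute % valid, % novel, and the number of distinct Murcko
    scaffolds for a list of sampled SMILES. Training SMILES are
    assumed to be canonical already."""
    mols = [Chem.MolFromSmiles(s) for s in sampled]
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]
    novel = [s for s in valid if s not in set(training)]
    scaffolds = {MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in valid}
    return {
        "valid_pct": 100 * len(valid) / len(sampled),
        "novel_pct": 100 * len(novel) / len(sampled),
        "n_scaffolds": len(scaffolds),
    }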
3.4.3. Fine-Tuning and Sampling
Transfer learning is performed by updating the pretrained network on the fine-tuning data.
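A generic sketch of such a fine-tuning loop is shown below; the model and data loader are placeholders, not the repository's actual API:

import torch

def fine_tune(model, loader, epochs=10, lr=1e-4):
    """Update a pretrained next-token model on the fine-tuning set.
    `loader` is assumed to yield (input, target) index tensors; a small
    learning rate keeps the update close to the pretrained weights."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            logits = model(x)  # (batch, seq, vocab)
            loss = criterion(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
            loss.backward()
            optimizer.step()
    return model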