Data Driven Molecule Generation Using Deep Learning / submitted by Philipp Renz

(1)

Submitted by Philipp Renz Submitted at Institute of Bioinforma-tics Supervisor

Univ. Prof. Dr. Sepp Hochreiter

Co-Supervisor

Mag. Dr. Günter Klam-bauer July 2018 JOHANNES KEPLER UNIVERSITY LINZ Altenbergerstraße 69 4040 Linz, Österreich www.jku.at

Data Driven Molecule

Generation Using Deep

Learning

Master Thesis

to obtain the academic degree of

Master of Science

in the Master’s Program

Bioinformatics

(2)

(3)

Abstract

A crucial task in pharmacology, toxicology and drug development is to gen-erate molecule libraries. These libraries are the starting point of all develop-ment projects and are typically screened for molecules with desired biological properties. The success of finding a novel drug depends on the design of the initial library.

Recently, there has been increased interest in data-driven machine learn-ing approaches for generatlearn-ing molecular libraries. In this work, we add to those advances and train recurrent neural networks (RNN) to generate molecules using the SMILES representation to describe structures. We con-firm that RNNs are capable of learning the SMILES-syntax, improve the existing state-of-the-art method and create molecules similar to the ones in chemical databases, concretely ChEMBL. Our model outperforms previ-ously published models in this task with respect to established performance measures.

Additionally we point out the need for better performance measures, and we propose a new evaluation method, the Fréchet ChemblNet Distance (FCD), that can be used to assess generative models for molecules. This new metric relies on comparing the distributions of the activations of a neural net-work that is trained to predict bioactivities of molecules. The advantage of this new metric is that it takes into account chemical and biological informa-tion, as well as the diversity of generated molecules and combines everything into a single score. We show that the FCD can detect differences between sets of molecules by creating artificial datasets biased towards certain molec-ular properties. Apart from our novel contributions, the improvement of the state-of-the-art method and the introduction of a new evaluation criterion, we review existing research and point out inconsistencies and suggest poten-tial improvements and further directions of research.

(4)

(5)

Zusammenfassung

Eine wichtige Aufgabe in der Pharmakologie, Toxikologie und in der Arz-neimittelentwicklung ist die Erstellung von molekularer Bibliotheken. Diese sind der Startpunkt aller Entwicklungsprojekte und werden normalerweise durchsucht um Moleküle mit gewünschten biologischen Eigenschaften zu fin-den. Der Erfolg dieser Suche nach neuen Arzneitmitteln hängt vom Design der originalen Bibliothek ab.

Kürzlich hat es vermehrt Interesse an datengetriebenen Methoden zur Generierung molekularer Strukturen gegeben, die auf maschinellem Lernen basieren. In dieser Arbeit trainieren wir rekurrente neuronale Netze (RNN) um Moleküle in ihrer SMILES-Repräsentation zu erzeugen. Wir bestätigen, dass diese RNN fähig sind den SMILES-Syntax zu verstehen, verbessern die aktuell beste Methode und erzeugen Moleküle die denen in chemischen Datenbanken, hier ChEMBL, ähnlich sind. Unser Modell schlägt bereits ver-öffentlichte hinsichtlich etablierten Evaluationskriterien.

Weiters weisen wir darauf hin, dass bessere Evaluationskriterien benö-tigt werden, und stellen eine neue Methode zur Evaluierung, die Fréchet ChemblNet Distance (FCD), vor, die verwendet werden kann um generative Modelle die Moleküle erzeugen zu bewerten. Diese Methode beruht auf dem Vergleich der Verteilungen der Aktivierungen eines neuronalen Netzes, das trainiert wurde um Bioaktivitäten vorherzusagen. Der Vorteil dieser neu-en Metrik ist, dass sie sowohl chemische und biologische Information, als auch die Vielfältigkeit der erzeugten Moleküle berücksichtigt und alles in ei-nem einzigen Score zusammenführt. Wir zeigen dass die FCD Unterschiede zwischen zwei Sätzen von Molekülen detektieren kann, indem wir Datensät-ze erDatensät-zeugen die basierend auf gewissen molekularen Eigenschaften künstlich verzerrt werden.

Abgesehen von diesen originellen Beiträgen, der Verbesserung des Stan-des der Technik und der Einführung eines neuen Evaluationskriterium, über-denken wir bereits publizierte Forschungsergebnisse und zeigen Inkonsisten-zen auf, und schlagen potentielle Verbesserungen und neue Forschungsansät-ze vor.

(6)

(7)

Chapter 1 Introduction

Progress in medicine has been a major driver of quality of life. Among others pharmaceuticals are an important factor in modern medical practice. How-ever drug development is a very costly and a complex process. Rationalizing the design process of new drugs is very hard, as pharmacological proper-ties are hard to predict and the efficacy of a compound often seems to be governed by chance (P. Schneider and G. Schneider 2016).

To find a new drugs one has to choose candidates from all synthetically accessible compounds. As it is estimated that there are about 1060_(Reymond

et al. 2012) of those, a systematic search for drugs using lab tests is not an option. Recent advances in artificial intelligence and machine learning showed that deep neural networks can learn to solve problems which were previously thought to be too complex to be solved by computers (Silver, A. Huang, et al. 2016; Silver, Schrittwieser, et al. 2017; Mnih, Badia, et al. 2016; Mnih, Kavukcuoglu, et al. 2015). Inspired by these advances, there has also been increased activity in researching data-driven generative models with the aim of creating libraries of new drug candidates (Segler et al. 2017; Jaques et al. 2016; Olivecrona et al. 2017; Gómez-Bombarelli et al. 2018; Guimaraes et al. 2017; Benhenda 2017).

Most of the published work regarding this topic is aimed at creating molecules with certain desired properties of the created compounds, for ex-ample activity against a biological target, which we refer to as “focussed” molecule generation. This aim is reasonable as the ultimate goal is to find a method that, given a set of requirements, can produce compounds which fulfil them. Due to the novelty of data-driven molecule generation it is not yet clear, which of the varying published approaches is best suited to tackle this task and how they can be compared in the first place.

A prerequisite for focussed generation of molecules is the one of being able to generate chemically and biologically meaningful molecules at all. We refer to the process of generating molecules without preferred properties as

(10)

“unfocussed” generation. Good methods in this area allow to build up large libraries of molecules. While unfocussed molecule generation may not be directly useful for drug discovery, due to already existing large libraries, we think that good methods for solving this task are important, because they should provide a solid basis for focussed generation models. Therefore we propose an improved version of the method presented in (Segler et al. 2017), which relies on the use of recurrent neural networks.

Another omnipresent problem in the field of generative models is the one of evaluating the quality of generated samples. We therefore introduce a new metric that is able to rate the performance of a model, by comparing the generated samples with a reference set, based on their chemical and biological properties.

In addition to these original contributions we also try to point out prob-lems in existing research, which are still common due to the novelty of the topic at hand.

The content of this study is organized as follows: In Chapter 2 we in-troduce basic machine learning concepts. After that we cover some funda-mentals related to chemistry and drug discovery in Chapter 3. In Chapter 4 we develop an unfocussed molecule generation model using a maximum like-lihood approach. We introduce the Fréchet ChemblNet Distance (FCD), a new evaluation metric for generative models that produce molecules in Chapter 5. Chapter 6 concerns itself with focussed molecule generation. We, however, only review some already published work and present an idea for future work, but do not include any experiments of our own. In Chap-ter 7 we summarize, review and discuss what we have learned in the course of this study.

(11)

Chapter 2 Machine Learning Background

Machine learning (ML) concerns itself with the construction of algorithms that can learn from data. Instead of explicitly coding programs that solve a given problem, a learning method is presented with data and is trained to recognize patterns in it. ML learning methods often can be used to solve problems better, faster and cheaper than humans. Furthermore they can be used to solve methods that are intractable for humans in the first place.

The field of ML has three dominant sub domains, namely:

• Supervised ML: concerns itself with learning from labelled data • Unsupervised ML: is used to learn hidden structures in unlabelled data • Reinforcement Learning: is used to learn from interaction with an

environment

We focus on supervised and reinforcement learning in this chapter as unsu-pervised learning methods play a minor role in this study.

2.1 Supervised learning

Supervised machine learning concerns itself with predicting an output given an input, solely based on given input/output pairs presented during training. A very simple example of a task is an algorithm that learns to differentiate between pictures of cats and dogs on the basis of labelled examples. If the possible outputs are discrete, as in the example above, one speaks of a classification task. If, however, the outputs that have to be predicted are continuous one is dealing with a regression problem. In both of these tasks one is given a limited number, N, of input/output pairs called the training

(12)

set, Z =        z1 z2 z3 ... zN        =        (x1, y1) (x2, y2) (x3, y3) ... (xN, yN)        , (2.1)

where the xi are the inputs and yi their respective outputs. The goal now

lies in training a model to predict the output for a previously unseen input. Stated more formally, we want to find a function f : X → Y that gives correct outputs y ∈ Y when given an input x ∈ X. The whole field of supervised machine learning deals with finding such a function which is not always easy.

In order to find a good f first it is necessary to define what “good” means in this context. To do this we introduce, p(x, y), the probability distribution of training pairs. The quality of f is measured using a non-negative function L(y, f (x)), called loss, which measures the difference in model outputs and the true values of y. We now want to find a mapping f∗ _{which minimizes}

the expected value of L: f∗ = arg min f R = arg min f Z X×Y L(y, f (x)) p(x, y) dxdy. (2.2) Here we introduced the risk, R, as a shorthand for the expected loss. Sadly we usually do not know the exact form of p(x, y) and have to settle for a finite training set, which means that we cannot find the exact form of f∗_.

We can however use the given samples to estimate the risk. In empirical risk minimization one tries to approximate f∗ _{by finding}

f_emp∗ = arg min

f ∈F

Remp = arg min f ∈F 1 |Z| X (x,y)∈Z L(y, f (x)), (2.3) where Z is the training set and F is a set of functions usually called a model class. As we are dealing with a finite training set F had to be introduced to constrain the possibilities for f here, as otherwise one could just choose:

f (x) = (

y if ∃(˜x, ˜y) ∈ Z | ˜x = x,

0 elsewhere, (2.4)

which would give zero empirical risk, but would be quite useless for values of xwe don’t already know about. The problem that occurs when f focusses to much on specific training data points, that means Remp < R, is known

as overfitting and is one of the major problems in machine learning. We do not want our model to just remember training points, much more it should recognize underlying structures and patterns that help to predict outputs for

(13)

2.1. SUPERVISED LEARNING all input values. As the problem stems from the fact that the learned function can remember all the training data, there are two natural approaches to remedy it, namely using more training data or limiting the ability of the model to remember training data, which can be done by forcing f to be less complex. The overfitting problem is often described as the bias-variance trade-off (see Section 2.1.1).

2.1.1 The bias-variance trade-off

As mentioned in the previous section overfitting is a central problem in super-vised machine learning. To get a more formal view on the issue we calculate the expected mean squared error for some arbitrary x0. Let us assume we

have a training set Z sampled from p(x, y), containing a finite number of samples and one additional datum (x0, y) independently sampled from Z,

where x0 is chosen arbitrarily and y is sampled from p(y|x0). If we now

train a prediction function f(.; Z) using the data in Z, we can calculate the squared prediction error for x0:

(y− f(x0; Z))2. (2.5)

As this estimate, based on one sample only, is not of much worth, we calculate its expected value, the expected prediction error:

EPE(x0) = Ey|x0,Z(y − f(x0; Z))

2_.

(2.6) This corresponds to sampling Z and y over and over again to take into ac-count the variation of y for given x0, as well as the variation in the prediction

function for different training sets. Rearranging (2.6) gives: EPE(x0) = Ey|x0 EZ h (y_{− E}Z[f (x0; Z)] + EZ[f (x0; Z)]− f(x0; Z))2 i = Ey|x0 EZ h (y_{− E}Z[f (x0; Z)])2 i + 2 Ey|x0 EZ h y_{− E}Z[f (x0; Z)] EZ[f (x0; Z)]− f(x0; Z) i + Ey|x0 EZ h (EZ[f (x0; Z)]− f(x0; Z))2 i (2.7) If we look more closely at the third line of the above equation we can see that y_{− E}Z[f (x0; Z)] does not depend on Z any more. Thus we can transform

it to 2 Ey|x0 y_{− E}Z[f (x0; Z)] EZ h EZ[f (x0; Z)]− f(x0; Z) i (2.8)

(14)

and see that it vanishes altogether when looking at the last term. The last line of (2.7) does not depend on y any more and is equivalent to the variance of f(x0; Z).

We can again insert some dummy terms to the second line of (2.7): Ey|x0 h (y− EZ[f (x0; Z)])2 i =Ey|x0 h y_{− E[y|x}0] + E[y|x0]− EZ[f (x0; Z)]2 i =Ey|x0 h y− E[y|x0]2 i +E[y|x0]− EZ[f (x0; Z)] 2 =Var(y|x0) + E[y|x0]− EZ[f (x0; Z)] 2 (2.9)

Putting everything together we get, for the expected prediction error: EPE(x0) =Var(y|x0) + E[y|x0]− EZ[f (x0; Z)]2+Var(f(x0, Z)) (2.10)

The first term in 2.10 is called the unavoidable error. It stems from the fact that y is random, even for given x0. The second term is called the bias.

The bias gives information about how close the average prediction is to the expected value of y. It will be large when the chosen model class is not complex enough. Imagine e.g. trying to fit a straight line to a high degree polynomial. There will be value of x0 where the line is nowhere close to

the polynomial. The last term, called the variance, states how much the predictions vary with different training sets. It will be high when the model class is too complex for the given task.

Unfortunately bias and variance are not independent. Minimizing the one, mostly means that one increases the other, hence the term trade-off.

2.1.2 Maximum likelihood estimation

Maximum likelihood estimation is a way of estimating the parameters of a statistical model. This is done by adjusting the parameters of the model in such a way, that the observed data are the most probable according to the model. More formally one can define the likelihood function of a set of given data X:

L(θ; X) = Y

x∈X

p(x; θ). (2.11)

The maximum likelihood estimate, θMLE is defined to be the parameter that

maximizes this function:

θMLE = arg max

θ L(θ; X) =

Y

x∈X

p(x; θ). (2.12) While for some distributions this parameter can be found by an explicit func-tion depending on the data this is often not the case and one usually has to

(15)

2.1. SUPERVISED LEARNING apply numeric optimization techniques (see Section 2.3). As the optimiza-tion of a product of terms is cumbersome one often does not directly try to maximize the likelihood but its natural logarithm, the log-likelihood:

log_{L(θ; X) = log} Y x∈X p(x; θ) ! = X x∈X log p(x; θ). (2.13) This allows us to use estimate the parameters using stochastic gradient tech-niques (see Section 2.3). The maximum likelihood estimate also has some appealing theoretical properties which include, among others, asymptotic consistency, unbiasedness and efficiency. An estimator, ˆθ, is asymptotically consistent if the estimated value converges to the true value with increasing amounts of data:

lim

|X|→∞

ˆ

θ(X) = θ, (2.14)

where θ is the true parameter vector. One should take care not to confuse this property with unbiasedness. An estimator is asymptotically unbiased if:

lim

|X|→∞Ep(x)[ ˆθ(X)] = θ, (2.15)

An estimator is asymptotically efficient if it is (asymptotically) unbiased and the variance of the estimator converges to the Cramér-Rao lower bound (CRLB):

lim

|X|→∞Var

h ˆ_θ(X)i

= I(θ)−1, (2.16) where I(θ) is the Fisher information matrix.

2.1.3 Risk estimation

Another problem that follows from the ignorance of p(x, y) is that of esti-mating the risk. Given a finite number of samples we are able to train a model, but if we calculate the empirical risk,

Remp = 1 |Z| X (x,y)∈Z L(y, f (x; Z)), (2.17) using the same data as used for training we get a biased estimate. The reason for this is that if our model is overfitting it can remember the right outputs for the given inputs and so we are not actually getting any information about the generalization performance of it, but only of its ability to memorize data. To avoid that problem we can split Z into two non-overlapping subsets Ztrain

and Ztest, the training and the test set. The training set is used to train a

model while the test set is used to estimate the risk. Remp = 1 |Ztest| X (x,y)∈Ztest L(y, f (x; Ztrain)), (2.18)

(16)

Figure 2.1: Visualization of an artificial neural network. The yellow nodes represent the inputs, the blue ones the hidden layers and the red ones the output. Image source: Kottas (2017)

The above equation can be used to get an unbiased estimate of the risk. While the test set should not be too small to get a reliable estimate of the risk it should not take too many data away from the training set, as it would interfere with getting a good model in the first place. If only limited data are available one can also use cross-validation, in which one splits the data set into k subsets. Then one goes through those sets, selecting each as the test set once, while training a model on the rest of the data and estimates the risk. In the end one averages the results to get the final risk estimate.

2.2 Artificial neural networks

Artificial neural networks (ANNs) are parameterized model, loosely based on the biological neural networks in animals. As they are very flexible they can be used for many applications in machine learning. Such networks are usually organized in layers which themselves consist of multiple neurons. Neurons of neighbouring layers are connected. This is depicted in Figure 2.1.

An ANN usually transforms an input vector x into intermediary repre-sentations h(i)_{, from which in turn an output vector y is computed.}

Math-ematically this can be expressed as follows: h(0) = x,

h(i) =g(W◦ (i)h(i−1)+ b(i)) for i = 1, . . . , N, y = f (x; θ) = h(N ),

(17)

2.3. OPTIMIZATION OF ANNS where superscript numbers in parentheses refer to layer numbers, h(i) _{∈ R}di

are the intermediary outputs of the single layers also called the activations of the i-th layer, dithe number of units in layer i and g : R → R a non-linear

function applied elementwise to its argument vector (denoted by superscript ◦). W(i) are called weight matrices and the vectors b(i)are referred to as the biases for each layer. We also introduce the parameter vector θ as the concatenation of all W(i) _{and b}(i)_{. If the network parameters (entries of}

W(i) and b(i)) are tuned to the right values such a network can carry out very complex computations. In the next section we take a look at how this can be done.

2.3 Optimization of ANNs

ANNs are usually trained using gradient descent algorithms. Given a training set Z and an ANN with randomly initialized parameters one can calculate the empirical risk:

Remp(θ) = 1 |Z| X (x,y)∈Z L(y, f (x; θ, Z)). (2.20) As Remp depends on the parameters of the network in a continuous way

one can calculate the gradient with respect to the them and optimize model with known gradient descent algorithms. The simplest of those is just to iteratively update the parameters using:

θ _{← θ − α∇}θ

X

(x,y)∈Z

L(y, f (x; θ, Z)), (2.21) where α ∈ R determines the step size and is often called learning rate. However, a plethora of gradient descent algorithms are applicable here, e.g. including a momentum term in the update rule or more advance methods like Adadelta (Zeiler 2012), Adagrad (Duchi, Hazan, and Singer 2011) or the popular ADAM (Kingma and Ba 2014).

Optimization of ANNs in this way is often called backpropagation be-cause the learning signal is propagated back through the network from the deepest layer in contrast to the forward pass used to calculate the outputs.

If the given dataset is very large it might not be computationally practical to use all samples for computing the gradient in each iteration of gradient descent. To circumvent this problem one can estimate the gradient based only on a randomly chosen subset of the training data in each iteration. Such subsets are called mini-batches and this approach is called stochastic gradient descent. Computing the gradient using only one sample per iteration is called online learning. Using all samples in every step is called batch learning. Using mini-batches is established practice in many machine learning tasks

(18)

h(1)0 h (1) 1 h (1) 2 h (1) 3 x0 x1 x2 x3 y0 y1 y2 y3

Figure 2.2: Information flow in a recurrent neural network

as the sizes of the datasets are often enormous and modern, complex neural network architectures run on GPUs often cannot process many samples at a time.

2.4 Recurrent neural networks

While ANNs as described above are very flexible and can be used for lots of tasks they run into problems when the inputs are not of fixed size but sequences with variable lengths, as it is the case in machine translation or sentiment analysis. One of the most popular approaches for such problems is the use of recurrent neural networks (RNNs). RNNs are very similar to ANNs in their structure but instead of only one input vector they are given whole sequences to process and are enhanced by a recurrent connection to enable information flow through time. This is shown in Figure 2.2. More formally this can be written as:

h(0)_t = xt,

h(i)_t =g◦W(i)h(i−1)_t + R(i)h(i)_t−1+ b(i) for i = 1, . . . , N, yt= f ({xt0 : t0 ≤ t}; θ) = h(N )

t ,

(2.22)

where the W(i)_{, R}(i) _{and b}(i) _{are matrices/vectors of tunable parameters.}

The vectors h(i)

t at a time t are commonly referred to as the state of the

network, which contains all stored information about previous inputs. The yt form an output sequence of which, depending on the task, only a part

may be of interest.

2.4.1 Long Short-Term Memory

Unfortunately RNNs with the form described in (2.22) have problems with learning long term dependencies (Hochreiter, Yoshua Bengio, et al. 2001; Y. Bengio, Simard, and Frasconi 1994). To mitigate that problem Long Short-Term Memory (LSTM) networks were created (Hochreiter and Schmidhuber

(19)

2.4. RECURRENT NEURAL NETWORKS

Figure 2.3: LSTM. Image source: Colah (2015)

1997). Instead of the very simple recurrent connections above a much more complicated architecture is introduced:

h(0)_t = xt, ı(l)_t =σ◦ W_i(l) h (l−1) t h(l)_t−1 ! + b(l)_i ! f_t(l) =σ◦ W_f(l) h (l−1) t h(l)_t−1 ! + b(l)_f ! o(l)_t =σ◦ W_o(l) h (l−1) t h(l)_t−1 ! + b(l)_o ! g(l)_t =tanh◦ W_g(l) h (l−1) t h(l)_t−1 ! + b(l)_g ! c(l)_t = f_t(l)_{◦ c}(l)_t−1+ ı(l)_{◦ g}_t(l) h(l)_t = o(l)_t ◦ c(l)t yt = f ({xt0 : t0 ≤ t}; θ) = h(L) t . (2.23)

At a first glance this looks quite intimidating. Let us take a view at Fig-ure 2.3 and try to shed some light on the concept of LSTM. A new idea is the use of the so called gates: it, ft, ot. The elements in those vectors are

squashed to values between 0 and 1 by a sigmoid function. The central part of an LSTM cell is the so called cell state ct, which is the main memory of the

network. At each time step the forget gates decide which information should be forgotten. At the same time a new vector g, full of possibly relevant information is computed. This information is written to the cell state but only after the input gates have decided what actually shall be memorized. Now the output gates decide which information in the cell state should be

(20)

passed on to the output. One can see that the cell state is quite well pro-tected by input and output gates, which has reasons we do not go into here. Figure 2.3 shows only a network with one layer but it is possible to stack more LSTM-cells onto each other and use the output of one as input to an-other. Networks of this kind are often referred to as stacked LSTMs. It is fair to say that LSTMs have played quite are large role in recent advances in many fields, such as machine translation (Sutskever, Vinyals, and Le 2014) or sequence generation (Graves 2013) among others.

The architecture of an LSTM-cell is quite complex, so it is not too easy to see if all the components are actually needed. To find out how differ-ent variants of LSTM perform Greff et al. (2017) performed a large scale study to assess the importance of the single elements and found that minor modifications do not significantly affect performance, but that coupling of input and forget gates, f = 1 − i, simplifies the architecture without loss of performance.

2.4.2 Gated Recurrent Units

Recently a similar cell type called Gated Recurrent Unit (GRU)(Cho et al. 2014), that is motivated by LSTM, emerged. GRUs try to solve the same problem as LSTM but their architecture varies quite a bit from it.

The computations in GRU-networks are the following: Firstly there is the reset gate:

rt= ◦

σ(Wrxt+ Urht−1), (2.24)

where σ is the sigmoid function again and upper set circle denotes elemen-twise application and Wr and Ur are learned (or yet to learn) parameter

matrices. Next once computes a candidate vector: e

ht= ◦

tanh (W xt+ U (r ht−1)) , (2.25)

where denotes an elementwise product. Here one can see where the reset gate got its name from: If some values in r are close to zero, the respective values of ht−1 will be reset for the computation of the candidate vector.

Then one needs to calculate the update gate

z =σ(W◦ zxt+ Uzht−1). (2.26)

Finally, the updated network state is calculated:

ht= z ht−1+ (1− z) eht (2.27)

GRUs have shown performance on par with LSTM networks on many tasks (Jozefowicz, Zaremba, and Sutskever 2015) despite being a bit simpler,

(21)

2.5. REINFORCEMENT LEARNING but as of yet there is no clear evidence that one is actually better over a broad set of tasks.

As there is a lot of arbitrariness in those equations it is not clear from the outset that those networks actually work well, and if yes, that they actually are good designs standing out among similar ones. To get an answer to this question Jozefowicz, Zaremba, and Sutskever (2015) explored many architectures similar to LSTM/GRU and did not find one that consistently outperformed them.

2.5 Reinforcement Learning

Reinforcement learning (RL) concerns itself with the study of how to learn by interacting with an environment. An agent does this by observing how different actions affect future rewards. In contrast to supervised learning where one is given expert knowledge on what to do given an input, RL is complicated by the need for exploration of the environment, that means that an agent has to try out different behaviours to find out which is the best one. We introduce a few important core concepts following the excellent book by (Sutton and Barto 1998), which is recommended for a more thorough treatment of the topic.

RL problems are usually stated in the theoretical frame of Markov De-cision Processes (MDPs). A MDP is defined by the state space S, a set of states that the environment can be in, the action space A, a set of actions that can be performed by the agent and the one-step time dynamics,

p(s0, r|s, a) ˙= Pr(St+1= s0, Rt+1= r|St= s, At= a), (2.28)

the probability of the system to transition to state s0 _{giving reward r from}

state s when choosing action a. We concentrate on episodic tasks only, which end after a finite number of time steps at time T , here as contin-uing problems are not of interest concerning this work. An episode of such a task can be described as a sequence of states, action and rewards S0, A0, R1, . . . , ST, AT, RT. At any time t the natural goal is to maximize

the sum of future rewards, which we will call the return:

Gt = R˙ t+1+ Rt+2+ Rt+3+· · · + RT. (2.29)

We define a policy, π(a|s), as a map from states to actions, which gives the probability of taking action a in state s. To denote “how good” it is to be in a certain state we define the value function with respect to policy π as the expected return following state s, when acting according to policy π:

vπ(s) ˙= Eπ[Gt St= s] = Eπ " _T X t0_=t+1 Rt0 S_t0 = s # , (2.30)

(22)

where Eπ denotes the expected value of a random variable when following

π. Another important concept is that of state-action value functions, which measure how good it is to take a certain action in each state:

qπ(s, a) ˙= Eπ[Gt St= s, At= a] = Eπ " _T X t0_=t+1 Rt0 S_t0 = s, A_t0 = a # (2.31) Our goal can now be described as finding an optimal policy, that means a policy π∗ for which vπ∗(s)≥ vπ(s)for all s ∈ S and all other policies π. Now

that we have established the basic terminology we can actually dive into a few ways how to do this, while exploring some new concepts.

2.5.1 Monte Carlo Methods

A rather straight forward way to find a good policy is to start with a ran-dom policy π and follow it for one episode and estimate qπ(s, a) from the

observed returns. Then one acts based on the greedy policy with respect to the current values of qπ(s, a), that means π(a|s) = arg maxaq(s, a). For

simple problems these estimates can be stored in tables, while complex prob-lems make necessary the use of function approximation (see Section 2.5.2) Then one creates another episode using the updated policy and continues these steps iteratively. One should use random actions from time to time to guarantee exploration.

2.5.2 Function Approximation

When the state or action space is big one can no longer store action values in tables. The game of Go, for example, has around 2 · 10170 _{valid states}

(Tromp 2005), much more than the estimated number of atoms in the ob-servable universe, so there is no way we can just store all the action values in a table. For solving tasks with large state spaces, one usually relies on function approximation techniques. That means one has to train a function to approximate the action-values ˆq(s, a; θ) by adjusting its parameter vector θ.

2.5.3 Policy gradients

All of the approaches above rely on learning action-values from which policies can be derived. Policy gradient methods instead directly use a parameterized function π(a|s; θ) and try to improve it. The question that arises is, what the update rule for the parameters should be. Intuitively we want to encourage actions that yielded high reward and avoid actions that give a poor one. This idea is formalized in the REINFORCE rule (Williams 1992):

(23)

2.5. REINFORCEMENT LEARNING Using this method is much more natural for tasks with stochastic optimal policy. To reduce variance in the parameter updates one can also estimate a value function and subtract it as a baseline:

θt+1 = θt+ α(Gt− ˆv(St; w))∇θlog π(At|St; θ), (2.33)

(24)

(25)

Chapter 3 Life science background

In this chapter we introduce basic chemistry concepts that will be needed to understand the later chapters. We start by describing SMILES a method to describe molecular structures and give an overview over drug discovery.

3.1 SMILES Strings

SMILES, short for “Simplified molecular-input line-entry system”, is a nota-tion for specifying structure formulas of molecules using text strings intro-duced in (Weininger 1988). For practical reasons the term SMILES often also is used to refer to a single or multiple SMILES strings. Normally the meaning should be clear from the context.

SMILES strings are computed from 2D structure formulas commonly used by first breaking cycles and then walking through this tree using depth-first tree traversal, after hydrogen atoms are removed from it. The sites where rings have been cut are marked using numeric labels. In what fol-lows we shortly describe the core features of the SMILES syntax without mentioning the less often used ones. The common abbreviation of chemical elements in square brackets is used to denote Atoms, where the brackets can usually be omitted for the molecules in the organic subset “B, C, N, O, P, S, F, Cl, Br, I”. The characters ”-, =, #, :” are used to denote single, double, triple and aromatic bonds respectively. If no bond is specified between to atoms it is assumed that there is a single bond. Hydrogen atoms can, but are usually not explicitly mentioned. A few simple examples are: C denotes methane(CH4), CC ethane (C2H6), C=C ethene (C2H4) and C#C ethyne (C2H2).

Rings are denoted using numeric indices, 1,2,3,... etc. For example cy-clohexane (C6H12) is denoted as C1CCCCC1, Benzene (C6H6) can be denoted,

using the aromatic bond symbol :, as C:1:C:C:C:C:C1 or using a lower case c as a shorthand: c1ccccc1. The lower case letters b, n, o, p, s can be used in the same way. Branching is denoted using parentheses. CC(O)=O, for

(26)

Figure 3.1: Comparison of forward and reverse pharmacology. Image source: Atanasov et al. (2015)

example, describes acetic acid. The opening parenthesis shows that a branch is splitting off and the closing one signals the end of this branch. Both the first part of the substring inside the parentheses, as well as the substring behind it is bound to the last atom before them. Those basic rules are just a subset of a more complex grammar, but should be enough to understand what follows in the later chapters.

3.2 Drug discovery

Drug discovery is the process of finding new potential medications. Over the past centuries this procedure has evolved substantially. Early drug discovery was based on traditional medications and on discovery by luck. As time went on this rather unorganized approach has been replaced by research carried out in large, well funded laboratories.

The beginnings of drug discovery lay in an approach called forward pharmacology. Known substances have been tested for biological effects in vivo/vitro without being aware what their mechanism of action is or if there is any at all. Only after a compound has shown an effect, expenditures are made to try to find out how it actually works in an organism or a cell.

In recent times, however, this process was somewhat turned upside down. In an approach called reverse pharmacology not the compounds but biologi-cal targets are the point of departure, which are chosen based on knowledge

(27)

3.2. DRUG DISCOVERY

Figure 3.2: Drug discovery timeline. Image source: Boghog (2015)

about biological pathways. Then known molecules are tested for activity against the target. This method is the prevailing one in today’s drug dis-covery (Swinney and Anthony 2011). A major driver towards this trend was the emergence of high throughput screening in which millions of chemical compounds can be tested for activity against a target at an affordable cost. Figure 3.2 shows the usual drug discovery cycle. A compound collection is tested in primary assays to check if they have an effect on the target. The hits are then tested in secondary assays, also called cross-screenings, for other important properties. Compounds that are most suitable according to the obtained results are further investigated and are called leads. As it is unlikely that the leads are already suitable for use as a drug, one tries to further optimize their characteristics. Among others, some important qualities of a drug are, how strong it binds to a target, its ADME-properties (absorption, distribution, metabolism, excretion), side effects and its toxicological profile. After changes have been made to the leads this cycle continues until a clinical candidate emerges.

Figure 3.3 shows the usual timeline for the discovery and development of one approved drug. At the start thousands of compounds are tested in high throughput screenings and undergo the process described above. Then a few hundred molecules are tested in preclinical trials which involves both in vitro, as well as in vivo tests. A much criticized aspect of this part of development is animal testing. These tests are used to determine if it is safe for the compound to proceed to clinical trials, i.e. testing in humans. As one can imagine this is a very lengthy, expensive and financially risky process.

(28)

10,000 COMPOUNDS

250

COMPOUNDS 5 COMPOUNDS APPROVED 1 FDA DRUG

~6.5 YEARS ~7 YEARS ~1.5 YEARS

DRUG DISCOVERY

PRECLINICAL

CLINICAL TRIALS FDA REVIEW

Drug Discovery &

Development-Timeline

Figure 3.3: Drug discovery timeline. Image source: „Drug Discovery and Development“ (2011)

3.2.1 Drug design

The process of drug design, or more specifically rational drug design, is di-vided into two main areas: structure-based design and ligand-based design (H.-J. Huang et al. 2010). Ligand-based design can be used when the struc-ture of the target is unknown, but information about the binding activity of ligands is known. One can then use statistical methods to identify pharma-cophores that need to be present in a compound for it to be active against a target.

Knowledge of the structure of the target,however, makes possible the use of structure-based design techniques. This information enables a researcher to check if known molecules fit into the binding pocket of a target in a process called screening and to improve them further, if possibly suitable compounds are found. Another approach is de novo design, in which the three-dimensional structure is used to design new compounds from scratch. In-silico techniques play an important role in drug design, as they allow to virtually screen large compound libraries and to calculate binding affinities at low acceptable costs.

(29)

Chapter 4 Unfocussed molecule

generation

Segler et al. (2017) have shown that RNNs are able to create new molecules using their SMILES representation. The goal we want to achieve in this chapter is to train a generative model that can produce new SMILES-strings by exposing it to a set of molecules from ChEMBL (Bento et al. 2014). We improve existing methods in the areas of data preprocessing and training procedure. We include an analysis how the hyperparameter choices affect the model performance. We first review already existing results, then introduce the methods we use and then state our experimental results, including a comparison between our model and other generative models. The evaluation results show that the model we trained outperforms the one of Segler et al. (2017) in generating a molecule set that resembles ChEMBL molecules.

4.1 Introduction

Drug design is a notoriously hard problem, as drugs have to fulfil a multitude of requirements to be suitable for medical use. A common way to find new drugs is to synthesize possibly appropriate molecules and test them in-vitro in high-throughput screenings. These screenings are however limited in size to about a million compounds (G. Schneider and Baringhaus 2008) and ex-pensive. In order not to waste resources on testing compounds that are not likely to be developed into drugs one can pre-select promising molecules in-silico in a process called virtual screening. To build up the libraries of virtual compounds hitherto two main techniques have been used, the first of which is building up molecules from handcoded groups of atoms, and the latter be-ing the use of virtual chemical reactions to “synthesize” compounds digitally with the hope that these reactions are also possible in the lab (Hartenfeller and G. Schneider 2011). However, recently there has been increased interest

(30)

in new generation methods involving generative machine learning methods (Segler et al. 2017; Olivecrona et al. 2017; Jaques et al. 2016; Benhenda 2017; Gómez-Bombarelli et al. 2018; Guimaraes et al. 2017). In this chapter we want to add to those advances and answer questions that were not suffi-ciently treated hitherto. Here we confine our efforts on a maximum-likelihood approach and do not treat training models using transfer, reinforcement or other kinds of learning, which are mentioned in Chapter 6. In the following sections we highlight some problems present in existing research, propose our own model and investigate how different model architectures influence the performance and how to rate the performance of such a model in the first place.

4.2 Previous methods

Gómez-Bombarelli et al. (2018) trained a variational auto-encoder to gener-ate new molecules. The rgener-ate of genergener-ated valid SMILES-strings is only 4–70 % which is quite low compared to our results (cf. Section 4.4.1) and other work (Segler et al. 2017; Olivecrona et al. 2017). This wide range probably has to do with the choice of the latent representation from which decoding starts.

Segler et al. (2017) used an RNN to predict the next characters of SMILES to generate new ones and came to the conclusion that the generated molecules are similar to the ones presented in training. While we think the parts of the this work relating to chemistry are well done and very extensive, there are a few things that could be improved. Firstly, the SMILES could have been parsed into atoms of the SMILES-grammar. Furthermore a hy-perparameter search could have been done to check which network settings work the best. Lastly, and most importantly, we think they did not train their RNN properly. They used truncated backpropagation through time, unrolling their network for only 64 time steps, which causes problems for finite sequences (see Section 4.5).

Jaques et al. (2016) also used a similar approach, but only 30% of the created molecules turned out to be valid SMILES. This might be caused by the use of a fairly small network (1 layer LSTM with 100 cells), but a similar model that we trained (see. 4.4.1) reaches much higher performance and we do not know the cause of this difference.

Olivecrona et al. (2017) also trained an RNN (3 layers with 1024 GRUs each), to generate new SMILES strings. 94% of the molecules generated by his network turned out to be valid. However he filtered the training set to only contain molecules consisting of 10 to 50 heavy atoms, made up only by atoms ∈ {H, B, C, N, O, F, Si, P, S, Cl, Br, I}, which somewhat simplifies the problem. While this work is very interesting, mistakes are abundant in

(31)

4.3. METHODS this study (see Section 6.1.3).

4.3 Methods

In this section we describe our approach to training a generative model that is able to create new molecules in the form of SMILES-strings by feeding them into an RNN one by one and trying to predict the next character based on those previously seen. As a lot of grammatical atoms in the SMILES syn-tax are represented by multiple characters (e.g. Cl chlorine), presenting raw SMILES-strings to the RNN adds additional complexity to the learning task as the network cannot differentiate between the C in a chlorine atom or one that stands for a carbon atom. This suggests that assigning a numerical token to each grammatical atom and translating raw SMILES-strings to se-quences of tokens might be beneficial. This can be easily achieved by using a few simple regular expressions, as the SMILES-grammar is not very com-plex. A special EOS token is added to these sequences to signal the end of a sequence and is needed so that the model is able to end sequences. Each se-quence will be represented as a sese-quence of integers, M = (x1, x2, x3, . . . , xT)

with xi ∈ {1, . . . , number of tokens}, where M is used to refer to a sequence,

T is its length and xT = EOSfor all sequences.

Our goal is to learn to learn the probability distribution of those molecules. When using the SMILES-syntax, the true probability distribution p(M) of molecules M can be written down in product form as follows:

p(M ) =

T

Y

t=1

p (xt|x1, . . . , xt−1) , (4.1)

where (x1, . . . , xt−1) is empty if t = 1. As the probability of each element

of the sequence depends on all the previous ones we use an RNN to obtain a parameterized model that can reflect this fact. If we are given a set M = {M1, . . . , MN} of molecules Mi = (xi₁, . . . , xi_{|M |})and denote the parameters

of the RNN with θ, the likelihood of the dataset depending on the model parameters is: L(θ; M) = |M| Y i=1 p(Mi; θ) = |M| Y i=1 |Mi| Y t=1 p xi_t|xi 1, . . . , xit−1; θ . (4.2)

Taking the logarithm of this expression yields the log-likelihood: log_{L(θ; M) =} |M| X i=1 |Mi| X t=1 log p xi t|xi1, . . . , xit−1; θ . (4.3)

The negative log-likelihood is equal to the cross-entropy loss and we will use the latter term from now on to refer to it.

(32)

s1

s0 s2 s3 s4 s5 s6 s7 s8

GO c 1 c c c c c 1

c 1 c c c c c 1 EOS

Figure 4.1: Training phase of next character prediction RNN. The input sequence (benzene) is ten symbols long. The first nine symbols are used as inputs (blue at the bottom), the last nine as targets (red at the top).

To train the network we feed the input-sequence, (x1, . . . , xt−1), into the

network as one-hot encoded vectors one at a time, which then returns an output vector yt at each time step. The outputs are used to compute a

probability distribution over possible next tokens using a softmax-function: p (xt|x1, . . . , xt−1; θ) = e[yt]j N P k e[yt]k , (4.4)

where [yt]k is the k-th component of yt. It is apparent from (4.3) that the

log-likelihood is a nested sum, of which one ranges over the molecules in the dataset and the other one over the elements in the molecule sequences. When implementing a learning algorithm one usually does not optimize the sum of those terms but their average. One could easily make the mistake of first averaging the loss across single sequences, which would result in

logL(θ; M) = 1 |M| |M| X i=1 1 |Mi| |Mi| X t=1 log p xi_t|xi 1, . . . , xit−1; θ , (4.5)

which is not proportional to the log-likelihood due to differing sequence lengths. Instead, if one wants to use a normalized loss, one should rather use log_{L(θ; M) =} 1 |M| P i=1|M i| |M| X i=1 |Mi| X t=1 log p xi_t_|xi₁, . . . , xi_t−1; θ . (4.6)

While it is not imperative to do this one should take care which quantity one is actually optimizing.

To simplify the implementation we added a GO-token at the beginning of each sequence (see Figure 4.1). When training of the network has finished one can generate new sequences by feeding this GO-token into the network and randomly sample the next token from the probability distribution obtained as output from the model and feed it back into the network to get the probability distribution for the next token and so forth. Generation stops once the model samples and EOS-token.

(33)

4.3. METHODS

4.3.1 Evaluation

Evaluation of generative models, in general, is a hard problem. Given a set of samples generated by a model and one from the real world, there is no straight forward path to assess the quality of the artificial set.

Using a maximum likelihood approach, as is the case in this study, facil-itates comparison of models, as there is a scalar measure, the likelihood, of how good a model is at learning the data distribution. But in our specific problem the likelihood on its own is not necessarily a good quality measure, because one wrongly predicted token might render a molecule syntactically invalid while not having a strong impact on the likelihood. Combined with the fraction of valid SMILES the likelihood can be used to compare models. This method of comparing models, is not satisfactory as not all generative models maximize likelihood and even if this were the case varying modelling approaches make it impossible to assess performance differences by compar-ing the likelihood.

The fraction of syntactically valid molecules is also an important quality measure for the model at hand. While it is necessary to have a generator that is able to learn the SMILES-syntax, having a high fraction of valid molecules is in no way sufficient to rate the quality of a model, as one could simply sample trivial molecules like CCC over and over again, which is not expedient. Furthermore a model with a low fraction of valid generated molecules might still produce interesting molecules. If the molecule generation is computa-tionally expensive low efficiency of generating molecules might be a problem. One can also compare the distributions of SMILES-lengths of the gen-erated samples with that of real world molecules. Matching distributions are again only a necessary constraint as the samples might again be trivial molecules (e.g. CC...CC) with lengths chosen in such a way that the length distributions are matched.

Another way one might compare the sets is by comparing molecular properties that are computable by cheminformatics software, such as sol-ubility (Olivecrona et al. 2017), number of rotational bonds (Olivecrona et al. 2017), number of aromatic rings (Olivecrona et al. 2017), synthetic ac-cessibility (Guimaraes et al. 2017; Jaques et al. 2016; Yang et al. 2017) and druglikeness (Guimaraes et al. 2017; Jaques et al. 2016). Comparisons of distributions are more descriptive than the mere statement of mean and standard deviation.

As mentioned there is no objective and universal criterion to measure the performance of a generative model one will always have to use a heuristic. Inspired by recent work in quality assessment of generative models for im-ages (Heusel et al. 2017) we propose a new evaluation method, the Fréchet ChemblNet Distance(FCD) in Chapter 5 and use it to evaluate our model.

(34)

4.4 Experiments

We downloaded the chemical representations of all the molecules listed in the ChEMBL database from the EBI website and extracted the SMILES-strings from them and canonicalized them using rdkit (RDKit 2006). 59 out of the 1727113 compounds could not be converted to rdkit-molecules and were not included in the training set.

Our next step was to parse all the SMILES strings into grammatical atoms and create a numerical encoding. This results in a sequence of integers for every SMILES-string, each starting with the GO-token and ending with the EOS-token. As most of the the SMILES-strings are not too long (see Figure 4.7) we decided to discard compounds with a length longer than 200, to avoid running into memory problems and reduce runtime. Before training we split the molecules in a training and test set. All reported cross-entropy values were calculated using the test set. We did not see any signs of overfitting when looking at the results which probably has to do with the size of the training set.

4.4.1 Hyperparameter search

To determine good hyperparameter settings we performed a grid search using following parameters:

• RNN-cell type: GRU, LSTM • number of hidden layers: 1, 2, 3

• number of hidden units: 100, 300, 600, 1000

While grid search is not necessarily the most efficient method to find good hyperparameters it gives comparisons of models where only one hyperparam-eter was changed, which can give a better picture of the importance of the single parameters. GRU and LSTM networks are widely used so we included the two different cell types into the hyperparameter search.

In Figure 4.2 it can be seen that at the beginning of training networks using GRUs outperform the LSTM networks. But over time this relationship is inverted and the LSTMs have the upper hand at the and of the day (which in this case is after 50000 weight updates). This pattern can be seen as well if two or three hidden layers are used. One must note however that LSTMs have more parameters and take longer to train, when using the same number of hidden units. GRUs are unstable when using 1000 hidden units as can be seen in Figure 4.3. The performance collapses for unknown reasons. The performance depending on all of the hyperparameters is shown in Figure 4.4. One can see that the best networks use LSTM cells. An interesting thing to note is that for 600 and 1000 hidden units the 2-layer LSTM networks

(35)

4.4. EXPERIMENTS 0.6 0.8 1 Cross en trop y

100 Hidden units 300 Hidden units

0 1 2 3 4 5 ·104 0.6 0.8 1 Steps Cross en trop y 600 Hidden units 0 1 2 3 4 5 ·104 Steps 1000 Hidden units GRU LSTM

Figure 4.2: Comparison between LSTM and GRU networks. It can be seen that GRU are able to learn faster at the beginning, but are ultimately over-taken by the LSTM networks.

0 2 4 ·104 0.6 0.8 1 Steps Cross en trop y 1 layer 0 2 4 ·104 Steps 2 layers 0 2 4 ·104 Steps 3 layers GRU LSTM

Figure 4.3: When using 1000 hidden units per layer with GRUs the models diverge for unknown reasons.

(36)

1 2 3 0.5 0.6 0.7 0.8 Number of layers Cross en trop y 1 2 3 0.2 0.4 0.6 0.8 1 Number of layers F raction valid GRU LSTM

Figure 4.4: Cross entropy error depending on hyperparameters. The number of hidden units is encoded by the size of the circles. From small to big the sizes correspond to 100, 300, 600, 1000 hidden units per layer. LSTM networks perform better than GRU networks.

perform better than those with 3 layers. This is not the case when using 100 or 300 units per layer. When using fewer units additional layers might be beneficial to performance as adjustable parameters are scarce. But when using more hidden units this gain might be overshadowed by information loss throughout the layers. This suggests that using three layers or more might not make sense.

A model optimized for low cross entropy does not necessarily have good performance in terms of the percentage of valid generated smiles. As the grammar rules are rather strict and even one wrongly predicted token can render a whole sampled string invalid. For example networks with few pa-rameters might concentrate on learning easy patterns and neglect correct bracket matching or ring opening and closing.

To see what the relation between cross entropy error and percentage of valid SMILES is, we generated 25600 compounds with each model and plotted the percentage of valid SMILES against the cross entropy error in Figure 4.5. One can see that all of the trained models, except for the crashed GRU-networks, are arranged along a narrow path, which means that cross entropy and the fraction of valid molecules are highly correlated. Therefore it makes sense to improve the cross entropy loss if one wants to improve the fraction of valid SMILES.

Another metric is the FCD (see Chapter 5). In Figure 4.6 the FCD is shown for all of the tested hyperparameter settings. On the left it can be seen that the model with the lowest FCD has only one layer. The right figure shows that cross entropy and FCD are correlated. This means that optimiz-ing for cross entropy also has an effect on the FCD. It is quite surprisoptimiz-ing that

(37)

4.4. EXPERIMENTS 0.5 0.6 0.7 0.8 0.2 0.4 0.6 0.8 1 Cross entropy % valid GRU LSTM

Figure 4.5: Cross entropy plotted against percentage of valid SMILES. From small to big the size of the markers corresponds to 100, 300, 600, 1000 hidden units per layer. Lower cross entropy implies high fraction of valid molecules.

1 2 3 0.316 1 Number of layers F CD 0.5 0.6 0.7 0.8 Cross entropy GRU LSTM

Figure 4.6: FCD scores for different hyperparameters. From small to big the size of the markers corresponds to 100, 300, 600, 1000 hidden units per layer.

(38)

the model with lowest FCD has only one layer, the one with lowest cross en-tropy two layers and the one with the highest fraction of valid smiles three layers. It seems that stacking more layers somehow hinders information flow and that this has more effect on the FCD than the cross entropy. If hindered information flow through the layers is really the problem here it might be beneficial to use skip-connections (Graves 2013).

4.4.2 Comparison to other generative models

Here we want to compare the model we trained to those presented in already published work. To do this we obtained sets of generated molecules from multiple authors. The sets we compare here are the following:

• smilesLSTM+: The model proposed in this study

• SeglerLSTM: A next-character prediction LSTM (Segler et al. 2017) • Olivecrona canon agent: Reinforcement learning agent trained to

pro-duce molecules active against the DRD2-receptor (Olivecrona et al. 2017)

• Olivecrona reduced agent: Same as above but some samples were with-held during training.

• Benhenda ORGAN 30: A generative adversarial network introduced in (Benhenda 2017) trained for 30 iterations. This network is trained to produce molecules active against the DRD2-receptor.

• Benhenda ORGAN 60: Same as above but trained for 60 iterations. • Benhenda RL 40: A reinforcement learning agent also introduced in

(Benhenda 2017) trained for 40 iterations. Also trained to produce molecules active against the DRD2-receptor.

• Benhenda RL 60: Same as above but trained for 60 iterations.

Some statistics of the sets mentioned above are shown in Table 4.1, which features the following columns:

• Samples: The size of the generated set of molecules. • Samples valid: Number of valid molecules in the set. • % valid: Percentage of the created SMILES that were valid

• Average length: Average length of the valid SMILES after they have been canonicalized.

(39)

4.4. EXPERIMENTS Samples Samples

valid % valid Averagelength molecules% unique Biased smilesLSTM+ 716794 650808 90.79 47.30 97.99 No

SeglerLSTM 454700 454080 99.86 43.72 99.87 No

Olivecrona

canon agent 128000 127210 99.38 37.59 41.27 Yes Olivecrona

reduced agent 128000 127100 99.30 34.75 39.53 Yes Benhenda Organ 30 32000 13638 42.62 24.10 88.33 Yes Benhenda Organ 60 32000 19354 60.48 25.79 61.47 Yes Benhenda RL 40 32000 10932 34.16 31.96 98.30 Yes Benhenda RL 60 32000 17157 53.62 31.64 94.21 Yes

Table 4.1: Generative models

• Biased: “Yes” if the model was trained to produce molecules with spe-cific properties, “No” if it was trained to reproduce molecules present in some large set such as ChEMBL, PubChem or Zinc.

The fact that most of the sets mentioned above are trained to produce actives against target, instead of trying to create unspecific molecules, limits the significance of some comparisons that follow, but we include them nev-ertheless. We refer to these sets as being “biased”. As we based our work on that of (Segler et al. 2017) we chose the training objectives to be the same to facilitate comparison.

One can see large differences between the compared sets from Table 4.1 alone. It can be seen that the fraction of valid molecules ranges from about 43 to nearly a hundred per cent. The mean length of the SMILES strings also varies significantly. Another interesting fact is that some models repeat themselves quite a lot resulting in a low fraction of unique molecules in the generated sets. This again might be due to the fact that they are biased. Length distribution of sampled molecules

An obvious way to check if the generated molecules resemble the training data is to look at the length distribution of the created SMILES strings. In Figure 4.7 the length distributions of SMILES generated by smilesLSTM+and by the one trained in Segler et al. 2017 are depicted. In each of the two plots the distributions of five sets of molecules are depicted:

(40)

• all: the entirety of generated SMILES

• valid: the generated SMILES that were valid according to rdkit • invalid: the generated SMILES that were invalid according to rdkit • canonical: canonical SMILES obtained from valid SMILES by

canoni-calization using rdkit

One can see that the model we trained matches the length distributions of the training set quite well. The lines for the valid (red) and canonicalized (orange) molecules overlap very strongly. The green line indicates that long SMILES are more likely to be invalid, as the distribution is shifted to the right. Due to the small fraction of invalid molecules the “all” line is not very different from the “valid” line. One can see that in the lower part of the figure, that the model of Segler et al. (2017) does not perform as well. The distribution of the canonicalized SMILES does not match the one of the ChEMBL compounds very well. It is shifted to the left quite strongly. Possible reasons for this are discussed in Section 4.5. Again longer SMILES are more likely to be invalid. Another interesting thing one can notice is that canonicalization has a much bigger impact on the length distribution here than above. On average the valid molecules are shorter after canoni-calization. This might be caused by the fact that this model was trained on SMILES canonicalized using CDK (Steinbeck et al. 2003) instead of rdkit. FCD of sampled molecules

We also compared all of the models using the Fréchet ChemblNet Distance (FCD), which will be introduced in detail in Chapter 5. For now it suffices to say that the FCD can be used to measure the difference between two sets of molecules by comparing the activations of a neural network that predicts bioactivities. The FCDs between the generated molecules and the ones present in ChEMBL/PubChem/Zinc are shown in Table 4.2. One can see that smilesLSTM+ clearly outperforms SeglerLSTM in the task it was trained on, which is to create molecules similar to the ones in ChEMBL. The distances to the rest of the generated sets, is way larger, which could be due to the fact that they are biased. The distances to PubChem and Zinc molecules are shown for completeness sake and to see if there are any striking differences as opposed to ChEMBL. All distances are calculated after canonicalizing all sets using rdkit.

4.5 Differences to Segler et al.’s approach

In this section we want to describe in detail the differences of my approach to the one presented in (Segler et al. 2017). My model is heavily inspired

(41)

4.5. DIFFERENCES TO SEGLER ET AL.’S APPROACH 0 0.02 0.04 Densit y

smilesLSTM+ ChEMBL valid

invalid canonical all 0 20 40 60 80 100 120 140 0 0.02 0.04 SMILES length Densit y

SeglerLSTM ChEMBL valid

invalid canonical

all

Figure 4.7: Length distribution of generated SMILES.

smilesLSTM+ SeglerLSTM Olivecrona canon agent Olivecrona reduced agent Benhenda Organ 30 Benhenda Organ 60 Benhenda RL 40 BenhendaRL 60 ChEMBL 0.30 2.74 59.83 66.24 56.06 93.74 91.53 124.90 PubChem 3.58 3.06 65.36 73.03 61.47 99.74 98.55 132.19 Zinc 4.87 5.37 55.18 61.50 58.27 93.50 87.52 120.67

Table 4.2: FCDs between molecules found in ChEMBL/PubChem/Zinc and the ones generated by various generative models.

(42)

by theirs but we added, what we think are, several improvements. Firstly we changed the simple, character based one-hot encoding, by parsing multi-character grammatical SMILES atoms into single tokens. For example, Br is presented to the network as one token instead of two. This removes some complexity of the learning task. Encoding chlorine as a C and l token might confuse the network as it cannot tell whether the C represents a carbon or a chlorine atom.

Secondly we did a hyperparameter search. Though it turned out Segler et al. chose a good configuration in terms of number of layers, hidden units per layer and RNN cell type this was in no way obvious.

I think the reason that affects the differences in length distribution and FCD the most is that they used truncated backpropagation. In this approach the input file is a regular text file with one SMILE in each line. The file is processed 64-characters at a time, without regard for molecule boundaries (line breaks). For training, those 64-character strings are then fed to the net-work in minibatches, where the newline characters (\n) denote the beginning of a new molecule. One of those strings might cover multiple molecules or conversely one molecule might be spread over multiple of those input strings. This leads to multiple issues: In some cases there is too much information encoded in the cell state of the network, which happens when more than one molecule are within an unrolled window. Let us assume that there are parts of two molecules are covered by a window and call those parts A and B, where A is presented to the network before B. In the forward pass in-formation about A encoded in the cell state of the RNN is available, when trying to predict tokens of B. This is unwanted behaviour as decisions should only be made using relevant information. Something similar happens in the backward pass while learning: The cell states that are used to predict tokens of A also receives information about B. Again this irrelevant information during optimization leads can have a negative impact on learning, because the network tries to find regularities in the presented data. If the this data is irrelevant however effort is wasted.

While in these cases too much information is present the opposite can also happen. SMILES-strings longer than 64-characters are necessarily cut into multiple strings before being fed into the network. This means that multiple backward passes are necessary to process one molecule in such a case. If preceding parts of a molecule are not processed in the same backward pass, information that might be relevant for predictions is not present. For example, a ring or a branch could have started in another window and the network has no way of knowing that it should end it.

Another fault that comes with this implementation is that the states are never reset when sampling. It might be that the \n-character,used to denote the beginning of a new molecule, might take some responsibility over doing this, but it is not the cleanest method.

(43)

4.6. CONCLUSIONS

4.6 Conclusions

We have reproduced the result that RNNs are capable of creating syntacti-cally correct SMILES strings. We studied the influence of different hyper-parameters using multiple evaluation methods including a new one, namely the Fréchet ChemblNet Distance. We found that LSTM-networks outper-formed GRU-networks in fraction of valid generated SMILES, cross-entropy and FCD. When looking at these three metrics, we noticed that the best models always had 1000 hidden units per layer, but that the best fraction of valid SMILES was obtained by a network with three layers, the lowest cross-entropy by one with two and the lowest FCD by one with only one layer. To the best of our knowledge smilesLSTM+ outperforms other models when trying to generate a set of molecules that resembles the ones found in the ChEMBL database, when using the FCD as a quality measure.

(44)

(45)

Chapter 5 Fréchet ChemblNet Distance

In this chapter we introduce a new evaluation metric that can be used to compare the quality of generative models for molecules. We propose an ap-proach similar to the one used in (Heusel et al. 2017) to rate the performance of image generating models. Our new method, the Fréchet ChemblNet Dis-tance (FCD) is based on comparing the distributions of activations of a neural network that is trained to predict bioactivities. We show that the FCD can be used to detect between sets of molecules.

5.1 Introduction

Recently multiple machine learning methods have been put forward that are able to generate novel molecules. To this day, however, there is no established criterion that is used consistently in publications to compare the performance of those generative models. The increased interest in this field necessitates a common evaluation criterion to allow comparative research.

The Fréchet Inception Distance (FID)(Heusel et al. 2017) is used to rate the performance of image generating models. In this chapter we try to es-tablish an analogous method for evaluating generative models for molecules. The core idea of the FID is to compare the distributions of real and artificial samples, but there is a big problem with this approach: The distribution of an image set in terms of single pixels is very complex and it is hardly possible to get a meaningful comparison in this domain. To overcome this problem the FID uses the Inception network (Salimans et al. 2016), a discriminative network trained on the ImageNet dataset (Russakovsky et al. 2014). In the process of learning to classify images this model has learned how to map the low-level features, such as pixel values and edges, to high-level concepts that can signal the presence of, for example, eyes, persons or dogs. While the low-level features are encoded in the first layers of the network the high-level features are represented in the deeper layers (Gatys, Ecker, and Bethge

Data Driven Molecule Generation Using Deep Learning / submitted by Philipp Renz

Data Driven Molecule

Generation Using Deep

Learning

Master Thesis

Master of Science

Bioinformatics

Contents

Chapter 1

Introduction

Chapter 2

Machine Learning Background

2.1

Supervised learning

2.2

Artificial neural networks

2.3

Optimization of ANNs

2.4

Recurrent neural networks

2.5

Reinforcement Learning

Chapter 3

Life science background

3.1

SMILES Strings

3.2

Drug discovery

Drug Discovery &

Development-Timeline

Chapter 4

Unfocussed molecule

generation

4.1

Introduction

4.2

Previous methods

4.3

Methods

4.4

Experiments

4.5

Differences to Segler et al.’s approach

4.6

Conclusions

Chapter 5

Fréchet ChemblNet Distance

5.1

Introduction