
The parameters (φ, θ) are learned jointly end-to-end by minimizing the reconstruction error with the L2 loss as stated in equation (11), leading to the empirical risk

\[
\mathcal{L}_{AE} = \frac{1}{N} \sum_{i=1}^{N} L_2\!\left(x^{(i)}, f_\theta(g_\phi(x^{(i)}))\right) = \frac{1}{N} \sum_{i=1}^{N} \left(x^{(i)} - f_\theta(g_\phi(x^{(i)}))\right)^2, \tag{33}
\]

which can be minimized with stochastic (mini-batch) gradient descent and backpropagation (see Section 2.2.5.2) as the optimization algorithm, updating the weights of the encoder and decoder networks.
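As a minimal sketch of this setup, assuming a PyTorch implementation with arbitrarily chosen layer sizes, a 32-dimensional latent code and an already available data loader (none of which are specified above), the empirical risk of equation (33) could be minimized as follows:

```python
# Minimal sketch of equation (33): an MLP autoencoder trained with the L2
# (mean-squared-error) loss and stochastic gradient descent. The layer sizes,
# latent dimension and the `loader` object are illustrative assumptions.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))   # g_phi
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))       # f_theta

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()                    # empirical risk L_AE of equation (33)

for x_batch in loader:                      # loader yields mini-batches of x^(i)
    x_hat = model(x_batch)                  # f_theta(g_phi(x^(i)))
    loss = criterion(x_hat, x_batch)        # reconstruction error
    optimizer.zero_grad()
    loss.backward()                         # backpropagation
    optimizer.step()                        # gradient step on (phi, theta)
```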

Once the autoencoder is trained, the latent code is often used for various downstream tasks, for example as input features for a predictive model in supervised learning. For a clustering task, the clustering can also be performed in the latent space. The main idea is that by compressing the feature representation into the latent code, samples might be disentangled and lie within different groups in the encoded latent space.
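Continuing the sketch above, the trained encoder could serve as a feature extractor for such a downstream clustering task; the use of k-means with ten clusters is purely illustrative:

```python
# Sketch of re-using the trained encoder g_phi for clustering in latent space.
# `model` is the trained autoencoder from the previous sketch and `x_all` is
# assumed to be a tensor holding all samples, shape (N, in_dim).
from sklearn.cluster import KMeans
import torch

with torch.no_grad():
    z = model.encoder(x_all)                     # latent codes, shape (N, latent_dim)

cluster_labels = KMeans(n_clusters=10).fit_predict(z.numpy())
```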

Since many variants of autoencoders exist and are a field of active research, a summary of the fundamental variants and extensions is provided by Weng (2018) and Goodfellow et al. (2016).

2.3.1 Translation Model to Learn Molecular Descriptors

The objective of Winter et al. (2018) is to learn informative molecular descriptors (see Section 2.1) from low-level molecular encodings such as SMILES or InCHI. In contrast to the basic idea of autoencoders, whose purpose is to reconstruct their input, Winter et al. (2018) borrow ideas from neural machine translation [the Seq2Seq model by Sutskever et al. (2014), originally used to translate between English and French text]: the model translates between two semantically equivalent but syntactically different representations of molecular structures, compressing the meaningful information of both representations into a low-dimensional code vector, called cddd (Continuous Data-Driven Descriptor).

For example, one possible translation model would receive as input an InCHI representation of a compound, encode it into the latent space, which is the desired molecular descriptor, and then decode that molecular descriptor to the canonical SMILES representation of the respective compound, as displayed in Figure 29.

Figure 29: General architecture of a translation model using the example of translating between the InCHI and SMILES representation of 1,3-Benzodioxole. Source: Winter et al. (2018)

The translation model was trained on a large dataset of approximately 72 million compounds. Since the translation model works with sequential data, sequences are tokenized into one-hot vector representations as illustrated earlier in Figure 25. A lookup table is defined for both vocabularies: the SMILES vocabulary consists of 38 unique characters and the InCHI vocabulary of 28 unique characters. The translation model itself comprises two neural networks, as shown in Figure 29. For implementation details and network architectures, please refer to the supplementary information (SI) of Winter et al. (2018).
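A hypothetical sketch of this tokenization step is given below; the vocabulary shown is a truncated, illustrative subset and not the actual 38-token SMILES vocabulary used by Winter et al. (2018):

```python
# Sketch of the tokenization step: characters of a SMILES string are mapped to
# indices via a lookup table and then one-hot encoded (cf. Figure 25).
import torch

smiles_vocab = ["<pad>", "<start>", "<end>", "C", "c", "O", "N", "1", "2", "(", ")", "="]
char2idx = {ch: i for i, ch in enumerate(smiles_vocab)}     # lookup table

def one_hot_encode(smiles: str) -> torch.Tensor:
    indices = [char2idx[ch] for ch in smiles]
    return torch.nn.functional.one_hot(
        torch.tensor(indices), num_classes=len(smiles_vocab)
    ).float()                                               # shape: (seq_len, vocab_size)

x = one_hot_encode("c1ccccc1O")   # phenol, encoded character by character
```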

Once the translation model is trained, the molecular descriptor can be extracted for any compound and utilized for several downstream tasks, such as predictive modeling in quantitative structure-activity relationship (QSAR) tasks. Since the goal was to learn good feature representations for such downstream tasks, Winter et al. extended the translation model with an additional predictive model forecasting nine continuous molecular properties, a ∈ R⁹, which contributes to the overall loss function. By including this additional model, the translation model is forced to learn meaningful continuous representations.

The predictive (regression) model is a three-layer fully connected neural network dη that takes the molecular descriptor cddd as input and outputs a molecular property vector of dimension nine. It is trained simultaneously with the translation model.
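A possible sketch of such a network dη is shown below; only the input dimension (the 512-dimensional cddd) and the output dimension (nine properties) are fixed by the text, while the hidden layer sizes and activations are assumptions:

```python
# Sketch of the property-prediction network d_eta: three fully-connected layers
# mapping the cddd to a nine-dimensional property vector a. Hidden sizes and
# activations are assumed, not taken from Winter et al. (2018).
import torch.nn as nn

d_eta = nn.Sequential(
    nn.Linear(512, 128), nn.ReLU(),   # cddd -> hidden layer 1
    nn.Linear(128, 64), nn.ReLU(),    # hidden layer 2
    nn.Linear(64, 9),                 # output: nine continuous properties, a in R^9
)
```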

The encoder gφ and the decoder network fθ are both RNNs, as illustrated in Figure 30.

Figure 30: The final model architecture comprises the translation model, where the encoder and decoder network each use three stacked GRU layers [Cho et al. (2014)] with sizes 512, 1024 and 2048. Additionally, the prediction network is included. Source: modified from the SI of Winter et al. (2018)

To explain the translation process further, the encoder RNN gφ takes as input the one-hot encoded token/character at timestep t, computes the hidden state for each of the three GRU layers and maps the concatenated hidden states of the three GRU layers of gφ (colored blue in Figure 30) to one fully-connected layer, which then outputs the molecular descriptor as a 512-dimensional vector activated with the tanh function. The decoder RNN fθ takes the latent cddd-representation as input and maps it into one fully-connected layer of size 512 + 1024 + 2048 = 3584, where the activated neurons of this layer are used to initialize the hidden states of the three recurrent GRU layers of the decoder (colored orange in Figure 30). Since the translation model is a Seq2Seq autoencoder, the decoder network predicts the class probability for each character of the SMILES vocabulary at timestep t, as the input to the encoder was also a token at timestep t. Hence, the decoder network needs as input the one-hot encoded token at timestep (t−1) and the hidden states initialized from the processed cddd-embedding in order to predict the character at timestep t. The hidden state of the last GRU layer is mapped to an output layer via one fully-connected layer with softmax activation function to predict probabilities for the different tokens, similar to the model by H. S. Segler et al. (2017) explained in Section 2.2.6.2. The complete translation model is trained by minimizing the cross-entropy between this probability distribution and the one-hot transformed correct characters in the target sequences, which constitutes the translation loss Lφ,θ between encoder and decoder, as well as by minimizing the mean-squared error of the property prediction Lφ,η via the prediction network dη.
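The following sketch illustrates how such an encoder/decoder pair could be implemented. Only the sizes stated above (GRU layers of 512, 1024 and 2048 units, a 512-dimensional tanh-activated cddd, an initialization layer of size 3584) are taken from the text; the batch-first tensor layout, vocabulary handling and all remaining details are assumptions and do not reproduce the exact implementation of Winter et al. (2018):

```python
# Sketch of encoder g_phi and decoder f_theta with three stacked GRU layers of
# different sizes (hence three separate nn.GRU modules). Tensors are batch-first.
import torch
import torch.nn as nn

GRU_SIZES = [512, 1024, 2048]
CDDD_DIM = 512

class Encoder(nn.Module):                          # g_phi
    def __init__(self, in_vocab):
        super().__init__()
        sizes = [in_vocab] + GRU_SIZES
        self.grus = nn.ModuleList(
            nn.GRU(sizes[i], sizes[i + 1], batch_first=True) for i in range(3)
        )
        self.to_cddd = nn.Linear(sum(GRU_SIZES), CDDD_DIM)

    def forward(self, x):                          # x: (batch, seq_len, in_vocab), one-hot
        out, finals = x, []
        for gru in self.grus:
            out, h_n = gru(out)                    # h_n: (1, batch, hidden)
            finals.append(h_n.squeeze(0))
        concat = torch.cat(finals, dim=-1)         # (batch, 512 + 1024 + 2048)
        return torch.tanh(self.to_cddd(concat))    # cddd: (batch, 512)

class Decoder(nn.Module):                          # f_theta
    def __init__(self, out_vocab):
        super().__init__()
        self.init_hidden = nn.Linear(CDDD_DIM, sum(GRU_SIZES))
        sizes = [out_vocab] + GRU_SIZES
        self.grus = nn.ModuleList(
            nn.GRU(sizes[i], sizes[i + 1], batch_first=True) for i in range(3)
        )
        self.to_logits = nn.Linear(GRU_SIZES[-1], out_vocab)

    def forward(self, cddd, y_shifted):            # y_shifted: one-hot tokens at t-1
        h0 = self.init_hidden(cddd)                # (batch, 3584)
        out = y_shifted
        for gru, h in zip(self.grus, torch.split(h0, GRU_SIZES, dim=-1)):
            out, _ = gru(out, h.unsqueeze(0).contiguous())
        return self.to_logits(out)                 # logits over the output vocabulary
```

During training, the decoder receives the target sequence shifted by one timestep (teacher forcing); at inference time, the token predicted at timestep t−1 would be fed back as the input for timestep t.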

The total empirical risk containing the cross-entropy and L2 loss is defined as

\[
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{n_v} y_j^{(i)} \log\left(\hat{y}_j^{(i)}\right) + \frac{1}{N} \sum_{i=1}^{N} \left(a^{(i)} - d_\eta(g_\phi(x^{(i)}))\right)^2, \quad \text{where } \hat{y}^{(i)} = f_\theta(g_\phi(x^{(i)})), \tag{34}
\]

and a^(i) is the i-th molecular property vector of a compound. Recall that this function is minimized w.r.t. φ, θ and η. Since dη takes a cddd as input, useful gradient information can be passed to the encoder network gφ when backpropagating errors, adjusting its parameters to create better molecular descriptors. This enforces that the translation model, besides performing well in translation, is also well suited to extract meaningful molecular descriptors from the input sequence x.
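Assuming the encoder, decoder and dη sketches from above, the combined objective of equation (34) could be computed as follows; the equal weighting of the two terms follows the equation as written:

```python
# Sketch of equation (34): token-level cross-entropy for the translation plus
# mean-squared error for the property prediction. `encoder`, `decoder` and
# `d_eta` refer to the sketches above. F.cross_entropy applies the softmax of
# the decoder's output layer internally to the raw logits.
import torch.nn.functional as F

def total_loss(x, y_shifted, y_target_idx, a):
    """x: one-hot input sequence, y_shifted: one-hot target sequence shifted by
    one timestep, y_target_idx: target token indices (batch, seq_len),
    a: molecular property vectors (batch, 9)."""
    cddd = encoder(x)                                     # g_phi(x)
    logits = decoder(cddd, y_shifted)                     # (batch, seq_len, vocab)
    translation_loss = F.cross_entropy(                   # L_{phi,theta}
        logits.reshape(-1, logits.size(-1)), y_target_idx.reshape(-1)
    )
    prediction_loss = F.mse_loss(d_eta(cddd), a)          # L_{phi,eta}
    return translation_loss + prediction_loss             # minimized w.r.t. phi, theta, eta
```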

The overall best translation model, Sml2canSml, which also takes the predictive modeling objective into account, is obtained when translating SMILES to their canonical form.

Figure 31: Performance of the best model on four different translation tasks during the first 20000 training steps. The Sml2canSml* run was trained without the additional predictive model dη. (a) Translation accuracy, (b) Mean performance on the lipophilicity regression task, (c) Mean performance on the Ames (bioactivity) classification task. For (b) and (c), the translation model at the respective step was used to extract the molecular descriptor cddd, which was fed into an SVM to model both tasks (on a QSAR validation set). Source: Winter et al. (2018)

Figure 31 shows the comparison of different translation tasks (i.e., which sequence type is translated from and to) regarding the translation accuracy, and additionally displays the predictive performance of the resulting molecular descriptor on two validation tasks. Regarding the translation accuracy alone, the translation from SMILES to canonical SMILES as well as from InCHI to SMILES performs well, and the pure autoencoding task from canonical SMILES to canonical SMILES performs best. However, when looking at the validation tasks in Figures 31b and 31c, the pure autoencoding task leads to molecular descriptors that are not well suited for the two predictive modeling tasks.

This strengthens the initial idea that the translation between two syntactically different sequences forces the translation model to capture the 'true' molecular essence that both input and output sequences have in common.