learning task from X. The optimisation

$$
(h_o^1, \ldots, h_o^M) = \operatorname*{argmin}_{h_v \in \mathcal{H}_v} \sum_{v=1}^{M} \left[ \nu_v \lVert h_v \rVert_{\mathcal{H}_v}^2 + \sum_{i=1}^{n} \left( \frac{\langle h_v, h_v^i \rangle_{\mathcal{H}_v}}{\sqrt{k_T(t_i, t_i)}} - k_T(t_o, t_i)\, \lVert h_v^i \rVert_{\mathcal{H}_v} \right)^2 \right] + \lambda \sum_{u,v=1}^{M} \sum_{j=1}^{m} \bigl( h_u(z_j) - h_v(z_j) \bigr)^2
$$

is called co-regularised corresponding projections (CoCP), where $\lambda, \nu_v > 0$ are hyperparameters and the final predictor for the orphan target is the average $h_o = \frac{1}{M} \sum_{v=1}^{M} h_o^v$. The objective of CoCP is equipped with an additional regularisation term utilising unlabelled instances, which hopefully results in an orphan hypothesis $h_o$ with improved predictive performance.
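For intuition, the co-regularisation term and the averaged orphan predictor can be sketched in a few lines of numpy. The prediction matrix `H` (with `H[v, j] = h_v(z_j)`) is a hypothetical input of this sketch, not an object from the text:

```python
import numpy as np

def coreg_penalty(H, lam):
    """Co-regularisation term lam * sum_{u,v=1}^M sum_{j=1}^m (h_u(z_j) - h_v(z_j))^2.

    H is an (M, m) array of view-wise predictions on the m unlabelled
    instances z_1, ..., z_m; each unordered pair (u, v) is counted twice,
    exactly as in the double sum of the objective.
    """
    diffs = H[:, None, :] - H[None, :, :]   # shape (M, M, m)
    return lam * float(np.sum(diffs ** 2))

def orphan_average(view_predictions):
    """Final orphan predictor h_o = (1/M) sum_v h_o^v for one ligand."""
    return float(np.mean(view_predictions))
```

The penalty is minimised when all views agree on the unlabelled instances, which is exactly the coupling the co-regularisation term enforces.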

Lemma 5.9. With the preconditions of Definition 5.8 as well as Equations 5.25 and 5.26, the system of equations

$$
\begin{pmatrix}
D_1 & -2\lambda U_1^T U_2 & \cdots & -2\lambda U_1^T U_M \\
-2\lambda U_2^T U_1 & D_2 & \cdots & -2\lambda U_2^T U_M \\
\vdots & \vdots & \ddots & \vdots \\
-2\lambda U_M^T U_1 & -2\lambda U_M^T U_2 & \cdots & D_M
\end{pmatrix}
\begin{pmatrix} \pi_o^1 \\ \pi_o^2 \\ \vdots \\ \pi_o^M \end{pmatrix}
=
\begin{pmatrix} \tilde{G}_1 \tilde{\rho}_o^1 \\ \tilde{G}_2 \tilde{\rho}_o^2 \\ \vdots \\ \tilde{G}_M \tilde{\rho}_o^M \end{pmatrix}
$$

delivers the solution of CoCP.
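Lemma 5.9 reduces CoCP to a single linear system in the stacked coefficient vectors π_o^1, …, π_o^M. A numpy sketch of assembling and solving it, assuming the blocks D_v, the matrices U_v and the right-hand sides G̃_v ρ̃_o^v are already available as arrays (shapes are my assumptions, not taken from the text):

```python
import numpy as np

def solve_cocp(D, U, rhs, lam):
    """Solve the block system of Lemma 5.9.

    D   : list of M square diagonal blocks D_v
    U   : list of M matrices U_v; off-diagonal block (u, v) is -2*lam*U_u^T U_v
    rhs : list of M right-hand-side vectors (the products of G~_v and rho~_o^v)
    Returns the per-view coefficient vectors pi_o^v.
    """
    M = len(D)
    rows = [np.hstack([D[u] if u == v else -2.0 * lam * U[u].T @ U[v]
                       for v in range(M)]) for u in range(M)]
    A = np.vstack(rows)                       # full block matrix
    pi = np.linalg.solve(A, np.concatenate(rhs))
    # Split the stacked solution back into the per-view coefficients.
    splits = np.cumsum([d.shape[0] for d in D])[:-1]
    return np.split(pi, splits)
```

For λ = 0 the system decouples into the M single-view problems, which is a convenient sanity check for the assembly.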

Proof. The reasoning is analogous to that of the proof of Lemma 5.7.

Analogous to the single-view case, whether MVCP or CoCP is to be favoured depends on the prerequisites on the candidate spaces and on the relation between the number of supervised targets n and the number of training instances N. The computational complexity is O(Mn) for MVCP and O(MN) for CoCP as a consequence of the respective matrix inversion.

5.4 Empirical Evaluation

ligands annotated with their binding affinity towards the protein as pKi-value (compare Section 1.3.1 on the biochemical background). More details on the datasets can be found in Appendix B. Hence, we have a data matrix Φ(X) ∈ R^{n×D} of n = 8928 ligands in a feature space R^D induced by the feature map Φ. The experimental framework and all figures were generated with Python 2.7², Jupyter Notebook [Kluyver et al., 2016] and Matplotlib [Hunter, 2007].

For the real-world learning task of orphan screening the feature map Φ is a vectorial representation of small molecular compounds from X. We apply the standard molecular fingerprints ECFP4 and GpiDAPH3 (compare also Section 1.3.2). Additionally, we consider 2 combined variants of the fingerprint formats ECFP4 and GpiDAPH3. Firstly, we use a concatenation of the respective ECFP4 and GpiDAPH3 fingerprint vectors to a final vectorial representation of length D = 30812 called Concat. The second combined fingerprint was obtained by a Johnson-Lindenstrauss (JL) projection according to Section 2.7.1 applied to the concatenated fingerprint Concat and will be denoted JL-Concat. In order to obtain the JL property from Equation 2.35, we chose an image dimension of d = 1000 for the n = 8928 instances such that, with an error bound of ε = 0.1, the dimension d is approximately (ln n)ε⁻² according to Section 2.7.1 above. Furthermore, we generated the random projection matrix P ∈ R^{d×D} such that for i = 1, ..., d and j = 1, ..., D

$$
(P)_{i,j} = \begin{cases} -\frac{1}{\sqrt{1000}} & \text{with probability } p = 0.5 \\ +\frac{1}{\sqrt{1000}} & \text{with probability } (1 - p) = 0.5 \end{cases}
$$

to satisfy Equation 2.36. For more details on the choice of P consult Section 2.7.1 on JL random projections. The JL projection-based ligand representation JL-Concat pursues the idea of information transfer based on projections. In contrast to the Concat representation, JL-Concat induces a baseline approach with a low-dimensional feature space.
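The JL setup above can be sketched as follows; the 1/√d scaling (1/√1000 for d = 1000) is my reading of the entries of P, and the dimension check simply plugs in the values n = 8928 and ε = 0.1 from the text:

```python
import math

import numpy as np

# Dimension check: d should be of the order (ln n) * eps^-2.
n, eps = 8928, 0.1
d_bound = math.log(n) / eps ** 2      # roughly 910, so d = 1000 fits

def jl_projection_matrix(d, D, seed=None):
    """Sample P in R^{d x D} with i.i.d. entries -1/sqrt(d) or +1/sqrt(d),
    each sign with probability 0.5 (a sketch; the text fixes d = 1000 and
    D = 30812)."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(d, D)) / math.sqrt(d)
```

A fingerprint matrix `Phi_X` of shape (n, D) is then projected to the (n, d) JL-Concat representation via `Phi_X @ P.T`.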

We test and compare CP in its NLCP implementation from Section 5.3.3 with baseline approaches applied to the learning problem of orphan screening. For the sake of simplicity we will refer to the algorithm as CP. An overview of the considered baselines can be found in Table 5.1. With SCP we refer to the weighted sum of supervised hypotheses defined in Equation 5.11. The TLK predictor assigns an affinity value to pairs of targets and ligands and is described in detail in Section 5.2. For the experiments each of the 9 protein targets is assumed to be the orphan target and the respective other 8 targets serve as supervised targets. In contrast, the TLK variant TLK-Clo-3 only incorporates the 3 closest targets of the orphan protein t_o as supervised targets. In this context, closest (farthest) refers to the protein with the biggest (smallest) similarity or kernel value compared to the orphan protein. Given a fixed orphan target, with Avg we refer to the average predictor of the respective other 8 supervised targets. Analogous to TLK-Clo-3, the Avg-Clo-3 algorithm only incorporates the 3 respective closest proteins for the average predictor. The baselines Closest Protein and Farthest Protein use the supervised hypothesis of the closest and the farthest supervised target, respectively.

With Supervised-l% we refer to standard SVR with l%, l ∈ {5, 10, 30, 50, 80}, labelled training data. SVR is in a sense an optimal but unfair baseline, as for an orphan target there is actually no labelled data available. We compare the performance of the described algorithms using the standard molecular representation formats ECFP4 and GpiDAPH3

²https://www.python.org/

to the performance results obtained with the fingerprints Concat and JL-Concat. These combined fingerprints add a canonical multi-view approach to our experiments.

In order to simulate the real-world scenario of orphan screening, each of the 9 proteins was assumed to be the orphan target once and the respective 8 others the supervised targets. For a fixed orphan target, we drew 240 ligands from each of the remaining sets and repeated this procedure 10 times. We report RMSE values (compare Section 2.2) to evaluate the regression performance of the different algorithms, averaged over the 10 folds for every orphan target. The choice of ν in Equation 5.15 posed a problem, as a labelled training set of 8 supervised targets and corresponding supervised hypotheses for the general assignment of hypotheses from H to targets from T was not sufficient to perform a reasonable hyperparameter tuning procedure. We observed in preliminary experiments that the results were barely affected by the choice of ν. Therefore, we fixed ν = 5.0 for all orphan targets. Furthermore, we introduced a small modification in the objective of NLCP in Equation 5.15 with β_o = [νG + λI_n + GNG]⁻¹ρ_o and λ = 1.0.

The summand λI_n is an additional regularisation term and ensures the existence of the inverse [νG + λI_n + GNG]⁻¹ if λ is large enough. For the initial training procedure of the supervised hypotheses for supervised targets we applied a 3-fold cross-validation.
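The evaluation protocol above (each protein orphan once, 240 ligands drawn per supervised target, 10 repetitions) can be sketched as follows; `ligand_pools` is a hypothetical stand-in for the real per-target ligand sets:

```python
import numpy as np

def draw_split(orphan, ligand_pools, rng, n_draw=240):
    """One repetition: the given target is the orphan, every other target
    contributes n_draw ligands drawn without replacement from its pool.

    ligand_pools maps target -> array of ligand indices (a stand-in for
    the real datasets, which are not reproduced here).
    """
    return {t: rng.choice(pool, size=n_draw, replace=False)
            for t, pool in ligand_pools.items() if t != orphan}

# Full protocol: 9 orphan choices, 10 repetitions each, RMSEs averaged
# over the repetitions for every orphan target.
```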

The hyperparameters ν and ε of the SVR algorithm were optimised within the ranges ν ∈ {2⁻⁵, 2⁻⁴, ..., 2⁵} and ε ∈ {0.1, 0.01, 0.001}. For a fair comparison between CP and baseline results, we applied our own SVR implementation based on Definition 2.24 for the determination of supervised hypotheses and the Supervised-l% baseline. For the ligands, we used a linear kernel applied to the standard and combined fingerprint representations. The similarity (target kernel) values k_T for CP according to Section 5.3.3 were derived from a positive semi-definite similarity matrix for proteins. The contained similarity values were calculated based on amino acid sequence similarity measures and normalised.
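The RMSE metric and the hyperparameter grid from this section can be written down directly; the SVR training itself (the text's own implementation from Definition 2.24) is not reproduced here:

```python
from itertools import product

import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, the evaluation metric of Section 2.2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Grid from the text: 11 values for nu times 3 values for eps, each
# candidate evaluated with 3-fold cross-validation.
nu_grid = [2.0 ** k for k in range(-5, 6)]
eps_grid = [0.1, 0.01, 0.001]
candidates = list(product(nu_grid, eps_grid))
```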

5.4.2 Results

The experiments in the present empirical section augment the results presented in the work of Giesselbach et al. [2018]. Figure 5.2 shows two boxplots with the RMSE results for CP and all baselines from Table 5.1. The RMSEs are averaged over all 9 proteins and all 10 ligand draws for the supervised models. We report averaged RMSEs for the

Name               Description
Simplified         SCP approach from Equation 5.11
TLK                TLK approach from Section 5.2
TLK-Clo-3          TLK approach from Section 5.2
Avg                Average of supervised predictors
Avg-Clo-3          Average of supervised predictors
Closest Protein    Predictor of the closest protein
Farthest Protein   Predictor of the farthest protein
Supervised-l%      Standard SVR with l% of labelled data

Table 5.1: Overview of baseline approaches

standard molecular fingerprints ECFP4 (a) and GpiDAPH3 (b). To start with, we observe a generally worse performance of all approaches with GpiDAPH3 in comparison to the application of ECFP4. In both cases, CP outperforms the Avg baseline, which does not make use of inter-target similarities at all. For the ECFP4 fingerprint, CP also beats all other baselines that make use of these similarities, exhibiting an average RMSE of 2.197. However, this is not the case with respect to the Simplified and Avg-Clo-3 approach if GpiDAPH3 was used. In order to understand how the similarity relation of proteins affects the prediction quality, we compare the CP performance with the performance of the Closest Protein and Farthest Protein model. The fact that Closest Protein performs much better than Farthest Protein supports the intuition that the molecular similarity principle [Bender and Glen, 2004] does not only hold for small compounds but also for proteins, in particular, for the orphan protein. The molecular similarity principle introduced in Section 1.3.3 states that similar molecules are supposed to have similar properties with respect to binding and vice versa. The modified average model Avg-Clo-3 of the 3 most similar targets compared to the orphan target yields a significant performance improvement both for ECFP4 and GpiDAPH3. Again, the orphan protein's binding model obviously profits from the focus on closer proteins. Additionally, we compare with the state-of-the-art approach TLK for orphan screening which incorporates both target and ligand similarities. It was introduced in detail in Section 5.2 and basically solves a supervised problem in terms of target-ligand pairs. CP outperforms TLK for both fingerprints. However, an advantage of the TLK approach is that no supervised hypotheses have to be learned for proteins with training information in a preliminary step. Regarding TLK-Clo-3 we could not show the positive effect of emphasising closer targets during model calculation which we have seen for Avg-Clo-3 versus Avg. The approach closest to CP in terms of RMSE is the Simplified approach.

Therefore, depending on the precise learning task at hand, it might be a valuable alternative to CP because of its shorter running time. Supervised-l% denotes the standard supervised SVR algorithm which uses l% of the available data as labelled training examples. As CP operates in a learning scenario with no labelled training information for the orphan target, Supervised-l% outperforms CP as expected. We pursued the experiments with CP and baselines for orphan screening using 2 combined fingerprints as canonical multi-view representations of small molecular compounds. In Figures 5.3 (a) and 5.3 (b) we observe that the considered algorithms show a very similar performance when applying the combined variants Concat and JL-Concat compared to the ECFP4 fingerprint (see Figure 5.2 (a) above). This will be discussed in the following section.

5.4.3 Discussion

Orphan screening is a challenging and important real-world learning problem. More precisely, we investigated the task of ligand affinity prediction for a protein with no labelled training compounds. We defined CP and variants of it as a novel kernel method to master this unsupervised problem. The approach of CP is to firstly derive protein-ligand binding models for protein targets with labelled training data. Secondly, with further information about the relations between protein targets, the knowledge about protein-ligand binding is transferred to the orphan target. For this reason, CP can be assigned to transfer learning or multi-task learning as well. Supervised learning algorithms are based on labelled data and their results typically degrade with a decreasing number of labelled examples. For both the standard molecular fingerprints ECFP4 and GpiDAPH3 and the combined representations Concat and JL-Concat we observed that

Figure 5.2: RMSEs of CP and baselines averaged over all proteins and draws using fingerprint ECFP4 (a) and GpiDAPH3 (b)


Figure 5.3: RMSEs of CP and baselines averaged over all proteins and draws using the fingerprints Concat (a) and JL-Concat (b)


5.5 Future Work: Orphan Principal Component Analysis