
2.7 Dimensionality Reduction

Mapping data objects to a lower-dimensional space reduces memory requirements and computing time. Moreover, it is a reasonable demand that the mapped objects retain nearly all of the information contained in the original feature space representation. Dimensionality reduction, as an unsupervised data-driven downscaling, can be regarded as a learning task in itself or as a tool for solving another learning task, as is the case, for example, in Chapter 5. There are approaches which take multiple views on data into account to calculate an appropriate reduction model, e.g., canonical correlation analysis (CCA). However, the following two methods only consider one single feature space.

2.7.1 Johnson-Lindenstrauss Random Projection

Firstly, we present a random projection technique for dimensionality reduction based on the work of Dasgupta and Gupta [2003]. We consider an arbitrary but fixed data matrix Φ(X) ∈ R^{n×D}, such that Φ^T(x_1), …, Φ^T(x_n) are the rows according to Equation 2.2 and Φ : X → R^D is the feature map for instances from X. The Johnson-Lindenstrauss (JL) lemma [Dasgupta and Gupta, 2003] states that for a well-defined projection mapping f : R^D → R^d, d ≤ D, the distances between instances from X remain approximately the same in the image space compared to the distances in the initial feature space. More formally, for two instances x, x' ∈ X

(1 − ε) ‖Φ(x) − Φ(x')‖^2 ≤ ‖f(Φ(x)) − f(Φ(x'))‖^2 ≤ (1 + ε) ‖Φ(x) − Φ(x')‖^2    (2.35)

holds true, where 0 < ε < 1 is a small error bound. For more details on the preconditions of the mapping f and the proof of the JL lemma we refer to Dasgupta and Gupta [2003].

As it will be applied in the empirical section of Chapter 5, we present an example of a concrete projection f that fulfils the requirements of the JL lemma. We consider the data matrix Φ(X) ∈ R^{n×D} which we intend to map to a lower dimension d. Indeed, the mapping

f(Φ(x)) = (1/√d) P^T Φ(x),    (2.36)

where P ∈ R^{D×d} consists of Bernoulli random variables with a probability of success of p = 0.5 and Φ(x) ∈ R^D, is a valid JL projection [Baraniuk et al., 2008] if d ∈ O((ln n) ε^{-2}) holds true.
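To make the construction concrete, the following minimal sketch (not part of the original text) draws a sign matrix with entries ±1, each chosen with probability 0.5, applies the scaling 1/√d from Equation 2.36, and empirically checks the distortion bound (2.35) on synthetic Gaussian data. The chosen value of d uses the stated order d ∈ O((ln n) ε^{-2}) without its constant, so the bound is only expected to hold approximately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: n instances in a D-dimensional feature space.
n, D = 500, 10_000
eps = 0.2                                   # distortion bound 0 < eps < 1
d = int(np.ceil(np.log(n) / eps**2))        # target dimension, d in O(eps^-2 ln n)

Phi_X = rng.normal(size=(n, D))             # stand-in for the data matrix Phi(X)

# Sign matrix with entries +1/-1, each drawn with probability 0.5,
# scaled by 1/sqrt(d) as in Equation 2.36.
P = rng.choice([-1.0, 1.0], size=(D, d))
f_Phi_X = Phi_X @ P / np.sqrt(d)            # rows are f(Phi(x_i)) = (1/sqrt(d)) P^T Phi(x_i)

# Empirical check of the JL bound (2.35) on random instance pairs.
i, j = rng.integers(n, size=(2, 200))
mask = i != j                               # avoid comparing an instance with itself
orig = np.sum((Phi_X[i[mask]] - Phi_X[j[mask]]) ** 2, axis=1)
proj = np.sum((f_Phi_X[i[mask]] - f_Phi_X[j[mask]]) ** 2, axis=1)
print((proj / orig).min(), (proj / orig).max())   # roughly within [1 - eps, 1 + eps]
```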

2.7.2 Principal Component Analysis

For the introduction of the second unsupervised approach we explicitly consider the feature map Φ : X → R^D, where D is the dimension of the initial feature space for instances. The cumulative variance contained in the feature space components can be used as an indicator of the information content or to monitor the information loss in the case of dimensionality reduction. Therefore, the idea of principal component analysis (PCA) [Schölkopf et al., 1997] is to learn an orthogonal transformation of the feature space such that the resulting projection of the data keeps as much intrinsic variance as possible in decreasing order of the resulting components [Schölkopf and Smola, 2002].

This demand can be formulated as an eigenvector problem which we will introduce briefly in the following.

We assume the data representations to be centered, i.e., the mean of every feature space component is supposed to be zero. Let Φ(X) ∈ R^{n×D} be the data matrix in the initial feature space corresponding to instances x_1, …, x_n ∈ X,

Φ(X) = (Φ(x_1), …, Φ(x_n))^T,

where Φ(x) ∈ R^D for all x ∈ X, i.e., the rows of Φ(X) are Φ^T(x_1), …, Φ^T(x_n). Now we aim at a projection matrix P ∈ R^{D×d} such that the projected data

Φ(X)P ∈ R^{n×d}

exhibits the desired properties of maximal variance within a smaller number d ≤ D of projection image components. It turns out that the columns p of P are actually the eigenvectors of the empirical covariance matrix

C = (1/n) Φ^T(X) Φ(X) ∈ R^{D×D}

according to the eigenvector-eigenvalue equation Cp = λp, where λ is an eigenvalue.

This eigenvector problem can be solved as an ERM problem according to Equation 2.7.

For the precise formulation and more details consult Schölkopf et al. [1997] and Schölkopf and Smola [2002].
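As an illustration of the eigenvector formulation above, the following minimal Python sketch (assuming NumPy and synthetic data; not the exact formulation of Schölkopf et al.) computes the projection matrix P from the eigenvectors of the empirical covariance matrix C and applies it to the centred data matrix.

```python
import numpy as np

def pca_projection(Phi_X: np.ndarray, d: int) -> np.ndarray:
    """Project the data matrix Phi(X) onto its d leading principal components."""
    n = Phi_X.shape[0]
    Phi_X = Phi_X - Phi_X.mean(axis=0)       # centre every feature space component
    C = Phi_X.T @ Phi_X / n                  # empirical covariance C = (1/n) Phi^T(X) Phi(X)
    _, eigvecs = np.linalg.eigh(C)           # eigh returns eigenvalues in ascending order
    P = eigvecs[:, ::-1][:, :d]              # eigenvectors of the d largest eigenvalues
    return Phi_X @ P                         # projected data Phi(X) P

# Toy usage: data with a two-dimensional intrinsic structure embedded in R^10.
rng = np.random.default_rng(1)
Phi_X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))
print(pca_projection(Phi_X, d=2).shape)      # (200, 2)
```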

With regard to the kernelised PCA formulation, let k : X × X → R be a kernel function with canonical feature map Φ(x) = k(x, ·) as well as k(x, x') = ⟨Φ(x), Φ(x')⟩ for all x, x' ∈ X. If p is an eigenvector solution of the PCA optimisation, as a consequence of Theorem 2.21 there are coefficients π_1, …, π_n ∈ R such that p has a representation of the form p = Φ^T(X)π. Consequently, we obtain a kernelised formulation via

Φ(X)P = Φ(X)Φ^T(X)Π = KΠ,    (2.37)

where K = Φ(X)Φ^T(X) ∈ R^{n×n} is the Gram matrix and Π ∈ R^{n×d} a projection of K.

If the inverse K^{-1} exists, we conclude from λp = Cp that

λ (Φ^T(X)π) = C (Φ^T(X)π)
λ Φ(X)(Φ^T(X)π) = Φ(X) C (Φ^T(X)π)
λ Φ(X)Φ^T(X)π = (1/n) Φ(X)Φ^T(X)Φ(X)Φ^T(X)π
λ Kπ = (1/n) K^2 π
nλπ = Kπ.

Hence, the kernelised PCA algorithm is a modified eigenvector problem and its result Π is a projection of the kernel matrix K. Let (γ, π̃) be a pair of an eigenvalue and corresponding eigenvector of K. The scaled eigenvectors π = π̃/√γ build the columns of Π in Equation 2.37. The scaling of the eigenvectors π̃ arises from the requirement of an orthonormal basis in P from above, where p^T p = 1 is necessary. The final number of columns d ≤ min{n, D} must be chosen depending on the practical purpose and the data itself.
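The kernelised procedure can be summarised in a short sketch, again assuming NumPy and an illustrative RBF kernel (both assumptions of this example, not prescribed by the text). The Gram-matrix centring step is an added detail that corresponds to the centring assumption made above for the linear case.

```python
import numpy as np

def kernel_pca(K: np.ndarray, d: int) -> np.ndarray:
    """Kernel PCA on a Gram matrix K; returns the projected data K @ Pi (Equation 2.37)."""
    n = K.shape[0]
    # Centre the Gram matrix in feature space (corresponds to the centring assumption above).
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ K @ H
    gammas, pis = np.linalg.eigh(K)                       # eigenpairs (gamma, pi~) of K, ascending
    gammas, pis = gammas[::-1][:d], pis[:, ::-1][:, :d]   # d largest eigenvalues and eigenvectors
    Pi = pis / np.sqrt(np.maximum(gammas, 1e-12))         # scale pi = pi~ / sqrt(gamma) so p^T p = 1
    return K @ Pi

# Usage with an RBF kernel as an illustrative choice of kernel function.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2.0 * X.shape[1]))
print(kernel_pca(K, d=3).shape)                           # (100, 3)
```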

The PCA approach for dimensionality reduction can be applied with arbitrary kernel functions, as the feature vectors are only required in the form of inner products. For more details on the derivation of PCA and its properties we refer to Schölkopf et al. [1997, 1998], Schölkopf and Smola [2002] and Shawe-Taylor and Cristianini [2004].

An equivalent formulation of the PCA eigenvector-eigenvalue problem will be used in Chapter 5. Let M ∈ R^{d×d} be an arbitrary symmetric matrix. Hence, M has a so-called eigenvalue decomposition [Werner, 1995] of the form

M = U D U^T,    (2.38)

where U ∈ R^{d×d} is an orthogonal matrix whose columns are the eigenvectors of M. This decomposition into U and D is called diagonalisation. The factor D ∈ R^{d×d} is a diagonal matrix whose diagonal entries are the corresponding eigenvalues of M.

Equation 2.38 is equivalent to U^T M U = D. Regarding the reformulation of the PCA optimisation we exploit the fact that the value of

max_{U ∈ R^{d×d'}}  tr(U^T M U)    (2.39)
s.t.  U^T U = I_{d'}

is reached if the columns of U are the eigenvectors corresponding to the d' largest eigenvalues of M [Werner, 1995]. Hence, the maximal value is the sum of the d' largest eigenvalues.
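A small numerical check of the trace-maximisation property (2.39), assuming NumPy and a randomly generated symmetric matrix, could look as follows; the names M, U_opt and d_prime are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_prime = 6, 3

A = rng.normal(size=(d, d))
M = (A + A.T) / 2                                   # arbitrary symmetric matrix M = U D U^T
eigvals, eigvecs = np.linalg.eigh(M)                # ascending eigenvalues

# Columns of U taken as the eigenvectors of the d' largest eigenvalues attain the maximum.
U_opt = eigvecs[:, ::-1][:, :d_prime]
print(np.trace(U_opt.T @ M @ U_opt))                # equals the sum of the d' largest eigenvalues
print(eigvals[::-1][:d_prime].sum())

# Any other matrix with orthonormal columns yields a value that is no larger.
Q, _ = np.linalg.qr(rng.normal(size=(d, d_prime)))
print(np.trace(Q.T @ M @ Q))
```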

Chapter 3

Multiple Kernel Learning

We introduced ligand affinity prediction as an important problem from chemoinformatics in detail in Section 1.3.4 of the introductory chapter. In order to support the expensive identification of ligand affinities in practice and to plan experiments efficiently, machine learning methods can be used to predict affinity values via computational algorithms in the context of similarity-based virtual screening. SVR utilising a molecular fingerprint is the state-of-the-art method and has already been tested successfully [Liu et al., 2006, Sugaya, 2014, Balfer and Bajorath, 2015]. This supervised approach for regression employs labelled instances in vectorial format in order to train an inductive model for future instances. Many publicly available or commercial fingerprint descriptors for small molecules exist. These fingerprints list (or count) diverse physico-chemical properties of the respective molecule, structural properties of their molecular graphs in 2D, or even 3D information [Bender et al., 2009]. The variety of data descriptions here is both a blessing and a curse. On the one hand, there are many different data representations available, which were originally designed for different purposes. On the other hand, the variety of representations implies the need to choose the optimal one for the affinity prediction task. In the first instance, it is not obvious which molecular representation is optimal for a considered prediction task from chemoinformatics [Fröhlich et al., 2005]. A branch of chemoinformatics research considers fingerprint reduction and recombination techniques to design an optimal fingerprint and select the most informative features for prediction [Willett, 2006, Nisius and Bajorath, 2010, Heikamp and Bajorath, 2012].

In contrast to the described approaches, in the present chapter we investigate strategies to deal with the variety of descriptors by including multiple representations simultaneously. To this aim, a group of multi-view learning approaches named multiple kernel learning (MKL) trains a linear combination of predictor functions on labelled training data such that each function is related to a particular representation of the data (see Section 1.2.3). Although multiple views are not completely new to chemoinformatics (compare Section 1.3.6), the application of supervised MKL approaches in combination with a systematic choice of graph patterns is novel in the field of ligand affinity prediction. These MKL approaches will be the first group of multi-view learning methods which we investigate in this thesis. Beyond our focused task of affinity prediction, the proposed approaches below are generally applicable for learning tasks with

• instances that can be interpreted as graphs,

• multiple data representations with appropriate similarity measures (kernel functions),

• a real-valued label, and

• sufficient labelled examples.

The following real-world applications display examples which illustrate the described learning scenario.

Example 3.1. (Drug discovery) The interaction of chemical substances with each other, such as the binding of a small molecule to a protein, needs to be tested practically in a time- and cost-consuming process. However, the efforts made in this research field are justified by the fact that many drugs act as protein ligands. Documented laboratory results have meanwhile led to the formation of huge molecule databases with ligands and their respective protein affinities. Various kinds of molecular fingerprint descriptors have been developed (see Section 1.3.4) and can be used to represent small molecules differently.

This information can be used to learn binding models of proteins with supervised algorithms.

Example 3.2. (Temperature forecast) The development of the climate has now been recorded for decades in great detail and nearly on a worldwide basis. In view of a dramatic increase of the temperature on earth, its forecast is no longer only important for the weather of the following days. The temperature at a certain location in the world is recorded together with a variety of characteristic information, such as physical information (air pressure, humidity, cloudiness, wind strength and direction), geographical information (soil conditions, temperature zone, and vegetation), local information (GPS position, height, hillside situation), different wavelength sensors from satellite data, and neighbourhood information. Using multi-view learning the temperature can be forecast taking various information sources on climate and environment into account.

Example 3.3. (Condition monitoring and predictive maintenance) In assembly and production a number of input parameters describe the process conditions. Additionally, accompanying equipment like microphones or acceleration sensors records the progress of the production process and the quality of the involved tools and products. With the aim of maximal product quality and resource efficiency, the prediction of present and future tool condition parameters is an important application of supervised multi-view algorithms.

Apart from the algorithmic aspect of this chapter, we additionally address the topic of view generation and analysis for graph data, such as small molecules. Actually, both structural and neighbourhood information are crucial for the capacity of small molecules to be a ligand and for the strength of the binding [Ralaivola et al., 2005, Gaüzère et al., 2014]. For example, the presence of a benzene ring or that of an alcoholic group and their relative positions influence the chemical properties of the compound at hand.

None of the existing fingerprints that collect structural information, however, captures both all circular and tree patterns of the molecular graph independent of size and the adjacency and connectivity information of atoms within the graph structure. With the aim of an optimised graph representation for the practical task at hand, we propose to investigate and systematically combine graph kernels that incorporate relevant patterns for structural and neighbourhood information. To be more precise, we consider the feature set of the cyclic pattern kernel (CPK) [Horváth et al., 2004, Horváth, 2005], the

3.1 Graph Kernels