
built according to Equation 2.10. Subsequently, from the derivatives of $L$ with respect to the problem variables and to the dual variables corresponding to equality constraints (compare Equations 2.11 and 2.12), one derives a dual formulation, which can be solved efficiently. The equality and inequality constraints added through the application of the KKT conditions are necessary for the correct solution of the optimisation problem and for the admissible ranges of the solution parameters (compare Section 2.6.2). We will apply the techniques from convex optimisation in order to find appropriate predictor functions.

In order to distinguish predictor functions from objectives, we will denote the objective function in the following chapters with $Q$, the predictor function with $f$, and the complexity class symbol with $\mathcal{O}$.

2.5 Kernel Methods

Definition 2.16 (Kernel function). [Steinwart and Christmann, 2008] Let $\mathcal{X}$ be a set of instances. The function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a kernel if there is a Hilbert space $\mathcal{H}$ and a mapping $\Phi : \mathcal{X} \to \mathcal{H}$ such that

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}} \tag{2.16}$$

for all $x, x' \in \mathcal{X}$. The function $\Phi$ is called feature map and the space $\mathcal{H}$ in this context is called feature space. The matrix

$$K = \big(k(x_i, x_j)\big)_{i,j=1}^{n} \tag{2.17}$$

is the Gram matrix of kernel $k$ with respect to the instances $x_1, \ldots, x_n \in \mathcal{X}$.

Actually, for a given kernel $k$ there are infinitely many isometrically isomorphic feature maps $\Phi$ and Hilbert spaces $\mathcal{H}$ such that Equation 2.16 holds true [Minh et al., 2006]. In the case of $\mathcal{H} = \mathbb{R}^d$, the name feature space is very intuitive, as instances $x \in \mathcal{X}$ are mapped to $d$ feature values $\Phi(x)$. This leads us to the definition of a view as a central concept of this thesis. It formalises the fact that a view on data is essentially a particular feature map of the data instances.

Definition 2.17 (View). Let $\mathcal{X}$ be an arbitrary instance space and $\mathcal{H}$ a feature space. The representation $\Phi(\mathcal{X})$ of the instance space by a feature map

$$\Phi : \mathcal{X} \to \mathcal{H}$$

is called a view on data. We will denote a set of $M$ feature representations

$$\Phi_1 : \mathcal{X} \to \mathcal{H}_1, \; \ldots, \; \Phi_M : \mathcal{X} \to \mathcal{H}_M \tag{2.18}$$

as multiple views of the instance space $\mathcal{X}$. We refer to the respective view by its index $v \in \{1, \ldots, M\}$.

In addition to the feature space representation, the property of positive semi-definiteness will play an important role for kernel functions.

Definition 2.18 (Positive semi-definiteness). [Steinwart and Christmann, 2008] Let $\mathcal{X}$ be an arbitrary instance space. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called positive semi-definite if and only if

$$\sum_{i,j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) \geq 0 \tag{2.19}$$

for all $n \in \mathbb{N}$, $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$, and $x_1, \ldots, x_n \in \mathcal{X}$.

If the inequality in Equation 2.19 is strict whenever the coefficients $\alpha_1, \ldots, \alpha_n$ are not all zero, the function is said to be positive definite. It turns out that the property of positive semi-definiteness and the property of a function to be a kernel are actually equivalent, as a consequence of the following theorem [Steinwart and Christmann, 2008].

Theorem 2.19. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel according to Definition 2.16 if and only if it is symmetric and positive semi-definite.

Table 2.1: Examples of kernel functions ($x, x' \in \mathbb{R}^d$)

Linear kernel: $k(x, x') = \langle x, x' \rangle$
Tanimoto kernel: $k(x, x') = \dfrac{\langle x, x' \rangle}{\langle x, x \rangle + \langle x', x' \rangle - \langle x, x' \rangle}$
Polynomial kernel: $k(x, x') = \big(\langle x, x' \rangle + c\big)^{d}$, $c \geq 0$
Gaussian kernel: $k(x, x') = \exp\!\left(-\dfrac{\|x - x'\|^2}{2\sigma^2}\right)$, $\sigma > 0$

As an interpretation of their definition, kernel functions can be regarded as generalised similarity measures [Lanckriet et al., 2004c, Vert et al., 2004] between two objects $x$ and $x'$ from an instance space $\mathcal{X}$ of interest. In contrast to mathematical measures, a kernel function is in general not normalised. Regardless, we use the expression similarity measure below in order to exemplify the concept of kernel functions. A number of established kernel functions for vectors from $\mathbb{R}^d$ can be found in Table 2.1, where $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ denote the Euclidean norm and scalar product in $\mathbb{R}^d$. Kernels for graph objects play an important role in chemoinformatics; an introduction to graph kernels and related work can be found in Section 1.3.3 above and in the introduction of Chapter 3 below. The kernel property is preserved under summation of two kernel functions and under multiplication of a kernel with a positive constant. Furthermore, the tensor product of two kernel functions is again a kernel function [Steinwart and Christmann, 2008].
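To illustrate the kernels of Table 2.1 and the closure under summation and positive scaling, the following minimal numpy sketch (purely illustrative; helper names such as gram are not taken from the cited literature) computes the kernel values and a combined Gram matrix for a handful of random instances.

```python
import numpy as np

# Kernels from Table 2.1 for vectors x, x' in R^d (illustrative implementations).
def linear(x, y):
    return x @ y

def tanimoto(x, y):
    return (x @ y) / (x @ x + y @ y - x @ y)

def polynomial(x, y, c=1.0, degree=3):
    return (x @ y + c) ** degree

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def gram(kernel, X):
    """Gram matrix K with K[i, j] = kernel(x_i, x_j), compare Equation 2.17."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # five instances from R^3

# Closure properties: the sum of two kernels and a positively scaled kernel
# are kernels again, so the combined Gram matrix remains positive semi-definite.
K_combined = gram(linear, X) + 2.0 * gram(gaussian, X)
```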

Analogous closure properties hold for Gram matrices. From the definition it is obvious that kernel functions and Gram matrices are strongly related to the concepts of covariance functions of random variables and covariance matrices (compare also Section 2.7). For more details on the stochastic interpretation of kernel functions and on their origin in the context of integral operators, consult the literature on kernel theory [Aronszajn, 1950, Schölkopf and Smola, 2002, Shawe-Taylor and Cristianini, 2004, Minh et al., 2006].

The property of a function to be a kernel comes along with the positive semi-definiteness of the corresponding kernel matrices. A symmetric matrix $K \in \mathbb{R}^{n \times n}$ is said to be positive semi-definite if and only if for all $\alpha \in \mathbb{R}^n$

$$\alpha^T K \alpha \geq 0 \tag{2.20}$$

holds true. Hence, a function $k$ is a kernel if all its Gram matrices are positive semi-definite. There are optimisation problems below where the Hessian matrix is equal to the Gram matrix of a kernel function. Consequently, this Hessian matrix is positive semi-definite, and the corresponding optimisation problem has a convex objective function according to Definition 2.8 above. For the resulting convex optimisation problem we can apply the solution techniques from Section 2.4.
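The condition in Equation 2.20 can be verified numerically, for instance via the quadratic form for a random coefficient vector or via the eigenvalue spectrum of $K$. The following sketch is only an illustration for a Gram matrix of the linear kernel; the helper name is_psd is hypothetical.

```python
import numpy as np

# Numerical check of Equation 2.20 for a symmetric matrix K:
# alpha^T K alpha >= 0 for all alpha, equivalently all eigenvalues are non-negative.
def is_psd(K, tol=1e-10):
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
K = A @ A.T                        # Gram matrix of the linear kernel on the rows of A

alpha = rng.normal(size=6)
print(alpha @ K @ alpha >= 0)      # quadratic form of Equation 2.20
print(is_psd(K))                   # eigenvalue criterion agrees
```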

As already mentioned above, every kernel function induces a function space. Function spaces of this kind are chosen as candidate spaces in kernel methods.

Definition 2.20 (Reproducing kernel Hilbert space). [Steinwart and Christmann, 2008] A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called the reproducing kernel of the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ if and only if

1. $k(x, \cdot) \in \mathcal{H}_k$ for all $x \in \mathcal{X}$, and

2. $f(x) = \langle k(x, \cdot), f \rangle_{\mathcal{H}_k}$ for all $x \in \mathcal{X}$ and all $f \in \mathcal{H}_k$.

The second property is also known as the reproducing property. Finally, it can be shown that the property of a function to be a kernel and to be a reproducing kernel are indeed equivalent [Steinwart and Christmann, 2008]. The RKHS $\mathcal{H}_k$ is a feature space of kernel $k$, which can be seen via the canonical feature map

$$\Phi_k(x) = k(x, \cdot), \qquad x \in \mathcal{X},$$

together with the reproducing property, as we may conclude

$$k(x, x') = \langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{H}_k} = \langle \Phi_k(x), \Phi_k(x') \rangle_{\mathcal{H}_k}$$

for all $x, x' \in \mathcal{X}$. In contrast to feature maps and feature spaces in general, every reproducing kernel has a uniquely defined RKHS and vice versa [Steinwart and Christmann, 2008].

The RKHS $\mathcal{H}_k$ of the reproducing kernel $k$ is one of its corresponding infinitely many feature spaces $\mathcal{H}$, according to the canonical feature map. For this reason, from now on we will omit the index $k$ in the RKHS as long as the corresponding kernel is obvious.
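As a brief illustration (a standard example added here for concreteness rather than taken from the cited references), consider the linear kernel $k(x, x') = \langle x, x' \rangle$ on $\mathcal{X} = \mathbb{R}^d$. Its RKHS consists of the linear functions $f_w(\cdot) = \langle w, \cdot \rangle$ with $w \in \mathbb{R}^d$ and inner product $\langle f_w, f_v \rangle_{\mathcal{H}_k} = \langle w, v \rangle$. The canonical feature map is $\Phi_k(x) = k(x, \cdot) = f_x$, and the reproducing property from Definition 2.20 reads $\langle k(x, \cdot), f_w \rangle_{\mathcal{H}_k} = \langle x, w \rangle = f_w(x)$.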

There is also a subset of reproducing kernels called Mercer kernels which will not be discussed here in more detail.

The already mentioned kernel trick describes the fact that the calculation of a kernel value can be substituted by an inner product in an appropriate linear feature space and vice versa. For this reason, the kernel trick enables the application of in principle arbitrary linear algorithms in feature space. Moreover, it is possible to avoid the calculation of the potentially non-linear and infinite-dimensional feature map $\Phi$ if the necessary kernel values are known or can be calculated. Hence, alternative formulations of algorithms can be generated simply by exchanging an inner product for a kernel function, or one kernel function for another [Schölkopf and Smola, 2002].
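As a small numerical sketch of the kernel trick (illustrative only; the feature map is finite-dimensional here so that it can be written out explicitly), the homogeneous polynomial kernel of degree two on $\mathbb{R}^2$ coincides with an ordinary inner product after an explicit monomial feature map:

```python
import numpy as np

# Kernel trick illustration: the homogeneous polynomial kernel of degree 2
# on R^2, k(x, x') = <x, x'>^2, equals an ordinary inner product after the
# explicit feature map Phi(x) = (x_1^2, x_2^2, sqrt(2) * x_1 * x_2).
def k_poly2(x, y):
    return (x @ y) ** 2

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(k_poly2(x, y))          # (1*3 + 2*(-1))^2 = 1
print(phi(x) @ phi(y))        # same value, computed in feature space
```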

As kernel functions correspond to feature representations in a canonical way, the multi-view scenario from Definition 2.17 is equivalent to having a number of kernel functions

$$k_1 : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \; \ldots, \; k_M : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$

available for instances from $\mathcal{X}$. Again, we realise that handling multi-view problems with kernel methods does not require explicit knowledge of the respective feature representations if one is able to calculate the kernel functions.
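A minimal sketch of this setting (the two feature representations and kernel choices are purely hypothetical) shows that each view contributes its own Gram matrix over the same set of instances, and that downstream kernel methods only require these matrices:

```python
import numpy as np

# Two hypothetical views on the same n instances: view 1 uses a linear kernel
# on one feature representation, view 2 a Gaussian kernel on another.
rng = np.random.default_rng(2)
n = 6
X1 = rng.normal(size=(n, 4))                  # feature representation Phi_1(x)
X2 = rng.normal(size=(n, 10))                 # feature representation Phi_2(x)

K1 = X1 @ X1.T                                # Gram matrix of view 1 (linear kernel)
sq = ((X2[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
K2 = np.exp(-sq / (2 * 1.0 ** 2))             # Gram matrix of view 2 (Gaussian kernel)

# Downstream algorithms only need the per-view Gram matrices K1 and K2;
# the feature representations could equally well remain implicit.
grams = {1: K1, 2: K2}
```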

The subsequent representer theorem supplies us with a convenient solution tool for a class of optimisation problems in kernel methods [Schölkopf et al., 2001, Steinwart and Christmann, 2008]. It guarantees a representation of solution functions as linear combinations of the reproducing kernel centred at the training instances. Consequently, the representer theorem is the basis for the elegant solution techniques in the context of support vector machines and related algorithms that we will consider below.

Theorem 2.21 (Representer theorem). [Schölkopf et al., 2001] We consider an instance space $\mathcal{X}$ and examples $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathbb{R}$. Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel function and $\mathcal{H}$ be the RKHS of kernel $k$ with norm $\|\cdot\|_{\mathcal{H}}$ and inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. Assume we intend to solve

$$\min_{f \in \mathcal{H}} \; c\big(y_1, f(x_1), \ldots, y_n, f(x_n)\big) + g\big(\|f\|_{\mathcal{H}}\big) \tag{2.21}$$

for an arbitrary cost function $c : (\mathbb{R} \times \mathbb{R})^n \to \mathbb{R}$ and a strictly monotonically increasing regularising function $g : \mathbb{R}^+ \to \mathbb{R}$. Then a solution $f$ of Equation 2.21 has a representation of the form

$$f(\cdot) = \sum_{i=1}^{n} \pi_i k(x_i, \cdot) \tag{2.22}$$

for appropriate $\pi_1, \ldots, \pi_n \in \mathbb{R}$.

Schölkopf et al. [2001] used the term cost function in the sense of a generalised loss function in order to express the loss sustained on the labelled examples $(x_1, y_1), \ldots, (x_n, y_n)$.

For the relation between cost and loss refer also to Steinwart and Christmann [2008].

Notice that the minimisation in Equation 2.21 is just an RRM according to Equation 2.8. The subsequent proof is a version of the proof by Schölkopf et al. [2001].

Proof. Let $f \in \mathcal{H}$ be the minimising function of the optimisation in Equation 2.21. We consider the canonical feature map of kernel $k$ with $\Phi(x) = k(x, \cdot)$ for $x \in \mathcal{X}$ and the decomposition of $\mathcal{H}$ into

$$S = \operatorname{span}\{\Phi(x_1), \ldots, \Phi(x_n)\},$$

the span of the mapped training instances, and its orthogonal complement $S^{\perp}$. We may always write $f$ as $f = f_0 + f_1$ such that $f_0 \perp f_1$ and

$$f_0(\cdot) = \sum_{i=1}^{n} \pi_i \Phi(x_i)(\cdot) \in S$$

for $\pi_1, \ldots, \pi_n \in \mathbb{R}$ and $f_1 \in S^{\perp}$. Because of the reproducing property and the orthogonality, one concludes for the function values $f_1(x_i)$ at the training instances $x_i$, $i = 1, \ldots, n$, that

$$0 = \langle f_0, f_1 \rangle_{\mathcal{H}} = \Big\langle \sum_{i=1}^{n} \pi_i k(x_i, \cdot), \, f_1 \Big\rangle_{\mathcal{H}} = \sum_{i=1}^{n} \pi_i \big\langle k(x_i, \cdot), f_1 \big\rangle_{\mathcal{H}} = \sum_{i=1}^{n} \pi_i f_1(x_i) \tag{2.23}$$

holds true for every choice of coefficients $\pi_1, \ldots, \pi_n \in \mathbb{R}$, since any such linear combination lies in $S$ and is therefore orthogonal to $f_1$. For this reason, $f_1(x_i) = 0$ holds true for $x_1, \ldots, x_n \in \mathcal{X}$ and, hence, the cost function $c$ in Equation 2.21 is unaffected by $f_1$. Furthermore, as $f$ can be decomposed into the two orthogonal functions $f_0$ and $f_1$, the norm term in the objective function can be written as

$$g\big(\|f\|_{\mathcal{H}}\big) = g\Big(\sqrt{\|f_0\|_{\mathcal{H}}^2 + \|f_1\|_{\mathcal{H}}^2}\,\Big).$$

Since $g$ is strictly monotonically increasing, a non-zero function $f_1$ would only increase the norm term in Equation 2.21 without affecting the cost term, for which reason we conclude the desired representation of $f$ as an element of $S$.

We will present all algorithms in kernelised form, i.e., we use RKHSs of appropriate kernel functions as candidate spaces. Therefore, we will omit the additional word kernel in the algorithms' names; for example, we use support vector regression instead of kernel support vector regression. Based on the representer theorem, we will formulate kernelised objectives and obtain their solutions with respect to the view predictor functions in terms of the coefficients $\pi$ of the kernel linear combination from Equation 2.22. For a Gram matrix $K \in \mathbb{R}^{n \times n}$ of kernel $k$, the vector of predictor values at the training instances is then obtained as

$$K\pi = (f(x_1), \ldots, f(x_n))^T. \tag{2.24}$$
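As a brief sketch of how this representation is used computationally (kernel ridge regression serves here only as a simple stand-in; the algorithms below differ in their objectives), the coefficients $\pi$ are obtained from a linear system in the Gram matrix, predictions at new instances follow Equation 2.22, and $K\pi$ reproduces the training predictions as in Equation 2.24:

```python
import numpy as np

# Sketch: kernel ridge regression as one instance of the representer theorem.
# The solution has the form f(.) = sum_i pi_i k(x_i, .)   (Equation 2.22),
# and K @ pi is the vector of training predictions        (Equation 2.24).
def gaussian_gram(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
K = gaussian_gram(X, X)
pi = np.linalg.solve(K + lam * np.eye(len(X)), y)   # coefficients pi_1, ..., pi_n

X_new = np.array([[0.5], [1.5]])
f_new = gaussian_gram(X_new, X) @ pi                # f(x) = sum_i pi_i k(x_i, x)
print(K @ pi)                                       # training predictions, Eq. 2.24
print(f_new)
```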

2.6 Single-View Regression