
consists in the knowledge transfer drawn from solving a task $T$ by experience $E$ to a related task $T'$ with experience $E'$. As is the case in the unsupervised learning scenario of affinity prediction considered in the present thesis, the experience $E'$ for the related task does not necessarily include the availability of labelled training data. A special case of the supervised learning scenario is denoted semi-supervised learning. In addition to labelled data, here we expect unlabelled data instances to be available, for example, molecules without a known affinity value. Although semi-supervised learning is actually a part of supervised learning, we will treat supervised learning, semi-supervised learning, and unsupervised learning independently and dedicate one of the following chapters to each scenario.

2.3 Learning Theory

Being aware of the general ideas of machine learning, we now have to work out in more detail how a model with desirable generalisation properties can be derived. According to Shalev-Shwartz and Ben-David [2014], a successful learner needs to be able to generalise from examples to unseen instances and to have prior knowledge of the scenario (inductive bias). To this aim, we will treat aspects of both statistical learning theory [Vapnik, 1999] and computational learning theory. They deal with the general problem of "Given a task $T$, experience $E$, and a performance measure $P$, how can we derive a good model?" and address questions of the kind "Having $T$, $E$, and $P$, how difficult is it to derive a good model?" With our explanations we will focus on the case of supervised and semi-supervised regression as performed in Chapters 3 and 4. In some respects, the unsupervised regression problem from Chapter 5 is transformed into a supervised learning task as well. The presented choice of loss functions, risks, and complexity measures is directed towards the regression task we intend to solve and the techniques we apply for this purpose [Schölkopf and Smola, 2002].

2.3.1 Empirical Risk Minimisation

With the aim of solving a machine learning task and finding an optimal prediction model, a loss function is typically applied to measure the appropriateness of a single model. We will use the name loss for both the function itself and the loss function's output.

Definition 2.2 (Loss function). [Schölkopf and Smola, 2002] Let $\mathcal{Y}$ be a label space. The non-negative function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is called a loss function if
$$\ell(y, y) = 0 \quad \text{for all } y \in \mathcal{Y}.$$

In particular, given a data example $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and a model $f : \mathcal{X} \to \mathcal{Y}$, the loss function $\ell(y, f(x))$ will output zero if the observation $y$ agrees with the model prediction $f(x)$. Assume there is another model $f' : \mathcal{X} \to \mathcal{Y}$. If the two predictions $f(x)$ and $f'(x)$ are equal, the loss $\ell(f(x), f'(x))$ will output zero as well. For a regression task $T$, a good model $f$ will be characterised by a preferably small distance between the prediction value $f(x)$ and the true label $y \in \mathbb{R}$ of instance $x \in \mathcal{X}$. The loss $\ell$ should return a small value in case the prediction was good, and a large value in the opposite case. Therefore, the respective loss function for regression will take the distance $|y - f(x)|$ into account. A convex loss function is a convex function $\ell$ with input $|y - f(x)|$, i.e., one which is convex in the distance between $y$ and $f(x)$. If necessary, we assume the loss functions considered in the present thesis to have this convexity property. In the following, we present the two distance-based loss functions [Steinwart and Christmann, 2008] for regression that will accompany us through the whole thesis.

Definition 2.3 (ε-insensitive and squared loss). [Steinwart and Christmann, 2008] Let $\varepsilon > 0$ be a constant and $y_1, y_2 \in \mathcal{Y}$. The ε-insensitive loss is defined as
$$\ell_\varepsilon(y_1, y_2) = \max\{0, |y_1 - y_2| - \varepsilon\}. \tag{2.3}$$
The function
$$\ell_2(y_1, y_2) = |y_1 - y_2|^2 \tag{2.4}$$
is known as squared loss.

We postulate $\varepsilon$ to be greater than zero in order to distinguish it from the absolute loss $\ell_{\mathrm{abs}}$ below, although $\varepsilon = 0$ would be possible in the definition of $\ell_\varepsilon$ as well. The squared loss function $\ell_2$ is also commonly known as least squares loss. Because of its relation to the $\ell_2$-norm we use the symbol $\ell_2$, which should not be confused with the space of sequences $\ell^2$. The squared loss continuously penalises gaps between the two inputs $y_1$ and $y_2$, whereas for the ε-insensitive loss distances between $y_1$ and $y_2$ smaller than $\varepsilon$ do not cause a loss value greater than zero. The choice of loss function influences the final properties of the respective learning algorithm. The absolute loss $\ell_{\mathrm{abs}}(y_1, y_2) = |y_1 - y_2|$ and the squared ε-insensitive loss $\ell_{2\varepsilon}(y_1, y_2) = \max\{0, |y_1 - y_2| - \varepsilon\}^2$ are further related examples of loss functions for regression.
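To make these definitions concrete, the following Python sketch implements the four regression losses just mentioned; the function names and the parameter default $\varepsilon = 0.1$ are our own illustrative choices, not taken from any library.

```python
# A minimal sketch of the four regression losses discussed above;
# all function names are illustrative, not from any library.

def eps_insensitive_loss(y1, y2, eps=0.1):
    """epsilon-insensitive loss (2.3): zero inside the eps-tube, linear outside."""
    return max(0.0, abs(y1 - y2) - eps)

def squared_loss(y1, y2):
    """Squared (least squares) loss (2.4): penalises every gap quadratically."""
    return (y1 - y2) ** 2

def abs_loss(y1, y2):
    """Absolute loss: penalises every gap linearly."""
    return abs(y1 - y2)

def squared_eps_insensitive_loss(y1, y2, eps=0.1):
    """Squared epsilon-insensitive loss: quadratic outside the eps-tube."""
    return max(0.0, abs(y1 - y2) - eps) ** 2

# A deviation smaller than eps is free under the eps-insensitive loss:
print(eps_insensitive_loss(1.0, 1.05))  # 0.0
print(squared_loss(1.0, 1.05))          # 0.0025 up to floating-point error
```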

Having defined a loss function that evaluates predictions for single instances, one would like to assess the overall quality of a prediction model. Assume data examples $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are generated via the joint distribution $D = P(x, y)$. One would be interested in the minimisation of the risk functional (or expected risk) [Vapnik, 1999, Schölkopf and Smola, 2002]
$$R(f) = \mathbb{E}_D[\ell(y, f(x))] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, f(x)) \, dP(x, y), \tag{2.5}$$

which collects and weighs the loss of all possible data tuples. However, as the underly-ing probability distribution P is unknown, the expected risk can be estimated via the empirical risk

Remp(f) = ˆE(`(y, f(x)) = 1 n

n

X

i=1

`(yi, f(xi)) (2.6) and training examples (x1, y1), . . . ,(xn, yn) drawn from distributionD. In order to find an appropriate or optimal predictor function f amongst a multitude of functions one has to fix or restrict the set of potential candidates. We call the function space H the hypothesis space or candidate space. The aim is to find the best function of the hypothesis space with respect to the empirical risk, which leads us to the empirical risk minimisation inductive principle.
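As an illustration of how Equation (2.6) is evaluated in practice, the sketch below averages the loss of a candidate model over a training sample; the model and the toy data are made-up placeholders.

```python
# Empirical risk (2.6): the average loss over n training examples.
# The candidate model f and the toy sample are illustrative placeholders.

def empirical_risk(loss, f, sample):
    """R_emp(f) = (1/n) * sum_i loss(y_i, f(x_i))."""
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)

squared = lambda y1, y2: (y1 - y2) ** 2        # least squares loss
f = lambda x: 2.0 * x + 1.0                    # a candidate f: X -> Y
sample = [(0.0, 1.1), (1.0, 2.8), (2.0, 5.2)]  # pairs (x_i, y_i) drawn from D
print(empirical_risk(squared, f, sample))      # small value -> good fit on the sample
```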

Definition 2.4 (ERM). [Schölkopf and Smola, 2002] Let $\mathcal{H}$ be a space of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$ with norm $\|\cdot\|_{\mathcal{H}}$ and $\ell$ be a loss function. The optimisation
$$\min_{f \in \mathcal{H}} R_{\mathrm{emp}}(f) = \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)) \tag{2.7}$$
is called empirical risk minimisation (ERM).

If the hypothesis space is rich and the data is noisy or does not carry sufficient information, the learning task might still be intractable or not solvable in a satisfactory manner [Vapnik, 1999, Schölkopf and Smola, 2002, Shawe-Taylor and Cristianini, 2004]. In this context, overfitting denotes the excessive adaptation of a predictor function to the observations, disregarding the true functional relationship between inputs and outputs. Such effects can, for example, be suppressed by a further limitation or restriction of the functions in the candidate set, known as regularisation. A prominent example of regularisation is the inclusion of the function norm into the objective to be optimised. The following definition is a specification of Definition 2.4.

Definition 2.5 (RRM). Let $\mathcal{H}$ be a candidate space as introduced above. Let, furthermore, $g : \mathbb{R} \to \mathbb{R}$ be a strictly monotonically increasing function. The functional
$$R_{\mathrm{reg}}(f) = g(\|f\|_{\mathcal{H}}) + \sum_{i=1}^{n} \ell(y_i, f(x_i)) \tag{2.8}$$
is called regularised empirical risk and its minimisation with respect to $f$ is called regularised risk minimisation (RRM).

The regularised risk functional $R_{\mathrm{reg}}$ is the starting point for a variety of machine learning algorithms, where the loss function $\ell$ and the regularising term $g(\|f\|_{\mathcal{H}})$ vary from approach to approach. All machine learning algorithms considered in the main chapters below follow the (regularised) ERM principle or a related approach. The details can be found in Chapters 3, 4, and 5. In Definition 2.5 we omit the factor $\frac{1}{n}$ from the empirical risk $R_{\mathrm{emp}}$ because of the flexibility of the function $g$, for example, if $g(\cdot) = \nu(\cdot)^2$ for a trade-off parameter $\nu > 0$. We will use the term error synonymously with risk.
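As a concrete illustration of Definition 2.5 (not one of the algorithms developed in this thesis), ridge regression instantiates RRM with $\mathcal{H}$ the class of linear functions, $\ell$ the squared loss, and $g(\|f\|_{\mathcal{H}}) = \nu \|f\|^2$; the sketch below uses this closed-form special case with made-up data.

```python
import numpy as np

# Ridge regression as an instance of RRM (Definition 2.5):
# minimise  nu * ||w||^2 + sum_i (y_i - <w, x_i>)^2  over linear f(x) = <w, x>.
# For this choice of g and loss the minimiser has the closed form
# w = (X^T X + nu * I)^{-1} X^T y.

def ridge_rrm(X, y, nu=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + nu * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # 50 instances, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w = ridge_rrm(X, y, nu=1.0)
print(w)  # close to the generating coefficients; nu trades fit against ||w||
```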

2.3.2 Rademacher Complexity

Although it is not known precisely, it is possible to control the expected risk of a predictor function, i.e., its ability to generalise to arbitrary data instances [Shawe-Taylor and Cristianini, 2004]. One would prefer a function class as candidate space such that for every function the difference between training and true error is small (a property known as uniform convergence).

Definition 2.6 (Empirical Rademacher complexity). [Bartlett and Mendelson, 2002] Let $x_1, \ldots, x_n \in \mathcal{X}$ be a random sample of instances drawn i.i.d. from distribution $D$ and $\mathcal{H}$ be a function class. With $\sigma = (\sigma_1, \ldots, \sigma_n)$ we denote $n$ independent and identically distributed Rademacher random variables, i.e., random variables taking the values $-1$ and $+1$ with equal probability. The empirical Rademacher complexity of $\mathcal{H}$ is defined as
$$\hat{R}_n(\mathcal{H}) = \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{H}} \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right].$$

The empirical Rademacher complexity measures the ability of a function class to fit random data. This property is also known as the capacity of a function class. Another measure of function class capacity (or complexity), one that is independent of the data's probability distribution, is the so-called Vapnik-Chervonenkis dimension. A large capacity of a function class implies a large capacity to find patterns in random noise. For this reason, a small empirical Rademacher complexity is desirable for the function class $\mathcal{H}$. In Definition 2.6, the randomness is represented by the Rademacher random variables $\sigma$. The following theorem supports the ERM principle, as the difference between expected risk and empirical risk can be controlled via the empirical Rademacher complexity.
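The expectation over $\sigma$ in Definition 2.6 can be approximated by Monte Carlo sampling. The following sketch does so for a small finite function class, where the supremum reduces to a maximum; the class, the sample, and all names are purely illustrative.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity (Definition 2.6)
# for a *finite* function class H on a fixed sample x_1, ..., x_n.
# For finite H the supremum becomes a maximum over the class members.

def empirical_rademacher(H, xs, n_draws=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(xs)
    fx = np.array([[f(x) for x in xs] for f in H])  # |H| x n matrix of values f(x_i)
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)     # Rademacher random variables
        total += (2.0 / n) * np.max(fx @ sigma)     # sup_f (2/n) sum_i sigma_i f(x_i)
    return total / n_draws

H = [lambda x, a=a: a * x for a in (-1.0, -0.5, 0.5, 1.0)]  # toy linear class
xs = np.linspace(-1.0, 1.0, 20)
print(empirical_rademacher(H, xs))  # richer classes yield larger estimates
```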

Theorem 2.7. Let $\delta \in (0, 1)$. Assume $\mathcal{H}$ is a class of functions $f : \mathcal{X} \to \mathcal{Y}$ and $\ell$ a loss function mapping into $[0, 1]$ without loss of generality. For every $f \in \mathcal{H}$,
$$\mathbb{E}_D[\ell(y, f(x))] \le \hat{\mathbb{E}}[\ell(y, f(x))] + \hat{R}_n(\mathcal{H}) + 3\sqrt{\frac{\ln(2/\delta)}{2n}}$$
holds true with probability greater than or equal to $1 - \delta$.

The theorem was proven by Shawe-Taylor and Cristianini [2004]. Applying Theorem 2.7, a bound on the empirical Rademacher complexity finally gives us the opportunity to compare different approaches with respect to their theoretical generalisation performance. An example of a bound on the empirical Rademacher complexity depending on the algorithm's parameters and the data sample can be found in Section 4.3.5.
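To get a feeling for the confidence term in Theorem 2.7, one can evaluate $3\sqrt{\ln(2/\delta)/(2n)}$ for a few sample sizes; the values of $n$ and $\delta$ below are arbitrary illustrative choices.

```python
import math

# Confidence term of Theorem 2.7: 3 * sqrt(ln(2/delta) / (2n)).
# It vanishes at rate O(1/sqrt(n)) as the sample size n grows.

def confidence_term(n, delta=0.05):
    return 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

for n in (100, 1_000, 10_000):
    print(n, round(confidence_term(n), 4))  # 0.4074, 0.1288, 0.0407
```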

2.3.3 Phases of Learning

As we have seen above, the expected risk can be bounded via the Rademacher complexity $\hat{R}_n$ and the empirical risk $R_{\mathrm{emp}}$. In the present section, we address the issue of how the learning process can be directed in practice such that the empirical risk of the learned model $f$ becomes minimal given the algorithm, the candidate space, the data, and the limitations of optimisation. Indeed, the successful accomplishment of a machine learning task $T$ requires the two phases training and testing. During training, the available information or experience $E$ is used to actually learn (or train) a model or predictor function, for example via the ERM principle. We will refer to the data available during the training phase as training data, regardless of whether the data is labelled or unlabelled, or of an entirely different kind (such as similarity values between protein targets as in Chapter 5). An optimisation procedure for the algorithm's parameters is also included in the learning phase. It is necessary to pick the best assignment of parameter values for the task at hand. In the testing phase, the learned model has to be evaluated with respect to its prediction performance using test data. In this connection, the performance measure $P$ is not always equal to the applied loss function. Via the loss function the predictor function can be equipped with desirable properties, such as sparsity in the case of the ε-insensitive loss. With respect to the performance $P$, however, one is interested in the actual discrepancy between true value and prediction (least squares loss in the case of regression). Usually, the entire available data is divided into training data and test data. In the case of supervised or semi-supervised learning, the known labels are compared with the predictions in order to calculate a performance measure.
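A minimal sketch of the training/testing division described above, assuming labelled regression data, ERM with squared loss for training, and mean squared error as performance measure $P$; the 80/20 split ratio and the synthetic data are arbitrary illustrative choices.

```python
import numpy as np

# Training and testing phases in miniature: learn on one part of the data,
# evaluate the performance measure P (here: mean squared error) on the rest.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.7]) + 0.1 * rng.normal(size=100)

perm = rng.permutation(len(X))        # shuffle before splitting
train, test = perm[:80], perm[80:]    # 80/20 split, an arbitrary choice

# Training phase: least squares fit (ERM with squared loss, no regularisation).
w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Testing phase: performance P as discrepancy between truth and prediction.
mse = np.mean((y[test] - X[test] @ w) ** 2)
print(mse)  # an estimate of the model's expected risk on unseen data
```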

In order to choose the best parameter assignment and to assess the quality of the learned model as well as possible, i.e., to give a good estimate of its expected risk, we need to

2.4 Optimisation Theory