
(1)

Machine Learning

Volker Roth

Department of Mathematics & Computer Science University of Basel

(2)

Section 7

Support Vector Machines and Kernels

(3)

Structure on canonical hyperplanes

Theorem (Vapnik, 1982)

Let $R$ be the radius of the smallest ball containing the points $x_1, \ldots, x_n$: $B_R(a) = \{x \in \mathbb{R}^d : \|x - a\| < R,\ a \in \mathbb{R}^d\}$. The set of canonical hyperplane decision functions $f(w, w_0) = \mathrm{sign}\{w^t x + w_0\}$ satisfying $\|w\| \le A$ has VC dimension $h$ bounded by
$$h \le R^2 A^2 + 1.$$

Intuitive interpretation: margin $= 1/\|w\|$
⇒ minimizing capacity($\mathcal{H}$) corresponds to maximizing the margin.

$$R[f_n] \le R_{\mathrm{emp}}[f_n] + \sqrt{\frac{a}{n}\left(\mathrm{capacity}(\mathcal{H}) + \ln\frac{b}{\delta}\right)}$$

⇒ Large margin classifiers.

(4)

SVMs

When the training examples are linearly separable we can maximize the margin by minimizing the regularization term

$$\|w\|^2/2 = \sum_{i=1}^{d} w_i^2/2$$

subject to the classification constraints $y_i[x_i^t w] - 1 \ge 0,\ i = 1, \ldots, n$.

[Figure: linearly separable points (class x vs. class o) with the maximum-margin hyperplane and its normal vector w.]

The solution is defined only on the basis of a subset of examples or support vectors.
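As an illustration only (not from the slides), the hard-margin solution can be approximated numerically with scikit-learn's soft-margin solver by choosing a very large penalty C; the toy data and all parameter values below are made up. Note that SVC also fits an intercept $w_0$, which the constraint above omits.

```python
# Minimal sketch: approximate the hard-margin SVM with a soft-margin solver and large C.
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters, labels -1 / +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(20, 2)),
               rng.normal(loc=+2.0, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6)        # very large C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]                          # weight vector defining the hyperplane
print("margin ~ 1/||w|| =", 1.0 / np.linalg.norm(w))
print("support vector indices:", clf.support_)
```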

(5)

SVMs: nonseparable case

Modify optimization problem slightly by adding a penalty for violating the classification constraints:

$$\text{minimize } \|w\|^2/2 + C \sum_{i=1}^{n} \xi_i$$

subject to relaxed constraints

$$y_i[x_i^t w] - 1 + \xi_i \ge 0,\ i = 1, \ldots, n.$$

[Figure: two overlapping classes (x vs. o) with the separating hyperplane w; points violating the margin incur slack.]

The $\xi_i \ge 0$ are called slack variables.

(6)

SVMs: nonseparable case

We can also write the SVM optimization problem more compactly as
$$\overbrace{C \sum_{i=1}^{n} (1 - y_i[x_i^t w])_+}^{C \sum_{i=1}^{n} \xi_i} + \|w\|^2/2,$$
where $(z)_+ = z$ if $z \ge 0$ and zero otherwise.

This is equivalent to regularized empirical loss minimization
$$\underbrace{\frac{1}{n} \sum_{i=1}^{n} (1 - y_i[x_i^t w])_+}_{R_{\mathrm{emp}}} + \lambda \|w\|^2,$$
where $\lambda = 1/(2nC)$ is the regularization parameter.
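A minimal numpy sketch (not part of the slides) of evaluating this regularized empirical hinge-loss objective; the data and the value of λ are made up.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Regularized empirical hinge loss: mean (1 - y_i x_i^t w)_+ + lam * ||w||^2."""
    margins = y * (X @ w)                    # y_i [x_i^t w]
    hinge = np.maximum(0.0, 1.0 - margins)   # (1 - z)_+
    return hinge.mean() + lam * np.dot(w, w)

# Hypothetical example values
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
print(svm_objective(np.zeros(3), X, y, lam=0.1))  # zero classifier: hinge loss = 1.0
```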

(7)

SVMs and LOGREG

When viewed from the point of view of regularized empirical loss minimization, SVM and logistic regression appear quite similar:

SVM: $\quad \dfrac{1}{n} \sum_{i=1}^{n} (1 - y_i[x_i^t w])_+ + \lambda \|w\|^2$

LOGREG: $\quad \dfrac{1}{n} \sum_{i=1}^{n} -\log \overbrace{\sigma(y_i[x_i^t w])}^{P(y_i|x_i,w)} + \lambda \|w\|^2$,

where $\sigma(z) = (1 + e^{-z})^{-1}$ is the logistic function.

(8)

SVMs and LOGREG

The difference comes from how we penalize errors:

Both: $\quad \dfrac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(\overbrace{y_i[x_i^t w]}^{z}) + \lambda \|w\|^2$,

SVM: $\mathrm{Loss}(z) = (1 - z)_+$
LOGREG: $\mathrm{Loss}(z) = \log(1 + \exp(-z))$

[Figure: hinge loss and logistic loss plotted as functions of $z$.]
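For a quick numerical comparison of the two penalties (illustration only), one can evaluate both losses on a few margin values z:

```python
import numpy as np

z = np.linspace(-3, 3, 7)                # margin values y_i [x_i^t w]
hinge = np.maximum(0.0, 1.0 - z)         # SVM: (1 - z)_+
logistic = np.log1p(np.exp(-z))          # LOGREG: log(1 + exp(-z))

for zi, h, l in zip(z, hinge, logistic):
    print(f"z = {zi:+.1f}   hinge = {h:.3f}   logistic = {l:.3f}")
```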

(9)

SVMs: solution, Lagrange multipliers

Back to the separable case: how do we solve

$$\text{minimize}_{w}\ \|w\|^2/2 \quad \text{s.t.}\quad y_i[x_i^t w] - 1 \ge 0,\ i = 1, \ldots, n.$$

Represent the constraints as individual loss terms:
$$\sup_{\alpha_i \ge 0} \alpha_i(1 - y_i[x_i^t w]) = \begin{cases} 0, & \text{if } y_i[x_i^t w] - 1 \ge 0,\\ \infty, & \text{otherwise.} \end{cases}$$

Rewrite the minimization problem:
$$\text{minimize}_w\ \|w\|^2/2 + \sum_{i=1}^{n} \sup_{\alpha_i \ge 0} \alpha_i(1 - y_i[x_i^t w]) = \text{minimize}_w\ \sup_{\alpha_i \ge 0}\left(\|w\|^2/2 + \sum_{i=1}^{n} \alpha_i(1 - y_i[x_i^t w])\right)$$

(10)

SVMs: solution, Lagrange multipliers

Swap maximization and minimization (technically this requires that the problem is convex and feasible ⇒ Slater's condition):

$$\text{minimize}_w \left[\sup_{\alpha_i \ge 0}\left(\|w\|^2/2 + \sum_{i=1}^{n} \alpha_i(1 - y_i[x_i^t w])\right)\right] = \text{maximize}_{\alpha_i \ge 0}\ \min_w \underbrace{\left(\|w\|^2/2 + \sum_{i=1}^{n} \alpha_i(1 - y_i[x_i^t w])\right)}_{J(w;\alpha)}$$

We have to minimize $J(w;\alpha)$ over parameters $w$ for fixed Lagrange multipliers $\alpha_i \ge 0$.

Simple, because $J(w;\alpha)$ is convex ⇒ set the derivative to zero ⇒ only one stationary point ⇒ global minimum.

(11)

SVMs: solution, Lagrange multipliers

Find optimal $w$ by setting the derivatives to zero:
$$\partial_w J(w;\alpha) = w - \sum_{i} \alpha_i y_i x_i = 0 \quad\Rightarrow\quad \hat{w} = \sum_{i} \alpha_i y_i x_i.$$

Substitute the solution back into the objective and get (after some re-arrangements of terms):

$$\max_{\alpha_i \ge 0} \min_{w}\left(\|w\|^2/2 + \sum_{i=1}^{n} \alpha_i(1 - y_i[x_i^t w])\right) = \max_{\alpha_i \ge 0}\left(\|\hat{w}\|^2/2 + \sum_{i=1}^{n} \alpha_i(1 - y_i[x_i^t \hat{w}])\right) = \max_{\alpha_i \ge 0}\left(\sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j x_i^t x_j\right)$$

(12)

SVMs: summary

Find optimal Lagrange multipliers $\hat{\alpha}_i$ by maximizing
$$\sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j x_i^t x_j \quad\text{subject to } \alpha_i \ge 0.$$

Only the $\hat{\alpha}_i$'s corresponding to support vectors will be non-zero.

Make predictions on any new example x according to:

$$\mathrm{sign}(x^t \hat{w}) = \mathrm{sign}\left(x^t \sum_{i=1}^{n} \hat{\alpha}_i y_i x_i\right) = \mathrm{sign}\left(\sum_{i \in SV} \hat{\alpha}_i y_i x^t x_i\right).$$

Observation: dependency on input vectors only via dot products.

Later we will introduce the kernel trick for efficiently computing these dot products in implicitly defined feature spaces.
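For illustration (not part of the slides): scikit-learn's SVC solves this dual problem internally and exposes the support vectors and the products $\hat\alpha_i y_i$ as dual_coef_; the toy data is made up, and note that SVC additionally fits an intercept term not present in the formulation above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.5, 1.0, size=(30, 2)),
               rng.normal(+1.5, 1.0, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
print("number of support vectors:", len(clf.support_))
print("alpha_i * y_i:", clf.dual_coef_[0])

# Predictions depend on the data only through dot products x^t x_i
x_new = np.array([0.3, -0.2])
score = np.sum(clf.dual_coef_[0] * (clf.support_vectors_ @ x_new)) + clf.intercept_[0]
print("sign of decision value:", np.sign(score))
```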

(13)

SVMs: formal derivation

Convex optimization problem: an optimization problem
$$\text{minimize } f(x) \qquad (1)$$
$$\text{subject to } g_i(x) \le 0,\ i = 1, \ldots, m \qquad (2)$$
is convex if the functions $f, g_1, \ldots, g_m : \mathbb{R}^n \to \mathbb{R}$ are convex.

The Lagrangian function for the problem is
$$L(x, \lambda_0, \ldots, \lambda_m) = \lambda_0 f(x) + \lambda_1 g_1(x) + \ldots + \lambda_m g_m(x).$$

Karush-Kuhn-Tucker (KKT) conditions: For each point $\hat{x}$ that minimizes $f$, there exist real numbers $\lambda_0, \ldots, \lambda_m$, called Lagrange multipliers, that simultaneously satisfy:

1. $\hat{x}$ minimizes $L(x, \lambda_0, \lambda_1, \ldots, \lambda_m)$,
2. $\lambda_0 \ge 0, \lambda_1 \ge 0, \ldots, \lambda_m \ge 0$, with at least one $\lambda_k > 0$,
3. Complementary slackness: $g_i(\hat{x}) < 0 \Rightarrow \lambda_i = 0$, $1 \le i \le m$.

(14)

SVMs: formal derivation

Slater's condition: If there exists a strictly feasible point $z$ satisfying $g_1(z) < 0, \ldots, g_m(z) < 0$, then one can set $\lambda_0 = 1$.

Assume that Slater's condition holds. Minimizing the supremum $L(x) = \sup_{\lambda \ge 0} L(x, \lambda)$ is the primal problem P:
$$\hat{x} = \operatorname{argmin}_x L(x).$$

Note that
$$L(x) = \sup_{\lambda \ge 0}\left(f(x) + \sum_{i=1}^{m} \lambda_i g_i(x)\right) = \begin{cases} f(x), & \text{if } g_i(x) \le 0\ \forall i,\\ \infty, & \text{else.} \end{cases}$$

Minimizing $L(x)$ is equivalent to minimizing $f(x)$.

The maximizer of the dual problem D is
$$\hat{\lambda} = \operatorname{argmax}_{\lambda} L(\lambda), \quad\text{where } L(\lambda) = \inf_x L(x, \lambda).$$

(15)

SVMs: formal derivation

The non-negative number min(P) − max(D) is the duality gap.

Convexity and Slater's condition imply strong duality:

1. The optimal solution $(\hat{x}, \hat{\lambda})$ is a saddle point of $L(x, \lambda)$,
2. The duality gap is zero.

Discussion: For any real function $f(a,b)$, $\min_a[\max_b f(a,b)] \ge \max_b[\min_a f(a,b)]$. Equality ⇔ a saddle value exists.

[Figure: saddle point illustration. By Nicoguaro, Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=20570051]

(16)

Kernel functions

A kernel function is a real-valued function of two arguments, $k(x, x') \in \mathbb{R}$, for $x, x' \in \mathcal{X}$.

Typically the function is symmetric, and sometimes non-negative.

In the latter case, it might be interpreted as a measure of similarity.

Example: isotropic Gaussian kernel:
$$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$

Here, $\sigma^2$ is the bandwidth. This is an example of a radial basis function (RBF) kernel (only a function of $\|x - x'\|^2$).
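A small numpy sketch (illustration only) of building the Gram matrix of the isotropic Gaussian kernel for a set of made-up inputs:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(5, 3))   # made-up inputs
K = rbf_kernel_matrix(X, sigma=2.0)
print(np.allclose(K, K.T), np.diag(K))             # symmetric, diagonal of ones
```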

(17)

Mercer kernels

A symmetric kernel is a Mercer kernel iff the Gram matrix
$$K = \begin{pmatrix} k(x_1, x_1) & \ldots & k(x_1, x_n)\\ \vdots & & \vdots\\ k(x_n, x_1) & \ldots & k(x_n, x_n) \end{pmatrix}$$
is positive semidefinite for any set of inputs $\{x_1, \ldots, x_n\}$.

Mercer's theorem: eigenvector decomposition
$$K = V \Lambda V^t = (V \Lambda^{1/2})(V \Lambda^{1/2})^t =: \Phi \Phi^t.$$

Eigenvectors: columns of $V$. Eigenvalues: entries of the diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. Note that $\lambda_i \in \mathbb{R}$ and $\lambda_i \ge 0$.

Define $\phi(x_i)^t = i$-th row of $\Phi = V_{[i\bullet]} \Lambda^{1/2}$ ⇒ $k(x_i, x_{i'}) = \phi(x_i)^t \phi(x_{i'})$.

Entries of $K$: inner products of some feature vectors, implicitly defined by the eigenvectors $V$.
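The eigendecomposition argument can be checked numerically; the Gram matrix below is made up for the sketch.

```python
import numpy as np

# Eigendecomposition of a (made-up) Gram matrix: K = V Lambda V^t = Phi Phi^t
K = np.array([[2.0, 1.0, 0.5],
              [1.0, 2.0, 1.0],
              [0.5, 1.0, 2.0]])

lam, V = np.linalg.eigh(K)                          # eigenvalues and orthonormal eigenvectors
print("eigenvalues >= 0:", np.all(lam >= -1e-10))   # Mercer condition: positive semidefinite

Phi = V * np.sqrt(np.clip(lam, 0, None))            # rows phi(x_i)^t = V[i,:] Lambda^{1/2}
print(np.allclose(Phi @ Phi.T, K))                  # recovers K_ij = phi(x_i)^t phi(x_j)
```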

(18)

Mercer kernels

If the kernel is Mercer, then there exists $\phi : x \to \mathbb{R}^d$ such that
$$k(x, x') = \phi(x)^t \phi(x'),$$
where $\phi$ depends on the eigenfunctions of $k$ ($d$ might be infinite).

Example: polynomial kernel $k(x, x') = (1 + x^t x')^m$.
The corresponding feature vector contains terms up to degree $m$.

Example: $m = 2$, $x \in \mathbb{R}^2$:
$$(1 + x^t x')^2 = 1 + 2x_1 x_1' + 2x_2 x_2' + (x_1 x_1')^2 + (x_2 x_2')^2 + 2 x_1 x_1' x_2 x_2'.$$
Thus,
$$\phi(x) = [1, \sqrt{2}\, x_1, \sqrt{2}\, x_2, x_1^2, x_2^2, \sqrt{2}\, x_1 x_2]^t.$$
Equivalent to working in a 6-dim feature space.

Gaussian kernel: feature map lives in an infinite dimensional space.
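A quick numerical check (illustration only) that the explicit 6-dimensional feature map reproduces the degree-2 polynomial kernel; the two points are made up.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (1 + x^t x')^2, x in R^2."""
    return np.array([1.0,
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1]])

x  = np.array([0.5, -1.0])     # made-up points
xp = np.array([2.0,  0.3])

k_direct  = (1.0 + x @ xp) ** 2
k_feature = phi(x) @ phi(xp)
print(k_direct, k_feature, np.isclose(k_direct, k_feature))   # both agree
```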

(19)

Kernels for documents

In document classification or retrieval, we want to compare two documents, $x_i$ and $x_{i'}$.

Bag-of-words representation: $x_{ij}$ is the number of times word $j$ occurs in document $i$. One possible choice: cosine similarity:
$$k(x_i, x_{i'}) = \frac{x_i^t x_{i'}}{\|x_i\| \|x_{i'}\|} =: \phi(x_i)^t \phi(x_{i'}).$$

Problems:

▶ Popular words (like "the" or "and") are not discriminative ⇒ remove these stop words.
▶ Bias: once a word is used in a document, it is very likely to be used again.

Solution: Replace word counts with “normalized” representation.

(20)

Kernels for documents

TF-IDF ("term frequency - inverse document frequency"):

Term frequency is a log-transform of the count:
$$\mathrm{tf}(x_{ij}) = \log(1 + x_{ij})$$

Inverse document frequency:
$$\mathrm{idf}(j) = \log \frac{\#(\text{documents})}{\#(\text{documents containing term } j)} = \log \frac{1}{\hat{p}_j}.$$

Shannon information content: idf is a measure of how much information a word provides.

Combine with tf counts weighted by information content:
$$\text{tf-idf}(x_i) = [\mathrm{tf}(x_{ij}) \cdot \mathrm{idf}(j)]_{j=1}^{V}, \quad\text{where } V = \text{size of the vocabulary}.$$

We then use this inside the cosine similarity measure. With $\phi(x) = \text{tf-idf}(x)$:
$$k(x_i, x_{i'}) = \frac{\phi(x_i)^t \phi(x_{i'})}{\|\phi(x_i)\| \|\phi(x_{i'})\|}.$$
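A compact numpy sketch (not from the slides) of the resulting tf-idf cosine kernel, starting from a made-up document-term count matrix; the helper name tfidf_cosine_kernel is hypothetical.

```python
import numpy as np

def tfidf_cosine_kernel(counts):
    """Cosine-similarity kernel on tf-idf vectors, from a document-term count matrix."""
    n_docs = counts.shape[0]
    tf = np.log1p(counts)                              # tf(x_ij) = log(1 + x_ij)
    df = np.count_nonzero(counts, axis=0)              # documents containing term j
    idf = np.log(n_docs / np.maximum(df, 1))           # idf(j); guard against unseen terms
    phi = tf * idf                                     # tf-idf feature vectors
    norms = np.linalg.norm(phi, axis=1, keepdims=True)
    phi = phi / np.maximum(norms, 1e-12)               # normalize -> cosine similarity
    return phi @ phi.T

# Made-up counts: 3 documents, 4 vocabulary terms
counts = np.array([[2, 0, 1, 0],
                   [0, 3, 1, 0],
                   [1, 1, 0, 4]])
print(tfidf_cosine_kernel(counts))
```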

(21)

String kernels

Real power of kernels arises for structured input objects.

Consider two strings $x$ and $x'$ of lengths $d$, $d'$, over alphabet $\mathcal{A}$.

Idea: define similarity as the number of common substrings.

If $s$ is a substring of $x$ ⇒ $\phi_s(x)$ = number of times $s$ appears in $x$.

String kernel
$$k(x, x') = \sum_{s \in \mathcal{A}^*} w_s\, \phi_s(x)\, \phi_s(x'),$$
where $w_s \ge 0$ and $\mathcal{A}^*$ = set of all strings (any length) from $\mathcal{A}$.

One can show: Mercer kernel, can be computed in $O(|x| + |x'|)$ time using suffix trees (Shawe-Taylor and Cristianini, 2004).

Special case: $w_s = 0$ for $|s| > 1$: bag-of-characters kernel:
$\phi(x)$ is the number of times each character in $\mathcal{A}$ occurs in $x$.
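A naive sketch (illustration only) of this substring kernel with all weights $w_s = 1$ up to a maximum length; it runs in roughly quadratic time, not the linear-time suffix-tree algorithm cited above, and the example strings are made up.

```python
from collections import Counter

def substring_kernel(x, y, max_len=1):
    """Naive string kernel: count common substrings up to length max_len (w_s = 1).
    max_len=1 gives the bag-of-characters special case."""
    def substrings(s):
        return Counter(s[i:i + l] for l in range(1, max_len + 1)
                       for i in range(len(s) - l + 1))
    cx, cy = substrings(x), substrings(y)
    return sum(cx[s] * cy[s] for s in cx.keys() & cy.keys())

print(substring_kernel("GSAQVKGHGKK", "GKKVADALTNA"))            # bag of characters
print(substring_kernel("GSAQVKGHGKK", "GKKVADALTNA", max_len=3)) # substrings up to length 3
```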

(22)

The kernel trick

Idea: modify the algorithm so that it replaces all inner products $x^t x'$ with a call to the kernel function $k(x, x')$.

Kernelized ridge regression: $\hat{w} = (X^t X + \lambda I)^{-1} X^t y$.

Matrix inversion lemma: $(I + UV)^{-1} U = U (I + VU)^{-1}$

Define new variables $\alpha_i$:
$$\hat{w} = (X^t X + \lambda I)^{-1} X^t y = X^t \underbrace{(X X^t + \lambda I)^{-1} y}_{\hat{\alpha}} = \sum_{i=1}^{n} \hat{\alpha}_i x_i.$$

⇒ The solution is a linear sum of the $n$ training vectors.

(23)

The kernel trick

Use this and the kernel trick to make predictions for x:

ˆf(x) = ˆwtx =

n

X

i=1

ˆ αixtix =

n

X

i=1

ˆ

αik(xi,x).

Same for SVMs:

ˆ

wtx = X

i∈SV

ˆ

αiyixtix = X

i∈SV

ˆ

α0ik(xi,x) ...and for most other classical algorithms in ML!
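A minimal numpy sketch of kernelized ridge regression with an RBF kernel (illustration only; the sin(x)/x data mirrors the later example and all parameter values are made up):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

# Made-up 1-D regression data
rng = np.random.default_rng(3)
X = rng.uniform(-10, 10, size=(30, 1))
y = np.sin(X[:, 0]) / X[:, 0]

lam = 0.1
K = rbf(X, X, sigma=2.0)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # alpha = (K + lambda I)^{-1} y

X_test = np.linspace(-10, 10, 5)[:, None]
f_hat = rbf(X_test, X, sigma=2.0) @ alpha               # f(x) = sum_i alpha_i k(x_i, x)
print(f_hat)
```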

(24)

Some applications in bioinformatics

Bioinformatics: often non-vectorial data types:

▶ interaction graphs
▶ phylogenetic trees
▶ strings (e.g. GSAQVKGHGKKVADALTNAVAHV)

Data fusion: convert data of each type into a kernel matrix
⇒ fuse kernel matrices
⇒ "common language" for heterogeneous data.

(25)

RBF kernels from expression data

Measurements (for each gene): vector of expression values under different experimental conditions.

"Classical" RBF kernel: $k(x_1, x_2) = \exp(-\sigma \|x_1 - x_2\|^2)$

(26)

Diffusion kernels from interaction-graphs

$A$: adjacency matrix, $D$: node degrees, $L = D - A$.

$$K := \frac{1}{Z(\beta)} \exp(-\beta L) \quad\text{with transition probability } \beta.$$

Physical interpretation (random walk): randomly choose the next node among the neighbors.
Self-transition occurs with probability $1 - d_i \beta$.
$K_{ij}$: probability for a walk from $i$ to $j$.

(Kondor and Lafferty, 2002)
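A small sketch (not from the slides) of a diffusion kernel computed with scipy's matrix exponential on a made-up adjacency matrix; the normalization Z(β) from the slide is omitted here.

```python
import numpy as np
from scipy.linalg import expm

# Made-up interaction graph on 4 nodes, given by its adjacency matrix
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))     # node degrees
L = D - A                      # graph Laplacian
beta = 0.5

K = expm(-beta * L)            # (unnormalized) diffusion kernel
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) >= -1e-10))  # symmetric, PSD
```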

(27)

Alignment kernels from sequences

Alignment with Pair HMMs ⇒ Mercer kernel (Watkins, 2000).

Image source: Durbin, Eddy, Krogh, Mitchison. Biological Sequence Analysis. Cambridge.

(28)

Combination of heterogeneous data

Adding kernels ⇒ new kernel:
$$k_1(x, y) = \phi_1(x) \cdot \phi_1(y), \quad k_2(x, y) = \phi_2(x) \cdot \phi_2(y) \quad\Rightarrow\quad k' = k_1 + k_2 = \begin{pmatrix} \phi_1(x)\\ \phi_2(x) \end{pmatrix} \cdot \begin{pmatrix} \phi_1(y)\\ \phi_2(y) \end{pmatrix}$$

Fusion & relevance determination: kernel combinations
$$K = c_1 K_1 + c_2 K_2 + c_3 K_3 + c_4 K_4$$
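A tiny illustration (not from the slides) of such a weighted combination of Gram matrices; the matrices and weights are made up.

```python
import numpy as np

def combine_kernels(kernel_matrices, weights):
    """Weighted sum of Gram matrices: K = sum_m c_m K_m (c_m >= 0 keeps K PSD)."""
    K = np.zeros_like(kernel_matrices[0])
    for c, Km in zip(weights, kernel_matrices):
        K += c * Km
    return K

# Made-up example: two small Gram matrices from different data sources
K1 = np.array([[1.0, 0.2], [0.2, 1.0]])
K2 = np.array([[1.0, 0.8], [0.8, 1.0]])
print(combine_kernels([K1, K2], weights=[0.7, 0.3]))
```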

(29)

Section 8

Gaussian Processes: probabilistic kernel models

(30)

Overview

The use of the Gaussian distribution in ML:

▶ Properties of the multivariate Gaussian distribution
▶ Random variables → random vectors → stochastic processes
▶ Gaussian processes for regression
▶ Model selection
▶ Gaussian processes for classification
▶ Relation to kernel models (e.g. SVMs)
▶ Relation to neural networks

(31)

Kernel Ridge Regression

Kernelized ridge regression: $\hat{w} = (X^t X + \lambda I)^{-1} X^t y$.

Matrix inversion lemma: $(I + UV)^{-1} U = U (I + VU)^{-1}$

Define new variables $\alpha_i$:
$$\hat{w} = (X^t X + \lambda I)^{-1} X^t y = X^t \underbrace{(X X^t + \lambda I)^{-1} y}_{\hat{\alpha}} = \sum_{i=1}^{n} \hat{\alpha}_i x_i.$$

Predictions for new $x$:
$$\hat{f}(x) = \hat{w}^t x = \sum_{i=1}^{n} \hat{\alpha}_i x_i^t x = \sum_{i=1}^{n} \hat{\alpha}_i k(x_i, x).$$

(32)

Kernel Ridge Regression

[Figure: kernel ridge regression fit of $f(x) = \sin(x)/x$.]

Kernel function: $k(x_i, x_j) = \exp(-\frac{1}{2l^2}\|x_i - x_j\|^2)$

(33)

How can we make use of the Gaussian distribution?

[Figure: samples from a 2D Gaussian distribution (scatter of $y_1$ vs. $y_2$) and the corresponding density surface.]

Is it possible to fit a nonlinear regression line with the "boring" Gaussian distribution?

Yes, but we need to introduce the concept of Gaussian Processes!

(34)

The 2D Gaussian distribution

2D Gaussian: $P(y; \mu = 0, \Sigma = K) = \frac{1}{2\pi\sqrt{|K|}} \exp(-\frac{1}{2} y^t K^{-1} y)$

Covariance (also written "co-variance") is a measure of how much two random variables vary together:

+1: perfect linear coherence,
−1: perfect negative linear coherence,
0: no linear coherence.

[Figure: scatter plots of samples $(y_1, y_2)$ for four covariance matrices: $K = \begin{pmatrix} 1 & 0\\ 0 & 1 \end{pmatrix}$, $K = \begin{pmatrix} 1.0 & 0.5\\ 0.5 & 1.0 \end{pmatrix}$, $K = \begin{pmatrix} 1.00 & 0.95\\ 0.95 & 1.00 \end{pmatrix}$, $K = \begin{pmatrix} 1.00 & -0.8\\ -0.8 & 1.00 \end{pmatrix}$.]





(35)

Properties of the Multivariate Gaussian distribution

$y \sim \mathcal{N}(\mu, K)$. Let $y = \begin{pmatrix} y_1\\ y_2 \end{pmatrix}$ and $K = \begin{pmatrix} K_{11} & K_{12}\\ K_{21} & K_{22} \end{pmatrix}$. Then $y_1 \sim \mathcal{N}(\mu_1, K_{11})$ and $y_2 \sim \mathcal{N}(\mu_2, K_{22})$.

[Figure: samples and marginal densities for $K = \begin{pmatrix} 0.75 & -0.2\\ -0.2 & 0.25 \end{pmatrix}$.]

Marginals of Gaussians are again Gaussian!

(36)

Properties of the Multivariate Gaussian distribution (2)

$y \sim \mathcal{N}(\mu, K)$. Let $y = \begin{pmatrix} y_1\\ y_2 \end{pmatrix}$ and $K = \begin{pmatrix} K_{11} & K_{12}\\ K_{21} & K_{22} \end{pmatrix}$. Then
$$y_2 \mid y_1 \sim \mathcal{N}\big(\mu_2 + K_{21} K_{11}^{-1} (y_1 - \mu_1),\ K_{22} - K_{21} K_{11}^{-1} K_{12}\big).$$

[Figure: conditional density of a 2D Gaussian.]

Conditionals of Gaussians are again Gaussian!
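A small numpy sketch (illustration only) of the Gaussian conditioning formula; the helper gaussian_conditional and the 2D example values are made up.

```python
import numpy as np

def gaussian_conditional(mu, K, y1, d1):
    """Parameters of y2 | y1 for a joint Gaussian, with y1 the first d1 components.
    Returns the conditional mean and covariance."""
    mu1, mu2 = mu[:d1], mu[d1:]
    K11, K12 = K[:d1, :d1], K[:d1, d1:]
    K21, K22 = K[d1:, :d1], K[d1:, d1:]
    cond_mean = mu2 + K21 @ np.linalg.solve(K11, y1 - mu1)
    cond_cov = K22 - K21 @ np.linalg.solve(K11, K12)
    return cond_mean, cond_cov

# Made-up 2D example: condition on y_1 = 1
mu = np.zeros(2)
K = np.array([[1.0, 0.95],
              [0.95, 1.0]])
m, S = gaussian_conditional(mu, K, y1=np.array([1.0]), d1=1)
print("mean of y2 | y1=1:", m, " variance:", S)
```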

(37)

2D Gaussians: a new visualization

top left: mean and ±std.dev. of $p(y_2 \mid y_1 = 1)$.

bottom left: $p(y_2 \mid y_1 = 1)$ and samples drawn from it.

top right: x-axis: indices (1, 2) of the dimensions, y-axis: density in each component. Shown are $y_1 = 1$ and the conditional mean of $p(y_2 \mid y_1 = 1)$ and std.dev.

bottom right: samples drawn from the above model.

(38)

Visualizing high-dimensional Gaussians

top left: 6 samples drawn from a 5-dimensional Gaussian with zero mean (indicated by the blue line) and $\sigma = 1$ (magenta line).

bottom left: conditional mean and std.dev. of $p(y_4, y_5 \mid y_1 = -1, y_2 = 0, y_3 = 0.5)$.

top right: contour lines of $p(y_4, y_5 \mid y_1 = -1, y_2 = 0, y_3 = 0.5)$.

bottom right: samples drawn from the above model.

(39)

From covariance matrices to Gaussian processes

top left: 8 samples, 6 dimensions. x-axis: dimension indices.

bottom left: the 8 samples, viewed as values $y = f(x)$.

Construction: choose 6 input points $x_i$ at random ⇒ build covariance matrix $K$ with covariance function $k(x, x') = \exp(-\frac{1}{2l^2}\|x - x'\|^2)$ ⇒ draw $f \sim \mathcal{N}(0, K)$ ⇒ plot as a function of the inputs.

top right: same for 12 inputs.
bottom right: 100 inputs.

(40)

This looks similar to Kernel Regression...

[Figure: kernel regression fit of $f(x) = \sin(x)/x$.]

(41)

Gaussian Processes

Gaussian Random Variable (RV): $f \sim \mathcal{N}(\mu, \sigma^2)$.

Gaussian Random Vector: collection of $n$ RVs, characterized by mean vector and covariance matrix: $f \sim \mathcal{N}(\mu, \Sigma)$.

Gaussian Process: infinite Gaussian random vector, every finite subset of which is jointly Gaussian distributed.

Continuous index, e.g. time $t$ ⇒ function $f(t)$.

Fully specified by the mean function $m(t) = E[f(t)]$ and the covariance function $k(t, t') = E[(f(t) - m(t))(f(t') - m(t'))]$.

In ML, we will focus on more general index sets $x \in \mathbb{R}^d$ with mean function $m(x)$ and covariance function $k(x, x')$:
$$f(x) \sim \mathcal{GP}(m(x), k(x, x')).$$

(42)

Visualizing Gaussian Processes: Sampling

Problem: working with infinite vectors and covariance matrices is not very intuitive...

Solution: evaluate the GP at a set of $n$ discrete times (or input vectors $x \in \mathbb{R}^d$), as sketched in the code below:

▶ choose $n$ input points $x_i$ at random ⇒ matrix $X$
▶ build covariance matrix $K(X, X)$ with covariance function $k(x_i, x_j)$
▶ sample realizations of the Gaussian random vector $f \sim \mathcal{N}(0, K(X, X))$
▶ plot $f$ as a function of the inputs.
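A minimal numpy sketch of this sampling procedure (illustration only; the squared-exponential covariance and all parameter values are assumptions):

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    """Squared-exponential covariance k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * length_scale**2))

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 7, size=(100, 1)), axis=0)   # n random input points
K = rbf(X, X, length_scale=1.0)                         # covariance matrix K(X, X)

# Sample realizations f ~ N(0, K(X, X)); jitter keeps the covariance numerically PSD
f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)), size=3)
print(f.shape)   # 3 sampled functions, each evaluated at the 100 inputs (plot f vs. X)
```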

(43)

This is exactly what we have done here...

[Figure: samples from the GP prior, as constructed on the previous slides.]

(44)

From the Prior to the Posterior

GP defines a distribution over functions ⇒ $f$ evaluated at the training points $X$ and $f_*$ evaluated at the test points $X_*$ are jointly Gaussian:
$$\begin{bmatrix} f\\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X, X) & K(X, X_*)\\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)$$

Posterior $p(f_* \mid X_*, X, f)$: conditional of a Gaussian distribution.

Let $x \sim \mathcal{N}(\mu, K)$. Let $x = \begin{pmatrix} x_1\\ x_2 \end{pmatrix}$ and $K = \begin{pmatrix} K_{11} & K_{12}\\ K_{21} & K_{22} \end{pmatrix}$. Then $x_2 \mid x_1 \sim \mathcal{N}(\mu_2 + K_{21} K_{11}^{-1}(x_1 - \mu_1),\ K_{22} - K_{21} K_{11}^{-1} K_{12})$.

$$f_* \mid X_*, X, f \sim \mathcal{N}\big(K(X_*, X) (K(X, X))^{-1} f,\ K(X_*, X_*) - K(X_*, X)(K(X, X))^{-1} K(X, X_*)\big)$$

For only one test case:
$$f_* \mid x_*, X, f \sim \mathcal{N}(k_*^t K^{-1} f,\ k_{**} - k_*^t K^{-1} k_*)$$

(45)

A simple extension: noisy observations

Assume we have access only to noisy versions of the function values:
$$y = f(x) + \eta, \quad \eta \sim \mathcal{N}(0, \sigma^2)$$
(cf. initial example of ridge regression).

Noise $\eta$ does not depend on the data!

Covariance of the noisy observations $y$ is the sum of the covariance of $f$ and the noise variance: $\mathrm{cov}(y) = K(X, X) + \sigma^2 I$.

$$\begin{bmatrix} y\\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, X_*)\\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)$$

$$f_* \mid X_*, X, y \sim \mathcal{N}\big(K(X_*, X)(K(X, X) + \sigma^2 I)^{-1} y,\ K(X_*, X_*) - K(X_*, X)(K(X, X) + \sigma^2 I)^{-1} K(X, X_*)\big)$$

$$f_* \mid x_*, X, y \sim \mathcal{N}(k_*^t (K + \sigma^2 I)^{-1} y,\ k_{**} - k_*^t (K + \sigma^2 I)^{-1} k_*)$$

⇒ Posterior mean is the solution of kernel ridge regression!
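A compact numpy sketch (illustration only) of these posterior formulas for noisy observations; the sin(x)/x data mirrors the example on the next slide and the parameter values are made up.

```python
import numpy as np

def rbf(A, B, l=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * l**2))

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(11, 1))                   # training inputs
y = np.sin(X[:, 0]) / X[:, 0] + rng.normal(0, 0.1, 11)   # noisy targets
X_star = np.linspace(-10, 10, 100)[:, None]              # test inputs

sigma2 = 0.01
K = rbf(X, X)                      # K(X, X)
K_s = rbf(X_star, X)               # K(X*, X)
K_ss = rbf(X_star, X_star)         # K(X*, X*)

A = K + sigma2 * np.eye(len(X))
mean = K_s @ np.linalg.solve(A, y)                       # posterior mean
cov = K_ss - K_s @ np.linalg.solve(A, K_s.T)             # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0, None))            # pointwise +/- std.dev.
print(mean[:5], std[:5])
```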

(46)

Noisy observations: examples

[Figure, left: noisy observations $y = f(x) + \eta$ around $f(x) = 0.5x$; right: noisy observations around $f(x) = \sin(x)/x$.]

Noisy observations: $y = f(x) + \eta$, $\eta \sim \mathcal{N}(0, \sigma^2)$.
Mean predictions: $\hat{f}_* = K_*(K + \sigma^2 I)^{-1} y$.

(47)

Gaussian processes for regression

[Figure, three panels: "now with some noise..." (mean prediction with error band), "Posterior sample", "Prior samples".]

Left: 11 training points generated as $y = \sin(x)/x + \nu$, $\nu \sim \mathcal{N}(0, 0.01)$. Covariance $k(x_p, x_q) = \exp(-\frac{1}{2l^2}\|x_p - x_q\|^2) + \sigma^2 \delta_{p,q}$. 100 test points uniformly chosen from $[-10, 10]$ ⇒ matrix $X_*$. Mean prediction $E[f_* \mid X_*, X, y]$ and ±std.dev.

Middle: samples drawn from the posterior $f_* \mid X_*, X, y$.

Right: samples drawn from the prior $f \sim \mathcal{N}(0, K(X, X))$.

(48)

Covariance Functions

A GP specifies a distribution over functions $f(x)$, characterized by a mean function $m(x)$ and a covariance function $k(x_i, x_j)$.

A finite subset evaluated at $n$ inputs follows a Gaussian distribution:
$$f(X) = (f(x_1), \ldots, f(x_n))^t \sim \mathcal{N}(\mu, K),$$
where $K$ is the covariance matrix with entries $K_{ij} = k(x_i, x_j)$.

Covariance matrices are symmetric positive semi-definite:
$$K_{ij} = K_{ji} \quad\text{and}\quad x^t K x \ge 0\ \ \forall x.$$

We already know that Mercer kernels have this property ⇒ all Mercer kernels define proper covariance functions in GPs.

Kernels frequently have additional parameters.
The noise variance in the observation model $y = f(x) + \eta$, $\eta \sim \mathcal{N}(0, \sigma^2)$, is another parameter.

How should we choose these parameters? ⇒ model selection.

(49)

Model Selection

top left: sample function from the prior $f \sim \mathcal{N}(0, K(X, X))$ with covariance function $k(x, x') = \exp(-\frac{1}{2l^2}\|x - x'\|^2)$. Length scale $l = 10^{-0.5}$ ⇒ small ⇒ highly varying function.

bottom left: same for $l = 10^{0}$ ⇒ smoother function.

top right: same for $l = 10^{0.5}$ ⇒ even smoother...

bottom right: almost linear function for $l = 10^{1}$.

[Figure: four surface plots over $(x_1, x_2)$, one sample each for length scales $10^{-0.5}$, $10^{0}$, $10^{0.5}$, $10^{1}$.]

(50)

Model Selection (2)

How to select the parameters?

One possibility: maximize the marginal likelihood:
$$p(y|X) = \int p(y|f, X)\, p(f|X)\, df.$$

We do not need to integrate: we know that $f|X \sim \mathcal{N}(0, K)$ and $y = f + \eta$, $\eta \sim \mathcal{N}(0, \sigma^2)$.

Since $\eta$ does not depend on $X$, the variances simply add:
$$y|X \sim \mathcal{N}(0, K + \sigma^2 I).$$

Possible strategy: select the parameters on a grid and choose the maximum, as sketched below.
Or: compute derivatives of the marginal likelihood and use gradient descent.
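A small sketch (illustration only) of the grid strategy: evaluate the log marginal likelihood $\log p(y|X)$ for a few length scales and pick the best; the data and grid values are made up.

```python
import numpy as np

def log_marginal_likelihood(X, y, l, sigma2):
    """log p(y|X) for y|X ~ N(0, K + sigma^2 I) with an RBF covariance of length scale l."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    C = np.exp(-d2 / (2 * l**2)) + sigma2 * np.eye(len(X))
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(11, 1))
y = np.sin(X[:, 0]) / X[:, 0] + rng.normal(0, 0.1, 11)

# Evaluate the marginal likelihood on a grid of length scales and pick the best one
grid = [10**e for e in (-0.5, 0.0, 0.5, 1.0)]
scores = {l: log_marginal_likelihood(X, y, l, sigma2=0.01) for l in grid}
print(max(scores, key=scores.get), scores)
```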
