(1)

Machine Learning

Neuron, linear classifiers

17/11/2014

(2)

Neuron

Human vs. Computer

(two nice pictures from Wikipedia)

(3)

Neuron (McCulloch and Pitts, 1943)

Input: $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$

Weights: $w = (w_1, \ldots, w_n) \in \mathbb{R}^n$ and a bias $b \in \mathbb{R}$

Activation: $a = \langle w, x \rangle + b = \sum_i w_i x_i + b$

Output: $y = f(a)$

Step-function: $f(a) = 1$ if $a \geq 0$, $0$ otherwise

Sigmoid-function (differentiable!!!): $f(a) = \dfrac{1}{1 + e^{-a}}$
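For concreteness, a minimal Python sketch of such a neuron; the function names and the convention that the step fires at $a \geq 0$ are my own choices, not taken from the slides.

```python
import numpy as np

def step(a):
    """Step activation: 1 if the activation is non-negative, else 0."""
    return 1.0 if a >= 0 else 0.0

def sigmoid(a):
    """Sigmoid activation: smooth, differentiable alternative to the step."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b, f=step):
    """McCulloch-Pitts style neuron: activation <w, x> + b passed through f."""
    a = np.dot(w, x) + b
    return f(a)

# Example: a 2-input neuron evaluated with both activation functions
x = np.array([0.5, -1.0])
w = np.array([1.0, 2.0])
b = 0.3
print(neuron(x, w, b, step), neuron(x, w, b, sigmoid))
```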

(4)

Geometric interpretation

Let $w$ be normalized, i.e. $\|w\| = 1$. Then $\langle w, x \rangle$ is the length of the projection of $x$ onto $w$.

Separation plane: $\langle w, x \rangle + b = 0$, a hyperplane with normal $w$.

The neuron implements a linear classifier.

(5)

Special case − Boolean functions

Input: $x_i \in \{0, 1\}$

Output: $y \in \{0, 1\}$

Find $w$ and $b$ so that the neuron computes a given Boolean function, e.g. the conjunction $y = x_1 \wedge \ldots \wedge x_n$.

Conjunction, disjunction and many other Boolean functions can be realized this way, but not XOR (it is not linearly separable).
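As an illustration (the particular weights and the brute-force grid are my own, not from the slides): a single step-neuron realizes AND and OR, while a coarse search finds no weights reproducing XOR.

```python
import itertools
import numpy as np

def step_neuron(x, w, b):
    return 1 if np.dot(w, x) + b >= 0 else 0

inputs = list(itertools.product([0, 1], repeat=2))

# AND: w = (1, 1), b = -1.5 fires only for x = (1, 1)
print([step_neuron(x, np.array([1.0, 1.0]), -1.5) for x in inputs])

# OR: w = (1, 1), b = -0.5 fires for every x except (0, 0)
print([step_neuron(x, np.array([1.0, 1.0]), -0.5) for x in inputs])

# XOR: no (w, b) on a coarse grid reproduces the truth table 0, 1, 1, 0
xor = [0, 1, 1, 0]
grid = np.linspace(-2, 2, 21)
found = any(
    [step_neuron(x, np.array([w1, w2]), b) for x in inputs] == xor
    for w1 in grid for w2 in grid for b in grid
)
print("XOR realizable on this grid:", found)  # False
```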

(6)

Learning

Given: training data $(x^l, y^l)$, $l = 1, \ldots, L$, with $y^l \in \{0, 1\}$

Find: $w$ and $b$ so that the neuron reproduces $y^l$ for all $l$. For a step-neuron this is a system of linear inequalities:

$\langle w, x^l \rangle + b \geq 0$ if $y^l = 1$, and $\langle w, x^l \rangle + b < 0$ if $y^l = 0$.

The solution is not unique in general!

(7)

“Preparation 1”

Eliminate the bias: replace $\langle w, x \rangle + b$ by a pure scalar product.

The trick − modify the training data: append a constant coordinate, $\tilde{x} = (x_1, \ldots, x_n, 1)$ and $\tilde{w} = (w_1, \ldots, w_n, b)$, so that $\langle \tilde{w}, \tilde{x} \rangle = \langle w, x \rangle + b$.

Example in 1D: data that are non-separable without the bias become separable without the bias after this augmentation, because the extra constant coordinate lifts the points into 2D.

(8)

“Preparation 2”

Remove the sign: we want all constraints in the form $\langle \tilde{w}, z \rangle > 0$.

The trick − the same, modify the data: set $z^l = \tilde{x}^l$ for all $l$ with $y^l = 1$ and $z^l = -\tilde{x}^l$ for all $l$ with $y^l = 0$.

All in all, the learning task becomes: find $\tilde{w}$ with $\langle \tilde{w}, z^l \rangle > 0$ for all $l$.
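Both preparation steps can be sketched in a few lines of Python (variable and function names are assumptions of mine):

```python
import numpy as np

def prepare(X, y):
    """Preparation 1: append a constant 1 so the bias becomes an ordinary weight.
    Preparation 2: negate the examples of class 0 so all constraints read <w, z> > 0."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # (x_1, ..., x_n, 1)
    signs = np.where(y == 1, 1.0, -1.0)                # +1 for class 1, -1 for class 0
    return X_aug * signs[:, None]

# Tiny example: two 2D points, one per class
X = np.array([[2.0, 1.0], [0.5, -1.0]])
y = np.array([1, 0])
Z = prepare(X, y)
print(Z)   # second row is negated, both rows carry the extra bias coordinate
```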

(9)

Perceptron Algorithm (Rosenblatt, 1958)

Solution of a system of linear inequalities $\langle w, z^l \rangle > 0$:

1. Search for an inequality that is not satisfied, i.e. a data point $z^l$ with $\langle w, z^l \rangle \leq 0$.

2. If none is found − stop; else update $w \leftarrow w + z^l$

go to 1.

• The algorithm terminates if a solution exists (the training data are separable)

• The solution is a convex combination of the data points
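A direct Python sketch of the update rule above, operating on the prepared points $z^l$; the iteration cap is my own safeguard, the slides' algorithm simply loops until no violated inequality remains.

```python
import numpy as np

def perceptron(Z, max_iter=10000):
    """Rosenblatt's perceptron on prepared data: find w with <w, z> > 0 for all rows z of Z."""
    w = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        violated = [z for z in Z if np.dot(w, z) <= 0]   # step 1: inequalities not satisfied
        if not violated:                                  # step 2: none found -> stop
            return w
        w = w + violated[0]                               # otherwise: add a violating point, repeat
    raise RuntimeError("no separating w found (data may not be separable)")

# Example on already prepared data (bias coordinate appended, signs flipped)
Z = np.array([[ 2.0, 1.0, 1.0],
              [-0.5, 1.0, 1.0]])
w = perceptron(Z)
print(w, all(np.dot(w, z) > 0 for z in Z))
```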

(10)

Proof of convergence

The idea: look for quantities that a) grow/decrease quite fast, b) are bounded.

Consider the length of $w$ at the $n$-th iteration:

$\|w_{n+1}\|^2 = \|w_n + z\|^2 = \|w_n\|^2 + 2\langle w_n, z \rangle + \|z\|^2 \leq \|w_n\|^2 + R^2$

with $\langle w_n, z \rangle \leq 0$, because $z$ was added by the algorithm only if its inequality was violated, and $R = \max_l \|z^l\|$. Hence $\|w_n\|^2 \leq n \cdot R^2$.

(11)

Proof of convergence

Another quantity − the projection of $w_n$ onto a solution $w^*$:

$\langle w_{n+1}, w^* \rangle = \langle w_n, w^* \rangle + \langle z, w^* \rangle \geq \langle w_n, w^* \rangle + \delta$

with $\delta = \min_l \langle z^l, w^* \rangle$ − the margin, $\delta > 0$ because $w^*$ is a solution. Hence $\langle w_n, w^* \rangle \geq n \cdot \delta$.

(12)

Proof of convergence

All together: $\langle w_n, w^* \rangle \geq n \delta$ and $\|w_n\| \leq \sqrt{n} \cdot R$.

But $\langle w_n, w^* \rangle \leq \|w_n\| \cdot \|w^*\|$ (Cauchy-Schwarz inequality). So $n \delta \leq \sqrt{n} \cdot R \cdot \|w^*\|$ and finally $n \leq \dfrac{R^2 \|w^*\|^2}{\delta^2}$.

If the solution exists, the algorithm converges after at most $R^2 \|w^*\|^2 / \delta^2$ steps.

(13)

An example problem

Consider another decision rule for a real-valued feature $x$, e.g. $y = 1$ iff $a_0 + a_1 x + a_2 x^2 + \ldots + a_d x^d \geq 0$.

It is not a linear classifier anymore but a polynomial one.

The task is again to learn the unknown coefficients $a_0, \ldots, a_d$ given the training data.

Is it also possible to do that in a “Perceptron-like” fashion ?

(14)

An example problem

The idea: reduce the given problem to the Perceptron-task.

Observation: although the decision rule is not linear with respect to $x$, it is still linear with respect to the unknown coefficients $a_0, \ldots, a_d$.

The same trick again − modify the data: replace each $x$ by the feature vector $(1, x, x^2, \ldots, x^d)$; then the decision rule is an ordinary scalar product with the coefficient vector.

In general, it is very often possible to learn non-linear decision rules by the Perceptron algorithm using an appropriate transformation of the input space (further extension − SVM).
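A sketch of this transformation in Python (the degree and helper names are my own); in the transformed space the polynomial rule is an ordinary scalar product, so the prepared points can be handed to the Perceptron algorithm unchanged.

```python
import numpy as np

def poly_features(x, degree):
    """Map a scalar feature x to the vector (1, x, x^2, ..., x^degree)."""
    return np.array([x ** i for i in range(degree + 1)])

# The polynomial rule sign(sum_i a_i * x^i) is linear in the coefficients a,
# so each training example (x, y) becomes a prepared point z = +/- poly_features(x, d).
X = np.array([-2.0, -0.5, 0.5, 2.0])
y = np.array([1, 0, 0, 1])            # positive far from the origin: not separable in 1D, needs x^2
Z = np.array([poly_features(x, 2) * (1 if label == 1 else -1)
              for x, label in zip(X, y)])
print(Z)
```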

(15)

Many classes

Before: two classes − a mapping $f: \mathbb{R}^n \to \{0, 1\}$. Now: many classes − a mapping $f: \mathbb{R}^n \to K$ with $|K| > 2$. How to generalize? How to learn?

Two simple (straightforward) approaches:

The first one: “one vs. all” − there is one binary classifier per class that separates this class from all the others.

The classification is ambiguous in some areas.
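A small hand-made illustration of that ambiguity (the three weight vectors below are arbitrary choices of mine, not from the slides): some points are claimed by exactly one binary classifier, some by none, some by several.

```python
import numpy as np

# Three "one vs. all" classifiers in 2D: classifier k answers "class k" when <w_k, x> + b_k >= 0.
ws = np.array([[ 1.0,  0.0],
               [-1.0,  0.0],
               [ 0.0,  1.0]])
bs = np.array([-1.0, -1.0, -1.0])

def votes(x):
    """Which of the binary classifiers claim the point x?"""
    return [k for k in range(len(ws)) if np.dot(ws[k], x) + bs[k] >= 0]

print(votes(np.array([2.0, 0.0])))   # [0]     -> unambiguous: class 0
print(votes(np.array([0.0, 0.0])))   # []      -> no classifier claims the point
print(votes(np.array([2.0, 2.0])))   # [0, 2]  -> two classifiers claim it: ambiguous
```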

(16)

Many classes

Another one:

“pairwise classifiers” − there is a binary classifier for each pair of classes.

The goal:

• no ambiguities,

• only $|K|$ parameter vectors.

Less ambiguous, better separable.

However: $|K| \cdot (|K| - 1) / 2$ binary classifiers instead of $|K|$ in the previous case.

(17)

Fisher Classifier

Idea: in the binary case the output is the more likely to be “1” the greater the scalar product $\langle w, x \rangle$ is → generalization: decide for the class $k^* = \arg\max_k \langle w_k, x \rangle$.

The input space is partitioned into a set of convex cones.

Geometric interpretation (let all $w_k$ be normalized): consider the projections of an input vector $x$ onto the vectors $w_k$; the class whose vector gives the longest projection wins.

(18)

Fisher Classifier

Given: training set $(x^l, k^l)$, $l = 1, \ldots, L$

To be learned: weighting vectors $w_k$, $k \in K$. The task is to choose them so that $\langle w_{k^l}, x^l \rangle > \langle w_k, x^l \rangle$ for all $l$ and all $k \neq k^l$.

It can be equivalently written as $\langle w_{k^l} - w_k, x^l \rangle > 0$

− a system of linear inequalities, but a “heterogeneous” one: each inequality couples two different unknown vectors.

The trick − transformation of the input/parameter space.

(19)

Fisher Classifier

Example for three classes: consider e.g. a training example $(x, k=1)$; it leads to the following inequalities: $\langle w_1 - w_2, x \rangle > 0$ and $\langle w_1 - w_3, x \rangle > 0$.

Let us define the new parameter vector as $w = (w_1, w_2, w_3)$,

i.e. we “concatenate” all $w_k$ to a single vector.

For each inequality (see the example above) we introduce a “data point”: $z = (x, -x, 0)$ for the first and $z = (x, 0, -x)$ for the second.

→ all inequalities are written in the form of a scalar product $\langle w, z \rangle > 0$. Solution by the Perceptron Algorithm.
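A sketch of this construction in Python (helper names are mine; the construction is often called the Kesler construction): each training pair $(x, k)$ yields $|K|-1$ concatenated “data points”, and classification picks the block with the largest scalar product.

```python
import numpy as np

def concat_points(x, k, num_classes):
    """For a training example of class k, build one concatenated 'data point'
    per competing class j != k: the block of class k holds +x, the block of j holds -x."""
    d = len(x)
    points = []
    for j in range(num_classes):
        if j == k:
            continue
        z = np.zeros(num_classes * d)
        z[k * d:(k + 1) * d] = x
        z[j * d:(j + 1) * d] = -x
        points.append(z)
    return points

def predict(w, x, num_classes):
    """Fisher classifier: pick the class whose block of w has the largest scalar product with x."""
    d = len(x)
    scores = [np.dot(w[k * d:(k + 1) * d], x) for k in range(num_classes)]
    return int(np.argmax(scores))

# Example: one training point x of class 1 out of three classes
x = np.array([1.0, 2.0])
for z in concat_points(x, 1, 3):
    print(z)   # (-x, x, 0) and (0, x, -x): the inequalities <w, z> > 0
```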

(20)

Conclusion

Remember the „hierarchy of abstraction“:

• Statistical generative models – Maximum Likelihood

• Statistical discriminative models – Conditional Likelihood

• Discriminant functions – empirical risk minimization

Today: discriminative learning

• Neuron – linear classifier

• Perceptron Algorithm – simple update rule, convergence

• Fisher classifier – „Multiclass Perceptron“

Next Lecture: Exponential family – a model class to illustrate all previous approaches in a unified manner.
