Intelligent Systems
Discriminative Models
1. Neurons and Neural Networks 2. Support Vector Machines
Neuron
Human vs. Computer
(two nice pictures from Wikipedia)
Neuron (McCulloch and Pitts, 1943)
Input: $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$
Weights: $w = (w_1, w_2, \ldots, w_n) \in \mathbb{R}^n$, bias $b \in \mathbb{R}$
Activation: $a = \langle w, x \rangle + b = \sum_i w_i x_i + b$
Output: $y = f(a)$
Step-function: $y = +1$ if $\langle w, x \rangle + b \geq 0$,
$y = -1$ otherwise
Geometric interpretation
Let $w$ be normalized, i.e. $\|w\| = 1$. Then $\langle w, x \rangle$ is
the length of the projection of $x$ onto $w$.
Separation plane: $\langle w, x \rangle + b = 0$
Neuron implements a linear classifier
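A minimal sketch of such a step-neuron in code (the names step_neuron, w, b are illustrative, not from the lecture):

```python
import numpy as np

def step_neuron(x, w, b):
    """Linear classifier: +1 if <w, x> + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Example: a neuron in 2D
w, b = np.array([1.0, -2.0]), 0.5
print(step_neuron(np.array([3.0, 1.0]), w, b))   # +1 (activation = 1.5)
print(step_neuron(np.array([0.0, 1.0]), w, b))   # -1 (activation = -1.5)
```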
Learning
Given: training data $(x^l, y^l)$, $l = 1, \ldots, L$, with $x^l \in \mathbb{R}^n$, $y^l \in \{-1, +1\}$
Find: $w, b$ so that the neuron classifies $x^l$ correctly for all $l$. For a step-neuron: system of linear inequalities
$\langle w, x^l \rangle + b > 0$ if $y^l = +1$
$\langle w, x^l \rangle + b < 0$ if $y^l = -1$
“Preparation 1” *
Eliminate the bias: $\langle w, x \rangle + b \;\to\; \langle \tilde w, \tilde x \rangle$
The trick − modify the training data: append a constant component
$\tilde x = (x_1, \ldots, x_n, 1), \qquad \tilde w = (w_1, \ldots, w_n, b)$
so that $\langle \tilde w, \tilde x \rangle = \langle w, x \rangle + b$.
Example in 1D: the data $x \in \mathbb{R}$ become points $(x, 1)$ in the plane; a threshold on the line corresponds to a plane through the origin in 2D.
“Preparation 2” *
Remove the sign: make all inequalities have the same direction.
The trick − the same, modify the training data:
replace $\tilde x^l$ by $-\tilde x^l$ for all $l$ with $y^l = -1$, keep $\tilde x^l$ for all $l$ with $y^l = +1$.
All in all: find $\tilde w$ such that
$\langle \tilde w, \tilde x^l \rangle > 0$ for all $l$.
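Both preparations together as a small sketch (assuming labels in {-1, +1}; the function name prepare is illustrative):

```python
import numpy as np

def prepare(X, y):
    """Preparation 1: append a constant 1 to every x (absorbs the bias b).
    Preparation 2: flip the sign of all examples with y = -1.
    Afterwards every constraint reads <w~, x~> > 0."""
    X_aug = np.hstack([X, np.ones((len(X), 1))])   # x~ = (x1, ..., xn, 1)
    return X_aug * np.asarray(y)[:, None]          # multiply row l by y^l
```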
Perceptron Algorithm (Rosenblatt, 1958)
Solution of a system of linear inequalities $\langle w, \tilde x^l \rangle > 0$:
1. Search for an inequality that is not satisfied, i.e. $\langle w, \tilde x^l \rangle \leq 0$
2. If not found − Stop, else update $w \leftarrow w + \tilde x^l$,
go to 1.
• The algorithm terminates in a finite number of steps if a solution exists (the training data are separable)
• The solution is a convex combination of the data points
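A sketch of the algorithm on the prepared data (continuing the prepare example above; names are illustrative):

```python
import numpy as np

def perceptron(X_prep, max_iter=10000):
    """Find w with <w, x> > 0 for every row x of the prepared data."""
    w = np.zeros(X_prep.shape[1])
    for _ in range(max_iter):
        violated = [x for x in X_prep if np.dot(w, x) <= 0]  # step 1
        if not violated:
            return w                     # step 2: no violation -> stop
        w = w + violated[0]              # step 2: update and repeat
    raise RuntimeError("not separable within max_iter updates")
```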
An example problem
Consider another decision rule for a real-valued feature $x \in \mathbb{R}$:
decide $y = +1$ if $c_0 + c_1 x + c_2 x^2 + \ldots + c_m x^m > 0$ and $y = -1$ otherwise.
It is not a linear classifier anymore but a polynomial one.
The task is again to learn the unknown coefficients $c_0, \ldots, c_m$ given the training data $(x^l, y^l)$.
Is it also possible to do that in a “Perceptron-like” fashion?
An example problem
The idea: reduce the given problem to the Perceptron-task.
Observation: although the decision rule is not linear with respect to $x$, it is still linear with respect to the unknown coefficients $c_0, \ldots, c_m$.
The same trick again − modify the data: map each $x^l$ to $\tilde x^l = \bigl( 1, x^l, (x^l)^2, \ldots, (x^l)^m \bigr)$, so that the rule becomes $\langle c, \tilde x \rangle > 0$.
In general, it is very often possible to learn non-linear decision rules by the Perceptron algorithm using an appropriate transformation of the input space (more examples at seminars).
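For the polynomial rule above, the transformation can look as follows (a sketch; poly_features is an illustrative name):

```python
import numpy as np

def poly_features(x, degree):
    """Map a scalar x to (1, x, x^2, ..., x^degree)."""
    return np.array([x ** i for i in range(degree + 1)])

# The polynomial rule sign(sum_i c_i x^i) becomes the linear rule sign(<c, phi(x)>),
# so the Perceptron algorithm above can be run on the transformed data:
X_poly = np.array([poly_features(x, degree=3) for x in [-2.0, -0.5, 0.5, 2.0]])
```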
Many classes
Before: two classes − a mapping $f: \mathbb{R}^n \to \{-1, +1\}$. Now: many classes − a mapping $f: \mathbb{R}^n \to \{1, 2, \ldots, K\}$. How to generalize? How to learn?
Two simple (straightforward) approaches:
The first one: “one vs. all” − there is one binary classifier per class that separates this class from all others.
The classification is ambiguous in some areas.
Many classes
Another one:
“pairwise classifiers” − there is a classifier for each class pair
The goal:
• no ambiguities,
• good separability.
Less ambiguous, better separable than “one vs. all”.
However:
$K(K-1)/2$ binary classifiers are needed
instead of $K$ in the previous case.
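A small sketch of the two counts and of one common way (argmax of the scores, an assumption, not necessarily the lecture's choice) to resolve “one vs. all” ambiguities:

```python
import numpy as np

K = 5
print("one-vs-all classifiers:", K)                  # 5
print("pairwise classifiers:  ", K * (K - 1) // 2)   # 10

def one_vs_all_predict(x, classifiers):
    """classifiers[k](x) returns the score 'class k vs. the rest';
    the class with the largest score wins."""
    return int(np.argmax([f(x) for f in classifiers]))
```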
⇒ Fisher Classifier
Feed-Forward Neural Networks
(Diagram: the network levels, from the input level through the first level, …, the i-th level, …, to the output level.)
Learning – Error Back-Propagation
Learning task:
Given: training data $(x^l, y^l)$, $l = 1, \ldots, L$
Find: all weights and biases of the net.
Error Back-Propagation is a gradient descent method for Feed-Forward Networks with sigmoid neurons.
First, we need an objective (error to be minimized), e.g. the squared error
$E(w) = \frac{1}{2} \sum_l \bigl\| y(x^l; w) - y^l \bigr\|^2$
Now: derive, build the gradient and go.
Error Back-Propagation *
We start from a single neuron and just one example $(x, y^*)$. Remember:
$y = \sigma(a), \qquad a = \langle w, x \rangle, \qquad \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad E = \frac{1}{2}(y - y^*)^2$
Derivation according to the chain-rule:
$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial a} \cdot \frac{\partial a}{\partial w_i} = (y - y^*) \cdot \sigma(a)\bigl(1 - \sigma(a)\bigr) \cdot x_i$
Error Back-Propagation *
The “problem”: for intermediate neurons the errors are not known! Now a bit more complex: a hidden neuron $j$ feeds all neurons $k$ of the next level, so the chain-rule gives
$\delta_j = \frac{\partial E}{\partial a_j} = \sigma'(a_j) \sum_k w_{jk} \, \delta_k$
with:
$\delta_k$ − the errors of the next level, $a_j$ − the activation of neuron $j$, and $\sigma'(a_j) = \sigma(a_j)\bigl(1 - \sigma(a_j)\bigr)$.
Error Back-Propagation
In general: compute the “errors” $\delta$ at the i-th level from all $\delta$-s at the (i+1)-th level – propagate the error.
The Algorithm (for just one example $(x, y^*)$):
1. Forward: compute all activations $a$ and outputs $y$ (apply the network), compute the output error $\delta$ at the last level;
2. Backward: compute the errors in the intermediate levels: $\delta_j = \sigma'(a_j) \sum_k w_{jk} \, \delta_k$;
3. Compute the gradient $\partial E / \partial w_{jk} = \delta_k \, y_j$ and go (one gradient descent step).
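A minimal sketch of one back-propagation step for a network with one hidden level of sigmoid neurons, a single example and the squared error $E = \frac{1}{2}\|y - y^*\|^2$ (all names are illustrative; biases are omitted, they can be absorbed as in “Preparation 1”):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y_target, W1, W2, lr=0.1):
    # 1. Forward: activations and outputs of both levels
    a1 = W1 @ x;  y1 = sigmoid(a1)                 # hidden level
    a2 = W2 @ y1; y2 = sigmoid(a2)                 # output level

    # 2. Backward: output error, then propagate it to the hidden level
    delta2 = (y2 - y_target) * y2 * (1 - y2)       # dE/da at the output
    delta1 = (W2.T @ delta2) * y1 * (1 - y1)       # dE/da at the hidden level

    # 3. Gradient dE/dW = outer(delta, input of that level) and descent step
    W2 -= lr * np.outer(delta2, y1)
    W1 -= lr * np.outer(delta1, x)
    return W1, W2
```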
A special type (for CV) – Convolutional Networks
Local features – convolutions with a set of predefined masks (see lectures “Computer Vision”).
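A sketch of such a convolution with one predefined mask (plain numpy; as usual in CNNs the mask is applied without flipping, i.e. as a cross-correlation):

```python
import numpy as np

def convolve2d(image, mask):
    """'Valid' convolution of a grayscale image with a small mask."""
    H, W = image.shape
    h, w = mask.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * mask)
    return out

# Example of a predefined mask: a vertical-edge detector (Sobel)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
```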
Neural Networks – summary
1. Single neurons are linear classifiers.
2. Learning by the Perceptron-algorithm.
3. Many non-linear decision rules can be transformed to linear ones by the corresponding transformation of the input space.
4. Classification into more than two classes is possible either by naïve approaches (e.g. by a set of “one-vs-all” simple classifiers) or, more principled, by the Fisher classifier.
5. Feed-Forward Neural Networks implement (arbitrarily complex) decision strategies.
6. Error back-propagation for learning (gradient descent).
7. Some special cases are useful in Computer Vision.
Two learning tasks
Let a training dataset be given with (i) data $x^l \in \mathbb{R}^n$ and (ii) classes $y^l \in \{-1, +1\}$, $l = 1, \ldots, L$.
The goal is to find a hyperplane $\langle w, x \rangle + b = 0$ that separates the data (correctly)
________________________________________________________
Now: The goal is to find a “corridor”
(stripe) of the maximal width that separates the data (correctly).
Linear Support Vector Machine
Remember that the solution is defined only up to a common scale: $(w, b)$ and $(\lambda w, \lambda b)$, $\lambda > 0$, define the same plane
→ Use the canonical (with respect to the learning data) form in order to avoid ambiguity:
$\min_l \bigl| \langle w, x^l \rangle + b \bigr| = 1$
The margin: the width of the corridor is $2 / \|w\|$.
The optimization problem:
$\frac{1}{2}\|w\|^2 \to \min_{w,b} \quad \text{s.t.} \quad y^l \bigl( \langle w, x^l \rangle + b \bigr) \geq 1 \;\; \text{for all } l$
Linear SVM *
The Lagrangian of the problem:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_l \alpha_l \bigl[ y^l \bigl( \langle w, x^l \rangle + b \bigr) - 1 \bigr], \qquad \alpha_l \geq 0$
(minimize over $w, b$, maximize over $\alpha$).
The meaning of the dual variables $\alpha_l$:
a) $y^l \bigl( \langle w, x^l \rangle + b \bigr) < 1$ (a constraint is broken) → maximization
wrt. $\alpha_l$ gives $+\infty$ (surely not a minimum); b) $y^l \bigl( \langle w, x^l \rangle + b \bigr) > 1$ → maximization wrt. $\alpha_l$ gives $\alpha_l = 0$ →
no influence on the Lagrangian
c) $y^l \bigl( \langle w, x^l \rangle + b \bigr) = 1$ → $\alpha_l$ does not matter, the vector is
located “on the wall of the corridor” – Support Vector
Linear SVM *
Lagrangian:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_l \alpha_l \bigl[ y^l \bigl( \langle w, x^l \rangle + b \bigr) - 1 \bigr]$
Derivatives:
$\frac{\partial L}{\partial w} = w - \sum_l \alpha_l y^l x^l = 0 \;\Rightarrow\; w = \sum_l \alpha_l y^l x^l$
$\frac{\partial L}{\partial b} = -\sum_l \alpha_l y^l = 0 \;\Rightarrow\; \sum_l \alpha_l y^l = 0$
Linear SVM *
Substitute $w = \sum_l \alpha_l y^l x^l$ into the decision rule and obtain
$f(x) = \mathrm{sign}\Bigl( \sum_l \alpha_l y^l \langle x^l, x \rangle + b \Bigr)$
→ the vector $w$ is not needed explicitly!
The decision rule can be expressed as a linear combination of scalar products with the support vectors.
Only the strictly positive $\alpha_l$ (i.e. those corresponding to the support vectors) are necessary for that.
Linear SVM *
Substitute $w = \sum_l \alpha_l y^l x^l$
into the Lagrangian
and obtain the dual task:
$\sum_l \alpha_l - \frac{1}{2} \sum_l \sum_{l'} \alpha_l \alpha_{l'} \, y^l y^{l'} \langle x^l, x^{l'} \rangle \;\to\; \max_{\alpha}$
subject to $\alpha_l \geq 0$ for all $l$ and $\sum_l \alpha_l y^l = 0$.
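The dual task is a quadratic program; a minimal sketch that hands it to a general-purpose solver (scipy's SLSQP; all names are illustrative, real SVM packages use specialized QP solvers):

```python
import numpy as np
from scipy.optimize import minimize

def train_linear_svm_dual(X, y):
    """X: (L, n) data, y: (L,) labels in {-1, +1}."""
    L = len(y)
    G = (X * y[:, None]) @ (X * y[:, None]).T       # G[l, m] = y^l y^m <x^l, x^m>

    def neg_dual(alpha):                            # minimize the negated dual
        return 0.5 * alpha @ G @ alpha - alpha.sum()

    res = minimize(neg_dual, np.zeros(L),
                   bounds=[(0, None)] * L,                              # alpha_l >= 0
                   constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_l alpha_l y^l = 0
    alpha = res.x
    w = (alpha * y) @ X                             # w = sum_l alpha_l y^l x^l
    sv = alpha > 1e-6                               # support vectors
    b = np.mean(y[sv] - X[sv] @ w)                  # from y^l(<w, x^l> + b) = 1
    return w, b, alpha
```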
Feature spaces
1. The input space is mapped onto a feature space by a non-linear transformation $\phi: \mathbb{R}^n \to \mathbb{R}^N$, $x \mapsto \phi(x)$.
2. The data are separated (classified) by a linear decision rule in the feature space: $f(x) = \mathrm{sign}\bigl( \langle w, \phi(x) \rangle + b \bigr)$.
Example: a quadratic classifier for $x \in \mathbb{R}^2$.
The transformation is $\phi(x) = \bigl( x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2 \bigr)$.
Feature spaces
The images $\phi(x)$ are not explicitly necessary in order to find the separating plane in the feature space − only their scalar products $\langle \phi(x), \phi(x') \rangle$ are needed.
For the example above:
$\langle \phi(x), \phi(x') \rangle = x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2\, x_1 x_2\, x_1' x_2' = \langle x, x' \rangle^2$
→ the scalar product can be computed in the input space; it is not necessary to map the data points into the feature space explicitly.
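This identity is easy to check numerically (a small sketch for the quadratic map above):

```python
import numpy as np

def phi(x):
    """Quadratic feature map for x in R^2."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # 1.0
print(np.dot(x, z) ** 2)        # 1.0 -> the same value, computed in the input space
```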
Kernels
A kernel is a function that computes the scalar product in a feature space:
$\kappa(x, x') = \langle \phi(x), \phi(x') \rangle$
Neither the corresponding feature space nor the mapping $\phi$ needs to be specified explicitly → “Black Box”.
Alternative definition: a function $\kappa$ is a kernel if there exists a mapping $\phi$ such that $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle$. The corresponding feature space is called the Hilbert space induced by the kernel $\kappa$. Let a function $\kappa$ be given. Is it a kernel?
→ Mercer’s theorem.
Kernels
Let $\kappa_1$ and $\kappa_2$ be two kernels.
Then e.g. $\kappa_1 + \kappa_2$, $\kappa_1 \cdot \kappa_2$ and $\lambda \kappa_1$ (for $\lambda > 0$) are kernels as well
(there are also other possibilities to build kernels from kernels).
Popular Kernels:
• Polynomial: $\kappa(x, x') = \bigl( \langle x, x' \rangle + 1 \bigr)^d$
• Sigmoid: $\kappa(x, x') = \tanh\bigl( \alpha \langle x, x' \rangle + \beta \bigr)$
• Gaussian: $\kappa(x, x') = \exp\bigl( -\|x - x'\|^2 / \sigma^2 \bigr)$ (interesting: what happens as $\sigma \to 0$?)
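The three kernels as code (the parameter values d, alpha, beta, sigma are illustrative defaults):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    return (np.dot(x, z) + 1) ** d

def sigmoid_kernel(x, z, alpha=1.0, beta=0.0):
    return np.tanh(alpha * np.dot(x, z) + beta)

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / sigma ** 2)
```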
An example
The decision rule with a Gaussian kernel:
$f(x) = \mathrm{sign}\Bigl( \sum_l \alpha_l y^l \exp\bigl( -\|x - x^l\|^2 / \sigma^2 \bigr) + b \Bigr)$
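Evaluating this rule only needs the support vectors, their $\alpha_l$, $y^l$ and $b$ (a sketch; the names are illustrative):

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b, sigma=1.0):
    """sign( sum_l alpha_l y^l exp(-||x - x^l||^2 / sigma^2) + b )."""
    k = np.exp(-np.sum((support_vectors - x) ** 2, axis=1) / sigma ** 2)
    return np.sign(np.dot(alphas * labels, k) + b)
```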
SVM – Summary
• SVM is a representative of discriminative learning – i.e. with all the corresponding advantages (power) and drawbacks (overfitting) – remember e.g. the Gaussian kernel as $\sigma \to 0$
• The building block – linear classifiers. All formalisms can be
expressed in terms of scalar products – the data are not needed explicitly.
• Feature spaces – make non-linear decision rules in the input spaces possible.
• Kernels – scalar products in feature spaces; the latter need not necessarily be defined explicitly.