Machine Learning
Neural Networks
Outline
Before (discriminative learning):
1. Linear classifiers
2. SVMs: linear classifiers in feature spaces
3. AdaBoost: combination of linear classifiers
Today:
1. Feed-Forward neural networks: further classifier combination — a "hierarchical" one
2. Hopfield networks — structured output
3. Stochastic extensions
Feed-Forward Networks
[Figure: layered network architecture – input level, first level, …, i-th level, output level]
Special case: Step-neurons – the network implements a mapping from the input vector to binary outputs.
Note: the "combined classifier" from the previous lecture is a Feed-Forward network with only one hidden layer.
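As an illustration, a minimal numpy sketch of such a network (the layer sizes and the random weights are made-up placeholders, not from the lecture):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, layers):
    # layers: list of (W, b) pairs, one per level of the net
    y = x
    for W, b in layers:
        y = sigmoid(W @ y + b)  # linear part, then the transfer function
    return y

# toy net: 3 inputs -> 4 hidden neurons -> 1 output (one hidden layer)
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(1, 4)), np.zeros(1))]
print(forward(np.array([1.0, 0.0, -1.0]), layers))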
Error Back-Propagation
Learning task:
Given: training data
Find: all weights and biases of the net.
Error Back-Propagation is a gradient descent method for Feed-Forward networks with Sigmoid-neurons.
First, we need an objective (error to be minimized)
Now: differentiate, build the gradient and go.
Error Back-Propagation
We start from a single neuron and just one example.
Note: the transfer function is a “sigmoid” one (not the “step”)
Derivation according to the chain rule:
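A worked version of this step, assuming the usual squared error and using σ'(a) = σ(a)(1 − σ(a)) (the notation is mine; the lecture's symbols may differ):

$$E = \tfrac{1}{2}(y - y^{*})^{2}, \qquad y = \sigma(a), \qquad a = \langle x, w \rangle + b$$

$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial y}\cdot\frac{\partial y}{\partial a}\cdot\frac{\partial a}{\partial w_j} = \underbrace{(y - y^{*})\, y\, (1 - y)}_{\text{the error } \delta}\; x_j, \qquad \frac{\partial E}{\partial b} = \delta$$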
Error Back-Propagation
The “problem”: for intermediate neurons the errors are not known!
Now a bit more complex:
with:
Error Back-Propagation
In general: compute the “errors” $\delta$ at the i-th level from all $\delta$-s at the (i+1)-th level – propagate the error.
The Algorithm (for just one example):
1. Forward: compute all activations and outputs (apply the network), compute the output error;
2. Backward: compute errors in the intermediate levels:
3. Compute the gradient and go.
For many examples – just sum up the per-example gradients.
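Putting the three steps together for one hidden layer; a numpy sketch (the squared error and all names are my assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, b1, W2, b2, lr=0.1):
    # 1. Forward: compute all intermediate and final outputs
    h = sigmoid(W1 @ x + b1)           # hidden level
    y = sigmoid(W2 @ h + b2)           # output level
    d2 = (y - t) * y * (1.0 - y)       # output "error" for E = ||y - t||^2 / 2
    # 2. Backward: propagate the error to the hidden level
    d1 = (W2.T @ d2) * h * (1.0 - h)
    # 3. Gradient step
    W2 -= lr * np.outer(d2, h); b2 -= lr * d2
    W1 -= lr * np.outer(d1, x); b1 -= lr * d1
    return W1, b1, W2, b2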
Time Delay Neural Networks (TDNN)
Feed-Forward network of a particular architecture.
Many equivalent “parts” (i.e. of the same structure with the same weights), but with different Receptive Fields. The output level of each part gives information about the signal in the corresponding receptive field – the computation of local features.
Problem: during the Error Back-Propagation the equivalence gets lost, since each copy of the weights receives its own gradient.
Solution: average the gradients over the equivalent parts, as sketched below.
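A minimal sketch of that averaging step, assuming the shared weights appear in several equivalent parts (names illustrative):

def average_shared_gradient(per_part_grads):
    # per_part_grads: one gradient array per equivalent part;
    # averaging keeps all tied copies of the weights identical
    return sum(per_part_grads) / len(per_part_grads)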
Convolutional Networks
Local features – convolutions with a set of predefined/learned masks (convolution kernels)
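A naive sketch of such a local-feature computation (strictly speaking a cross-correlation; the edge mask is just an illustrative choice):

import numpy as np

def conv2d_valid(image, kernel):
    # slide the mask over the image; every output value is a local feature
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# e.g. a simple vertical-edge mask
edge_mask = np.array([[1.0, 0.0, -1.0]] * 3)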
Convolutional Networks
Yann LeCun, Koray Kavukcuoglu and Clement Farabet: "Convolutional Networks and Applications in Vision"
Hopfield Networks
There is a symmetric neighborhood relation (e.g. a grid).
The output of each neuron serves as input for the neighboring ones, with symmetric weights, i.e. $w_{rr'} = w_{r'r}$.
A network configuration is a mapping that assigns an output $y_r \in \{0, 1\}$ to each neuron $r$.
A configuration is stable if “outputs do not contradict”
The Energy of a configuration is
$$E(y) = \sum_{r,r'} w_{rr'}\, y_r\, y_{r'} + \sum_r b_r\, y_r$$
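Evaluating this energy directly is straightforward; a small sketch, assuming one entry per unordered neighbor pair (my data layout, not the lecture's):

def energy(y, weights, biases):
    # y: dict neuron -> output in {0, 1}
    # weights: dict (r, r2) -> w, one entry per unordered neighbor pair
    e = sum(w * y[r] * y[r2] for (r, r2), w in weights.items())
    e += sum(b * y[r] for r, b in biases.items())
    return e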
Hopfield Networks
Network dynamic:
1. Start with an arbitrary configuration,
2. Decide for each neuron whether it should be activated or not according to
$$y_r = \begin{cases} 1 & \text{if } \sum_{r'} w_{rr'}\, y_{r'} + b_r > 0 \\ 0 & \text{otherwise} \end{cases}$$
Do it sequentially for all neurons until convergence, i.e. apply the changes immediately.
In doing so the energy can only increase!
Attention!!! It does not work with the parallel dynamic (seminar).
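A sketch of the sequential dynamic (assuming each symmetric weight is stored under both orderings for easy lookup); note that every change takes effect immediately:

def sequential_dynamic(y, neighbors, weights, biases):
    # y: dict neuron -> output in {0, 1}; weights[(r, r2)] == weights[(r2, r)]
    changed = True
    while changed:
        changed = False
        for r in list(y):
            s = sum(weights[(r, r2)] * y[r2] for r2 in neighbors[r]) + biases[r]
            new = 1 if s > 0 else 0
            if new != y[r]:
                y[r] = new  # apply immediately (sequential, not parallel)
                changed = True
    return y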
Hopfield Networks
During the sequential dynamic the energy can only increase!
Proof:
Consider the part of the energy that depends on a particular neuron $r$:
$$E_r(y) = y_r \Big( \sum_{r'} w_{rr'}\, y_{r'} + b_r \Big)$$
After the decision the energy difference is
$$\Delta E = \big( y_r^{\text{new}} - y_r^{\text{old}} \big) \Big( \sum_{r'} w_{rr'}\, y_{r'} + b_r \Big)$$
If $\sum_{r'} w_{rr'}\, y_{r'} + b_r > 0$, the new output is set to 1 → the energy cannot decrease.
If $\sum_{r'} w_{rr'}\, y_{r'} + b_r \le 0$, the new output is set to 0 → the energy cannot decrease either.
Hopfield Networks
The network dynamic is the simplest method to find a configuration of maximal energy (also known as "Iterated Conditional Modes").
The network dynamic is not globally optimal: it stops at a stable configuration, i.e. a local maximum of the Energy.
The most stable configuration – global maximum.
The task (find the global maximum) is NP-complete in general.
Polynomially solvable special cases:
1. The neighborhood structure is simple – e.g. a tree
2. All weights are non-negative (supermodular energies).
Of course, nowadays there are many good approximation methods.
Hopfield Networks
Hopfield Network with an external input $x$:
The energy is then a function $E(x, y)$ of both the input and the configuration.
Hopfield Networks implement mappings $x \to y$ according to the principle of Energy maximum:
$$y^*(x) = \arg\max_y E(x, y)$$
Note: no single output but a configuration – structured output.
Hopfield Networks
Hopfield Networks model patterns – network configurations of optimal energy.
Example:
Let $y$ be a network configuration and $c(y)$ the number of "cracks" – pairs of neighboring neurons with different outputs.
Design a network (weights and biases for each neuron) so that the energy of a configuration is proportional to the number of cracks, i.e. $E(y) = -c(y)$.
Hopfield Networks
Solution: $w_{rr'} = 2$ for neighboring pairs and $b_r = -4$ (up to the borders); see the derivation below.
Further examples at the seminar.
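The derivation behind this solution (my reconstruction, assuming binary outputs, a 4-neighborhood grid, and the energy summed once per neighbor pair):

$$[\![\, y_r \neq y_{r'} \,]\!] = y_r + y_{r'} - 2\, y_r y_{r'} \quad \text{for } y_r, y_{r'} \in \{0, 1\}$$

$$-c(y) = \sum_{(r,r')} \big( 2\, y_r y_{r'} - y_r - y_{r'} \big) = \sum_{(r,r')} 2\, y_r y_{r'} - \sum_r \deg(r)\, y_r$$

so $w_{rr'} = 2$ and $b_r = -\deg(r)$ (i.e. $-4$ for interior neurons of the grid) gives exactly $E(y) = -c(y)$.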
Stochastic extensions
“Usual” neurons represent deterministic mappings
$$y = \operatorname{sign}\big(\langle x, w \rangle + b\big)$$
A stochastic neuron (with the sigmoid transfer function) represents the posterior probability distribution of the output given the input,
$$p(y{=}1 \mid x) = \frac{1}{1 + \exp\big(-\langle x, w \rangle - b\big)}$$
i.e. logistic regression. The output is not computed deterministically but "sampled" according to this probability distribution.
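A one-line sampler for such a neuron (a sketch; x, w, b are placeholders):

import numpy as np

def sample_neuron(x, w, b, rng):
    p = 1.0 / (1.0 + np.exp(-(np.dot(x, w) + b)))  # p(y=1 | x)
    return 1 if rng.random() < p else 0            # sample, do not threshold

# y = sample_neuron(x, w, b, np.random.default_rng(0))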
Stochastic extensions
An (arbitrary) neural network of stochastic neurons is called a Boltzmann Machine; the corresponding probability distribution is called a Boltzmann distribution.
A Restricted Boltzmann Machine is a network that is not fully connected. Example: the combined "classifier" from the AdaBoost lecture, but with stochastic neurons.
A Deep Boltzmann Machine is a hierarchical restricted one — a Feed-Forward network with stochastic neurons. It is used to model/learn very complex posteriors.
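In the classical bipartite variant of the restricted case (one visible and one hidden layer, no connections within a layer), the conditionals factorize over the layers, which makes block-Gibbs sampling simple; a sketch with assumed shapes and names:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_step(v, W, b, c, rng):
    # v: visible outputs in {0, 1}; W couples visible and hidden layers
    p_h = sigmoid(W @ v + c)                 # p(h_j = 1 | v), independent per j
    h = (rng.random(p_h.shape) < p_h) * 1
    p_v = sigmoid(W.T @ h + b)               # p(v_i = 1 | h), independent per i
    v = (rng.random(p_v.shape) < p_v) * 1
    return v, h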
See the papers (books, lecture notes, video lectures, etc.) by Geoffrey Hinton.
Stochastic extensions
Hopfield networks of stochastic neurons represent Gibbs probability distributions (aka Markov Random Fields — MRF); remember the structured output.
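Sampling from such a Gibbs distribution (written out below) looks like the sequential dynamic, but with a stochastic decision per neuron; a sketch reusing the Hopfield data layout from above:

import numpy as np

def gibbs_sweep(y, neighbors, weights, biases, rng):
    # one sweep of Gibbs sampling over all neurons of the field
    for r in list(y):
        s = sum(weights[(r, r2)] * y[r2] for r2 in neighbors[r]) + biases[r]
        p = 1.0 / (1.0 + np.exp(-s))  # p(y_r = 1 | all other outputs)
        y[r] = 1 if rng.random() < p else 0
    return y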
The corresponding probability distribution is
$$p(y) \propto \exp\big(E(y)\big) = \exp\Big[ \sum_{r,r'} w_{rr'}\, y_r\, y_{r'} + \sum_r b_r\, y_r \Big]$$
Looking forward to seeing you at "Machine Learning II" next semester :-)