Neural Networks
Stefan Edelkamp
1 Overview
- Introduction
- Perceptron
- Hopfield Nets
- Self-Organizing Maps
- Feed-Forward Neural Networks
- Backpropagation
2 Introduction
Idea: Mimic principle of biological neural networks with artificial neural networks
- adapt proven solutions from nature
- parallelization ⇒ high performance
- redundancy ⇒ fault tolerance
Ingredients
What an artificial neural network needs:
• behavior of the artificial neurons
• order of computation
• activation function
• structure of the net (topology)
• recurrent nets
• feed-forward nets
• integration in environment
• learning algorithm
Perceptron Learning
. . . a very simple network with no hidden neurons
Inputs: x, weighted with w and summed
Activation Function: Θ
Output: z, determined by computing Θ(w^T x)
Additionally: a weighted input representing the constant 1
Training
Net function f : M ⊂ IR^d → {0,1}
1. initialize the counter i and the initial weight vector w_0 to 0
2. as long as there is a vector x with w_i^T x ≤ 0, set w_{i+1} := w_i + x and increase i by 1
3. return w_i (see the code sketch below)
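The procedure above can be written down directly. The following is a minimal sketch (not from the lecture notes), assuming the training vectors are already augmented with the constant 1 and multiplied by −1 for class-0 examples, so that a correctly classified vector x satisfies w^T x > 0; the OR data set and the iteration cap are illustrative choices.

```python
import numpy as np

def perceptron_learning(vectors, max_iter=10_000):
    """Perceptron learning as in steps 1-3 above.

    `vectors` is assumed to be preprocessed: each input is augmented with a
    constant 1 and multiplied by -1 if its target class is 0, so that a
    correctly classified vector x satisfies w^T x > 0.
    """
    w = np.zeros(vectors.shape[1])          # step 1: w_0 = 0
    for _ in range(max_iter):               # safeguard against non-separable data
        misclassified = [x for x in vectors if np.dot(w, x) <= 0]
        if not misclassified:               # step 2: loop while some w^T x <= 0 exists
            return w                        # step 3: return the final weight vector
        w = w + misclassified[0]            # update: w_{i+1} = w_i + x
    raise RuntimeError("no separating weight vector found (data may not be separable)")

# Toy usage: learn the Boolean OR function (class-0 vector negated, constant 1 appended).
data = np.array([
    [-0.0, -0.0, -1.0],   # (0,0) -> 0, negated
    [ 0.0,  1.0,  1.0],   # (0,1) -> 1
    [ 1.0,  0.0,  1.0],   # (1,0) -> 1
    [ 1.0,  1.0,  1.0],   # (1,1) -> 1
])
print(perceptron_learning(data))
```

On separable data the loop terminates by the argument in the next paragraph; the iteration cap only guards against non-separable input.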
Termination on Training Data
Assume the training data to be separable and let w∗ be a separating weight vector, normalized so that ||w∗|| = 1
- f(x) = Θ((x,1)^T w∗); there are constants δ and γ with |(x,1)^T w∗| ≥ δ and ||(x,1)|| ≤ γ
- for the angle α_i between w_i and w∗ we have 1 ≥ cos α_i = w_i^T w∗ / ||w_i||
- w_{i+1}^T w∗ = (w_i + x_i)^T w∗ = w_i^T w∗ + x_i^T w∗ and x_i^T w∗ ≥ δ ⇒ w_{i+1}^T w∗ ≥ δ(i + 1)
- ||w_{i+1}|| = √((w_i + x_i)^T (w_i + x_i)) = √(||w_i||² + ||x_i||² + 2 w_i^T x_i) ≤ √(||w_i||² + γ²) ≤ γ √(i + 1)
  (induction: ||w_i|| ≤ γ √i; and w_i^T x_i ≤ 0 because x_i triggered an update)
⇒ cos α_{i+1} ≥ δ(i + 1)/(γ √(i + 1)) = δ √(i + 1)/γ, which would grow beyond every bound as i → ∞;
since cos α_{i+1} ≤ 1, the number of updates is bounded by (γ/δ)², so the algorithm terminates
3 Hopfield Nets
Neurons: 1, 2, . . . , d
Activations: x_1, x_2, . . . , x_d with x_i ∈ {0,1}
Connections: w_ij ∈ IR (1 ≤ i, j ≤ d) with w_ii = 0 and w_ij = w_ji ⇒ W := (w_ij)_{d×d}
Update (asynchronous and stochastic):
x'_j := 0 if Σ_{i=1}^d x_i w_ij < 0,   1 if Σ_{i=1}^d x_i w_ij > 0,   x_j otherwise
Example
Three neurons x_1, x_2, x_3 with weights w_12 = 1, w_13 = −2, w_23 = 3:
W =
(  0   1  −2 )
(  1   0   3 )
( −2   3   0 )
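A small sketch of the asynchronous, stochastic update rule on the example net above, with the {0,1} activations defined earlier; the `energy` helper, the number of steps, and the random seed are illustrative choices.

```python
import numpy as np

# Weight matrix of the three-neuron example above (w_ii = 0, w_ij = w_ji).
W = np.array([[ 0,  1, -2],
              [ 1,  0,  3],
              [-2,  3,  0]], dtype=float)

def energy(x, W):
    # E(x) = -1/2 x^T W x
    return -0.5 * x @ W @ x

def update(x, W, rng):
    """One asynchronous, stochastic step: pick a random neuron j and set it
    according to the sign of sum_i x_i w_ij."""
    x = x.copy()
    j = rng.integers(len(x))
    s = x @ W[:, j]
    if s < 0:
        x[j] = 0
    elif s > 0:
        x[j] = 1
    return x

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=3).astype(float)   # random initial activation
for _ in range(20):                            # by the theorem below, energy never increases
    x = update(x, W, rng)
print(x, energy(x, W))
```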
Use:
• associative memory
• computing Boolean functions
• combinatorial optimization
Energy of a Hopfield-Net
For x = (x_1, x_2, . . . , x_d)^T let E(x) := −½ x^T W x = −Σ_{i<j} x_i w_ij x_j be the energy of the Hopfield net
Theorem: Every update that changes the state of the Hopfield net reduces the energy.
Proof: Assume the update changes x_k into x'_k (and x'_j = x_j for j ≠ k) ⇒
E(x) − E(x') = −Σ_{i<j} x_i w_ij x_j + Σ_{i<j} x'_i w_ij x'_j
             = −Σ_{j≠k} x_k w_kj x_j + Σ_{j≠k} x'_k w_kj x_j      (all terms not involving k cancel)
             = (x'_k − x_k) Σ_{j≠k} w_kj x_j > 0
The last product is positive because the update changed x_k: either x_k = 0, x'_k = 1 and Σ_{j≠k} w_kj x_j > 0, or x_k = 1, x'_k = 0 and Σ_{j≠k} w_kj x_j < 0.
Solving a COP
Input: Combinatorial Optimization Problem (COP)
Output: Solution for the COP
Algorithm:
• construct a Hopfield net whose weights encode the parameters of the COP, so that solutions lie at minima of the energy
• start net with random activation
• compute a sequence of updates until the net stabilizes
• read off the solution from the stable activations
• test feasibility and optimality of solution
Multi-Flop Problem
Problem Instance: k, n ∈ IN, k < n
Feasible Solutions: x̃ = (x_1, . . . , x_n) ∈ {0,1}^n
Objective Function: P(x̃) = Σ_{i=1}^n x_i
Optimal Solution: a solution x̃ with P(x̃) = k
Minimization Problem: d = n + 1, x_d = 1, x = (x_1, x_2, . . . , x_n, x_d)^T ⇒
E(x) = (Σ_{i=1}^d x_i − (k + 1))²
     = Σ_{i=1}^d x_i² + Σ_{i≠j} x_i x_j − 2(k + 1) Σ_{i=1}^d x_i + (k + 1)²
     = Σ_{i≠j} x_i x_j − (2k + 1) Σ_{i=1}^{d−1} x_i x_d + k²      (using x_i² = x_i and x_d = 1)
     = −½ Σ_{i<j} x_i (−4) x_j − ½ Σ_{i<d} x_i (4k + 2) x_d + k²
Example (n = 3, k = 1): neurons x_1, x_2, x_3, x_4 (with x_4 ≡ 1), weights w_12 = w_13 = w_23 = −2 and w_14 = w_24 = w_34 = 1
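As a sketch of the COP recipe above: the weights can be read off from the last line of the derivation (w_ij = −2 for distinct i, j ≤ n and w_id = 2k − 1 for the extra neuron x_d, matching the example for n = 3, k = 1), and the extra neuron is clamped to 1. The helper names and the test instance (n = 6, k = 2) are illustrative, not part of the lecture notes.

```python
import numpy as np

def multiflop_weights(n, k):
    """Hopfield weights for the multi-flop problem, read off from the
    derivation above: w_ij = -2 for distinct i, j <= n and w_id = 2k - 1
    for the extra neuron x_d that is clamped to 1 (d = n + 1)."""
    d = n + 1
    W = np.full((d, d), -2.0)
    W[:, -1] = W[-1, :] = 2 * k - 1
    np.fill_diagonal(W, 0.0)
    return W

def run(n, k, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    W = multiflop_weights(n, k)
    x = np.append(rng.integers(0, 2, size=n), 1.0)   # x_d is fixed to 1
    for _ in range(steps):
        j = rng.integers(n)                           # never update the clamped neuron
        s = x @ W[:, j]
        if s < 0:
            x[j] = 0.0
        elif s > 0:
            x[j] = 1.0
    return x[:n]

x = run(n=6, k=2)
print(x, "number of ones:", int(x.sum()))             # a stable state has exactly k ones
```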
Traveling Salesperson-Problem (TSP)
Problem Instance:
Cities: 1 2 . . . n
Distances: d_ij ∈ IR+ (1 ≤ i, j ≤ n) with d_ii = 0
Feasible Solutions: permutations π of (1, 2, . . . , n)
Objective Function: P(π) = Σ_{i=1}^n d_{π(i), π(i mod n + 1)}
Optimal Solutions: feasible solution π with minimal P(π)
Encoding
Idea: Hopfield net with d = n² + 1 neurons:
[Figure: n × n grid of neurons, indexed by tour position i and city π(i), with negative weights −d_ij between neurons of neighboring positions and further excitatory (+) and inhibitory (−) connections]
Problem: the "size" of the weights must allow both feasible and good solutions
Trick: transition to a continuous Hopfield net with modified weights ⇒ good solutions for the TSP
4 Self-Organizing Maps (SOM)
Neurons:
Input: 1, 2, . . . , d for the components x_i
Map: 1, 2, . . . , m; a regular (linear, rectangular, or hexagonal) grid with positions r_i, storing pattern vectors µ_i ∈ IR^d
Output: 1, 2, . . . , d for µ_c
Update:
L ⊂ IR^d learning set; at time t ∈ IN+, x ∈ L is chosen at random ⇒ c ∈ {1, . . . , m} is determined with
||x − µ_c|| ≤ ||x − µ_i||   (∀i ∈ {1, . . . , m})
and the patterns are adapted: µ'_i := µ_i + h(c, i, t)(x − µ_i)   ∀i ∈ {1, . . . , m}
with h(c, i, t) a time-dependent neighborhood relation
and h(c, i, t) → 0 for t → ∞, e.g. h(c, i, t) = α(t) · exp(−||r_c − r_i||² / (2σ(t)²))
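A minimal sketch of this update rule for a linear (one-dimensional) grid; the schedules for α(t) and σ(t) and the triangle data set are illustrative choices, not taken from the lecture notes.

```python
import numpy as np

def train_som(L, m, steps=5000, seed=0):
    """Minimal sketch of the SOM update above for a one-dimensional (linear)
    grid of m neurons; grid positions r_i are just the indices 0..m-1."""
    rng = np.random.default_rng(seed)
    d = L.shape[1]
    mu = rng.random((m, d))                      # pattern vectors mu_i
    r = np.arange(m, dtype=float)                # grid positions r_i
    for t in range(1, steps + 1):
        x = L[rng.integers(len(L))]              # draw x from the learning set
        c = np.argmin(np.linalg.norm(x - mu, axis=1))   # best-matching neuron
        alpha = 0.5 * (1.0 - t / steps)          # learning rate, decays over time
        sigma = 1.0 + (m / 4.0) * (1.0 - t / steps)     # neighborhood width, decays
        h = alpha * np.exp(-(r - r[c]) ** 2 / (2.0 * sigma ** 2))
        mu += h[:, None] * (x - mu)              # mu_i' = mu_i + h(c,i,t)(x - mu_i)
    return mu

# Usage: a size-50 linear map adapting to points in a triangle.
rng = np.random.default_rng(1)
pts = rng.random((2000, 2))
L = pts[pts[:, 1] <= pts[:, 0]]                  # keep points below the diagonal
print(train_som(L, m=50)[:5])
```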
Application of SOM
. . . include visualization and interpretation, dimensionality reduction, clustering, classification, COPs, . . .
A size-50 map adapts to a triangle
A 15 × 15 grid adapts to a triangle
SOM for Combinatorial Optimization
∆-TSP
Idea: Use growing ring (elastic band) of neurons
Tests with n ≤ 2392 cities show that the running time scales linearly and that the tours deviate from the optimum by less than 9 %
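One possible reading of the elastic-band idea as code (not the original implementation): neurons are placed on a ring, the neighborhood distance is measured along the ring, and the tour is read off from the closest neuron per city. The ring here has a fixed size instead of growing, and all schedules and sizes are illustrative simplifications.

```python
import numpy as np

def som_tsp(cities, m=None, steps=20000, seed=0):
    """Sketch of a ring ("elastic band") SOM for the TSP: neurons live on a
    circular grid, so the neighborhood distance is taken along the ring."""
    rng = np.random.default_rng(seed)
    n = len(cities)
    m = m or 3 * n                                  # more neurons than cities
    mu = cities.mean(0) + 0.1 * rng.standard_normal((m, 2))
    for t in range(1, steps + 1):
        x = cities[rng.integers(n)]
        c = np.argmin(np.linalg.norm(x - mu, axis=1))
        ring = np.minimum(np.abs(np.arange(m) - c), m - np.abs(np.arange(m) - c))
        sigma = max(1.0, (m / 10.0) * (1.0 - t / steps))
        alpha = 0.8 * (1.0 - t / steps) + 0.01
        h = alpha * np.exp(-ring ** 2 / (2 * sigma ** 2))
        mu += h[:, None] * (x - mu)
    # read off the tour: visit cities in the order of their closest ring neuron
    closest = [np.argmin(np.linalg.norm(city - mu, axis=1)) for city in cities]
    return np.argsort(closest)

cities = np.random.default_rng(1).random((30, 2))
print(som_tsp(cities))
```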
SOM for Combinatorial Optimization
[Figure: intermediate tours with 10 neurons and with 50 neurons]
SOM for Combinatorial Optimization
Tour with 2526 neurons:
5 Layered Feed-Forward Nets (MLP)
Formalization
An L-layered MLP (multi-layer perceptron)
Layers: S_0, S_1, . . . , S_{L−1}, S_L
Connections: from each neuron i in S_ℓ to each neuron j in S_{ℓ+1} with weight w_ij, except for the constant-1 neurons
Update: layer-wise synchronous
x'_j := ϕ(Σ_{i∈V(j)} x_i w_ij)      (V(j): predecessors of j in the previous layer)
with ϕ differentiable, e.g. ϕ(a) = σ(a) = 1/(1 + exp(−a))
Layered Feed-Forward Nets
Applications: function approximation, classification
Theorem: All Boolean functions can be computed with a 2-layered MLP (no proof)
Theorem: continuous real functions and their derivatives can be jointly approximated to arbitrary precision on compact sets
(no proof)
Learning Parameters in MLP
Given: x_1, . . . , x_N ∈ IR^d and t_1, . . . , t_N ∈ IR^c, an MLP with d input and c output neurons;
w = (w_1, . . . , w_M) contains all weights, f(x, w) is the net function
Task: find the optimal w∗ that minimizes the error
E(w) := ½ Σ_{n=1}^N Σ_{k=1}^c (f_k(x_n, w) − t_{nk})²
The partial derivatives of f exist with respect to the inputs and the parameters
⇒ any gradient-based optimization method can be used (conjugate gradients, . . . )
∇_w E(w) = Σ_{n=1}^N Σ_{k=1}^c (f_k(x_n, w) − t_{nk}) ∇_w f_k(x_n, w)
Backpropagation
Basic Calculus:
∂/∂t f(g(t)) |_{t=t_0} = ( ∂/∂s f(s) |_{s=g(t_0)} ) · ( ∂/∂t g(t) |_{t=t_0} )
Example: ϕ(a) := 9 − a², x = (1, 2)^T, w = (1, 1)^T, t = 2:
[Computation graph: a = w_1 x_1 + w_2 x_2, f = ϕ(a), E = (f − t)²/2]
Sought: ∇_w E(w)|_{w=(1,1)^T}
Local derivatives:
h(x, y) = x · y ⇒ ∂/∂x h(x, y) = y
h(x, y) = x + y ⇒ ∂/∂x h(x, y) = 1
h(x, y) = x − y ⇒ ∂/∂x h(x, y) = 1
ϕ(x) = 9 − x² ⇒ ∂/∂x ϕ(x) = −2x
h(x) = x²/2 ⇒ ∂/∂x h(x) = x
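Carrying out the chain rule numerically for this example (a plain worked computation, using only the derivative rules listed above):

```python
# Forward and backward pass for the example above (phi(a) = 9 - a^2,
# x = (1,2)^T, w = (1,1)^T, t = 2), applying the chain rule step by step.
x1, x2 = 1.0, 2.0
w1, w2 = 1.0, 1.0
t = 2.0

# forward: store all intermediate values
a = w1 * x1 + w2 * x2        # a = 3
f = 9.0 - a ** 2             # f = phi(a) = 0
E = (f - t) ** 2 / 2.0       # E = 2

# backward: multiply local derivatives from the table above
dE_df = f - t                # d/df (f - t)^2 / 2 = f - t   -> -2
df_da = -2.0 * a             # d/da (9 - a^2)    = -2a      -> -6
dE_dw1 = dE_df * df_da * x1  # da/dw1 = x1                  -> 12
dE_dw2 = dE_df * df_da * x2  # da/dw2 = x2                  -> 24

print(dE_dw1, dE_dw2)        # gradient of E with respect to (w1, w2)
```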
Backpropagation
Theorem: ∇_w E(w) can be computed in time O(N × M) if the network is of size O(M)
Algorithm:
∀n ∈ {1, . . . , N}:
• compute the net function f(x_n, w) and the associated error E in the forward direction and store all intermediate values in the net
• compute the partial derivatives of E with respect to all intermediates in the backward direction and add up all parts for the total gradient (see the sketch below)
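A self-contained sketch of this algorithm for a 2-layered MLP with σ as activation function; the matrix shapes, the random data, and the finite-difference check are illustrative assumptions, not taken from the lecture notes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_gradient(X, T, W1, W2):
    """Forward and backward pass for a 2-layered MLP with sigmoid units.
    X: N x d inputs (a constant-1 column is appended), T: N x c targets,
    W1/W2: weight matrices of the two layers.
    Returns E(w) and its gradient with respect to W1 and W2."""
    # forward direction: compute the net function and store intermediate values
    X1 = np.hstack([X, np.ones((len(X), 1))])       # append the constant-1 neuron
    H = sigmoid(X1 @ W1)                            # hidden-layer activations
    H1 = np.hstack([H, np.ones((len(H), 1))])       # constant-1 neuron again
    F = sigmoid(H1 @ W2)                            # net function f(x_n, w)
    E = 0.5 * np.sum((F - T) ** 2)                  # squared error over all n, k

    # backward direction: propagate partial derivatives layer by layer
    dF = (F - T) * F * (1 - F)                      # dE/da at the output (sigma' = s(1-s))
    gW2 = H1.T @ dF                                 # gradient for the second layer
    dH = (dF @ W2[:-1].T) * H * (1 - H)             # back through W2, skipping the bias row
    gW1 = X1.T @ dH                                 # gradient for the first layer
    return E, gW1, gW2

# Tiny usage example with random data and a finite-difference check of one weight.
rng = np.random.default_rng(0)
X, T = rng.random((5, 3)), rng.random((5, 2))
W1, W2 = rng.standard_normal((4, 4)), rng.standard_normal((5, 2))
E, gW1, gW2 = mlp_gradient(X, T, W1, W2)
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
print(gW1[0, 0], (mlp_gradient(X, T, W1p, W2)[0] - E) / eps)   # should be close
```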