Neural Networks
Stefan Edelkamp
1 Overview
- Introduction
- Perceptron
- Hopfield Nets
- Self-Organizing Maps
- Feed-Forward Neural Networks
- Backpropagation
2 Introduction
Idea: Mimic principle of biological neural networks with artificial neural networks
- adapt proven solutions from nature
- parallelization ⇒ high performance
- redundancy ⇒ fault tolerance
Ingredients
What an artificial neural network needs:
• behavior of the artificial neurons
• order of computation
• activation function
• structure of the net (topology)
• recurrent nets
• feed-forward nets
• integration in environment
• learning algorithm
Perceptron Learning
. . . a very simple network with no hidden neurons
Inputs: x, weighted with w and summed
Activation Function: Θ
Output: z, determined by computing Θ(w^T x)
Additionally: a weighted input representing the constant 1
Training
Net function f : M ⊂ IR^d → {0,1}
1. initialize the counter i and the initial weight vector w_0 to 0
2. as long as there is a vector x with w_i^T x ≤ 0, set w_{i+1} := w_i + x and increase i by 1
3. return w_i (see the code sketch below)
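The procedure above can be written down directly. The following is a minimal sketch (not from the lecture notes), assuming the training vectors are already augmented with the constant 1 and multiplied by −1 for class-0 examples, so that a correctly classified vector x satisfies w^T x > 0; the OR data set and the iteration cap are illustrative choices.

```python
import numpy as np

def perceptron_learning(vectors, max_iter=10_000):
    """Perceptron learning as in steps 1-3 above.

    `vectors` is assumed to be preprocessed: each input is augmented with a
    constant 1 and multiplied by -1 if its target class is 0, so that a
    correctly classified vector x satisfies w^T x > 0.
    """
    w = np.zeros(vectors.shape[1])          # step 1: w_0 = 0
    for _ in range(max_iter):               # safeguard against non-separable data
        misclassified = [x for x in vectors if np.dot(w, x) <= 0]
        if not misclassified:               # step 2: loop while some w^T x <= 0 exists
            return w                        # step 3: return the final weight vector
        w = w + misclassified[0]            # update: w_{i+1} = w_i + x
    raise RuntimeError("no separating weight vector found (data may not be separable)")

# Toy usage: learn the Boolean OR function (class-0 vector negated, constant 1 appended).
data = np.array([
    [-0.0, -0.0, -1.0],   # (0,0) -> 0, negated
    [ 0.0,  1.0,  1.0],   # (0,1) -> 1
    [ 1.0,  0.0,  1.0],   # (1,0) -> 1
    [ 1.0,  1.0,  1.0],   # (1,1) -> 1
])
print(perceptron_learning(data))
```

On separable data the loop terminates by the argument in the next paragraph; the iteration cap only guards against non-separable input.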
Termination on Training Data
Assume the training data to be separable and let w∗ be a separating weight vector, normalized so that ||w∗|| = 1
- f(x) = Θ((x,1)^T w∗); there are constants δ and γ with |(x,1)^T w∗| ≥ δ and ||(x,1)|| ≤ γ
- for the angle α_i between w_i and w∗ we have 1 ≥ cos α_i = w_i^T w∗ / ||w_i||
- w_{i+1}^T w∗ = (w_i + x_i)^T w∗ = w_i^T w∗ + x_i^T w∗ and x_i^T w∗ ≥ δ ⇒ w_{i+1}^T w∗ ≥ δ(i + 1)
- ||w_{i+1}|| = √((w_i + x_i)^T (w_i + x_i)) = √(||w_i||² + ||x_i||² + 2 w_i^T x_i) ≤ √(||w_i||² + γ²) ≤ γ √(i + 1)
  (induction: ||w_i|| ≤ γ √i; and w_i^T x_i ≤ 0 because x_i triggered an update)
⇒ cos α_{i+1} ≥ δ(i + 1)/(γ √(i + 1)) = δ √(i + 1)/γ, which would grow beyond every bound as i → ∞;
since cos α_{i+1} ≤ 1, the number of updates is bounded by (γ/δ)², so the algorithm terminates
3 Hopfield Nets
Neurons: 1, 2, . . . , d
Activations: x_1, x_2, . . . , x_d with x_i ∈ {0,1}
Connections: w_ij ∈ IR (1 ≤ i, j ≤ d) with w_ii = 0 and w_ij = w_ji ⇒ W := (w_ij)_{d×d}
Update (asynchronous and stochastic):
x'_j := 0 if Σ_{i=1}^d x_i w_ij < 0,   1 if Σ_{i=1}^d x_i w_ij > 0,   x_j otherwise
Example
Three neurons x_1, x_2, x_3 with weights w_12 = 1, w_13 = −2, w_23 = 3:
W =
(  0   1  −2 )
(  1   0   3 )
( −2   3   0 )
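A small sketch of the asynchronous, stochastic update rule on the example net above, with the {0,1} activations defined earlier; the `energy` helper, the number of steps, and the random seed are illustrative choices.

```python
import numpy as np

# Weight matrix of the three-neuron example above (w_ii = 0, w_ij = w_ji).
W = np.array([[ 0,  1, -2],
              [ 1,  0,  3],
              [-2,  3,  0]], dtype=float)

def energy(x, W):
    # E(x) = -1/2 x^T W x
    return -0.5 * x @ W @ x

def update(x, W, rng):
    """One asynchronous, stochastic step: pick a random neuron j and set it
    according to the sign of sum_i x_i w_ij."""
    x = x.copy()
    j = rng.integers(len(x))
    s = x @ W[:, j]
    if s < 0:
        x[j] = 0
    elif s > 0:
        x[j] = 1
    return x

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=3).astype(float)   # random initial activation
for _ in range(20):                            # by the theorem below, energy never increases
    x = update(x, W, rng)
print(x, energy(x, W))
```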
Use:
• associative memory
• computing Boolean functions
• combinatorial optimization
Energy of a Hopfield-Net
For x = (x_1, x_2, . . . , x_d)^T let E(x) := −½ x^T W x = −Σ_{i<j} x_i w_ij x_j be the energy of the Hopfield net
Theorem: Every update that changes the state of the Hopfield net reduces the energy.
Proof: Assume the update changes x_k into x'_k (and x'_j = x_j for j ≠ k) ⇒
E(x) − E(x') = −Σ_{i<j} x_i w_ij x_j + Σ_{i<j} x'_i w_ij x'_j
             = −Σ_{j≠k} x_k w_kj x_j + Σ_{j≠k} x'_k w_kj x_j      (all terms not involving k cancel)
             = (x'_k − x_k) Σ_{j≠k} w_kj x_j > 0
The last product is positive because the update changed x_k: either x_k = 0, x'_k = 1 and Σ_{j≠k} w_kj x_j > 0, or x_k = 1, x'_k = 0 and Σ_{j≠k} w_kj x_j < 0.
Solving a COP
Input: Combinatorial Optimization Problem (COP)
Output: Solution for the COP
Algorithm:
• construct a Hopfield net whose weights encode the parameters of the COP, so that solutions lie at minima of the energy
• start net with random activation
• compute a sequence of updates until the net stabilizes
• read off the solution from the stable activations
• test feasibility and optimality of solution
Multi-Flop Problem
Problem Instance: k, n ∈ IN, k < n
Feasible Solutions: x̃ = (x_1, . . . , x_n) ∈ {0,1}^n
Objective Function: P(x̃) = Σ_{i=1}^n x_i
Optimal Solution: a solution x̃ with P(x̃) = k
Minimization Problem: d = n + 1, x_d = 1, x = (x_1, x_2, . . . , x_n, x_d)^T ⇒
E(x) = (Σ_{i=1}^d x_i − (k + 1))²
     = Σ_{i=1}^d x_i² + Σ_{i≠j} x_i x_j − 2(k + 1) Σ_{i=1}^d x_i + (k + 1)²
     = Σ_{i≠j} x_i x_j − (2k + 1) Σ_{i=1}^{d−1} x_i x_d + k²      (using x_i² = x_i and x_d = 1)
     = −½ Σ_{i<j} x_i (−4) x_j − ½ Σ_{i<d} x_i (4k + 2) x_d + k²
Example (n = 3, k = 1): neurons x_1, x_2, x_3, x_4 (with x_4 ≡ 1), weights w_12 = w_13 = w_23 = −2 and w_14 = w_24 = w_34 = 1
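As a sketch of the COP recipe above: the weights can be read off from the last line of the derivation (w_ij = −2 for distinct i, j ≤ n and w_id = 2k − 1 for the extra neuron x_d, matching the example for n = 3, k = 1), and the extra neuron is clamped to 1. The helper names and the test instance (n = 6, k = 2) are illustrative, not part of the lecture notes.

```python
import numpy as np

def multiflop_weights(n, k):
    """Hopfield weights for the multi-flop problem, read off from the
    derivation above: w_ij = -2 for distinct i, j <= n and w_id = 2k - 1
    for the extra neuron x_d that is clamped to 1 (d = n + 1)."""
    d = n + 1
    W = np.full((d, d), -2.0)
    W[:, -1] = W[-1, :] = 2 * k - 1
    np.fill_diagonal(W, 0.0)
    return W

def run(n, k, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    W = multiflop_weights(n, k)
    x = np.append(rng.integers(0, 2, size=n), 1.0)   # x_d is fixed to 1
    for _ in range(steps):
        j = rng.integers(n)                           # never update the clamped neuron
        s = x @ W[:, j]
        if s < 0:
            x[j] = 0.0
        elif s > 0:
            x[j] = 1.0
    return x[:n]

x = run(n=6, k=2)
print(x, "number of ones:", int(x.sum()))             # a stable state has exactly k ones
```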
Traveling Salesperson-Problem (TSP)
Problem Instance:
Cities: 1 2 . . . n
Distances: d_ij ∈ IR+ (1 ≤ i, j ≤ n) with d_ii = 0
Feasible Solutions: permutations π of (1, 2, . . . , n)
Objective Function: P(π) = Σ_{i=1}^n d_{π(i), π(i mod n + 1)}
Optimal Solutions: feasible solution π with minimal P(π)
Encoding
Idea: Hopfield net with d = n² + 1 neurons:
[Figure: n × n grid of neurons, indexed by tour position i and city π(i), with negative weights −d_ij between neurons of neighboring positions and further excitatory (+) and inhibitory (−) connections]
Problem: the "size" of the weights must allow both feasible and good solutions
Trick: transition to a continuous Hopfield net with modified weights ⇒ good solutions for the TSP
4 Self-Organizing Maps (SOM)
Neurons:
Input: 1, 2, . . . , d for the components x_i
Map: 1, 2, . . . , m; a regular (linear, rectangular, or hexagonal) grid with positions r_i, storing pattern vectors µ_i ∈ IR^d
Output: 1, 2, . . . , d for µ_c
Update:
L ⊂ IR^d learning set; at time t ∈ IN+, x ∈ L is chosen at random ⇒ c ∈ {1, . . . , m} is determined with
||x − µ_c|| ≤ ||x − µ_i||   (∀i ∈ {1, . . . , m})
and the patterns are adapted: µ'_i := µ_i + h(c, i, t)(x − µ_i)   ∀i ∈ {1, . . . , m}
with h(c, i, t) a time-dependent neighborhood relation
and h(c, i, t) → 0 for t → ∞, e.g. h(c, i, t) = α(t) · exp(−||r_c − r_i||² / (2σ(t)²))
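A minimal sketch of this update rule for a linear (one-dimensional) grid; the schedules for α(t) and σ(t) and the triangle data set are illustrative choices, not taken from the lecture notes.

```python
import numpy as np

def train_som(L, m, steps=5000, seed=0):
    """Minimal sketch of the SOM update above for a one-dimensional (linear)
    grid of m neurons; grid positions r_i are just the indices 0..m-1."""
    rng = np.random.default_rng(seed)
    d = L.shape[1]
    mu = rng.random((m, d))                      # pattern vectors mu_i
    r = np.arange(m, dtype=float)                # grid positions r_i
    for t in range(1, steps + 1):
        x = L[rng.integers(len(L))]              # draw x from the learning set
        c = np.argmin(np.linalg.norm(x - mu, axis=1))   # best-matching neuron
        alpha = 0.5 * (1.0 - t / steps)          # learning rate, decays over time
        sigma = 1.0 + (m / 4.0) * (1.0 - t / steps)     # neighborhood width, decays
        h = alpha * np.exp(-(r - r[c]) ** 2 / (2.0 * sigma ** 2))
        mu += h[:, None] * (x - mu)              # mu_i' = mu_i + h(c,i,t)(x - mu_i)
    return mu

# Usage: a size-50 linear map adapting to points in a triangle.
rng = np.random.default_rng(1)
pts = rng.random((2000, 2))
L = pts[pts[:, 1] <= pts[:, 0]]                  # keep points below the diagonal
print(train_som(L, m=50)[:5])
```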
Application of SOM
. . . include visualization and interpretation, dimensionality reduction, clustering, classification, COPs, . . .
A size-50 map adapts to a triangle
A 15 × 15 grid adapts to a triangle
SOM for Combinatorial Optimization
∆-TSP
Idea: Use growing ring (elastic band) of neurons
Tests with n ≤ 2392 cities show that the running time scales linearly and that the tours deviate from the optimum by less than 9 %
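One possible reading of the elastic-band idea as code (not the original implementation): neurons are placed on a ring, the neighborhood distance is measured along the ring, and the tour is read off from the closest neuron per city. The ring here has a fixed size instead of growing, and all schedules and sizes are illustrative simplifications.

```python
import numpy as np

def som_tsp(cities, m=None, steps=20000, seed=0):
    """Sketch of a ring ("elastic band") SOM for the TSP: neurons live on a
    circular grid, so the neighborhood distance is taken along the ring."""
    rng = np.random.default_rng(seed)
    n = len(cities)
    m = m or 3 * n                                  # more neurons than cities
    mu = cities.mean(0) + 0.1 * rng.standard_normal((m, 2))
    for t in range(1, steps + 1):
        x = cities[rng.integers(n)]
        c = np.argmin(np.linalg.norm(x - mu, axis=1))
        ring = np.minimum(np.abs(np.arange(m) - c), m - np.abs(np.arange(m) - c))
        sigma = max(1.0, (m / 10.0) * (1.0 - t / steps))
        alpha = 0.8 * (1.0 - t / steps) + 0.01
        h = alpha * np.exp(-ring ** 2 / (2 * sigma ** 2))
        mu += h[:, None] * (x - mu)
    # read off the tour: visit cities in the order of their closest ring neuron
    closest = [np.argmin(np.linalg.norm(city - mu, axis=1)) for city in cities]
    return np.argsort(closest)

cities = np.random.default_rng(1).random((30, 2))
print(som_tsp(cities))
```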
SOM for Combinatorial Optimization
[Figure: intermediate tours with 10 neurons and with 50 neurons]
SOM for Combinatorial Optimization
Tour with 2526 neurons:
5 Layered Feed-Forward Nets (MLP)
Formalization
An L-layered MLP (multi-layer perceptron)
Layers: S_0, S_1, . . . , S_{L−1}, S_L
Connections: from each neuron i in S_ℓ to each neuron j in S_{ℓ+1} with weight w_ij, except for the constant-1 neurons
Update: layer-wise synchronous
x'_j := ϕ(Σ_{i∈V(j)} x_i w_ij)      (V(j): predecessors of j in the previous layer)
with ϕ differentiable, e.g. ϕ(a) = σ(a) = 1/(1 + exp(−a))
Layered Feed-Forward Nets
Applications: function approximation, classification
Theorem: All Boolean functions can be computed with a 2-layered MLP (no proof)
Theorem: continuous real functions and their derivatives can be jointly approximated to arbitrary precision on compact sets
(no proof)
Learning Parameters in MLP
Given: x_1, . . . , x_N ∈ IR^d and t_1, . . . , t_N ∈ IR^c, an MLP with d input and c output neurons;
w = (w_1, . . . , w_M) contains all weights, f(x, w) is the net function
Task: find the optimal w∗ that minimizes the error
E(w) := ½ Σ_{n=1}^N Σ_{k=1}^c (f_k(x_n, w) − t_{nk})²
The partial derivatives of f exist with respect to the inputs and the parameters
⇒ any gradient-based optimization method can be used (conjugate gradients, . . . )
∇_w E(w) = Σ_{n=1}^N Σ_{k=1}^c (f_k(x_n, w) − t_{nk}) ∇_w f_k(x_n, w)
Backpropagation
Basic Calculus:
∂/∂t f(g(t)) |_{t=t_0} = ( ∂/∂s f(s) |_{s=g(t_0)} ) · ( ∂/∂t g(t) |_{t=t_0} )
Example: ϕ(a) := 9 − a², x = (1, 2)^T, w = (1, 1)^T, t = 2:
[Computation graph: a = w_1 x_1 + w_2 x_2, f = ϕ(a), E = (f − t)²/2]
Sought: ∇_w E(w)|_{w=(1,1)^T}
Local derivatives:
h(x, y) = x · y ⇒ ∂/∂x h(x, y) = y
h(x, y) = x + y ⇒ ∂/∂x h(x, y) = 1
h(x, y) = x − y ⇒ ∂/∂x h(x, y) = 1
ϕ(x) = 9 − x² ⇒ ∂/∂x ϕ(x) = −2x
h(x) = x²/2 ⇒ ∂/∂x h(x) = x
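Carrying out the chain rule numerically for this example (a plain worked computation, using only the derivative rules listed above):

```python
# Forward and backward pass for the example above (phi(a) = 9 - a^2,
# x = (1,2)^T, w = (1,1)^T, t = 2), applying the chain rule step by step.
x1, x2 = 1.0, 2.0
w1, w2 = 1.0, 1.0
t = 2.0

# forward: store all intermediate values
a = w1 * x1 + w2 * x2        # a = 3
f = 9.0 - a ** 2             # f = phi(a) = 0
E = (f - t) ** 2 / 2.0       # E = 2

# backward: multiply local derivatives from the table above
dE_df = f - t                # d/df (f - t)^2 / 2 = f - t   -> -2
df_da = -2.0 * a             # d/da (9 - a^2)    = -2a      -> -6
dE_dw1 = dE_df * df_da * x1  # da/dw1 = x1                  -> 12
dE_dw2 = dE_df * df_da * x2  # da/dw2 = x2                  -> 24

print(dE_dw1, dE_dw2)        # gradient of E with respect to (w1, w2)
```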
Backpropagation
Theorem: ∇_w E(w) can be computed in time O(N × M) if the network is of size O(M)
Algorithm:
∀n ∈ {1, . . . , N}:
• compute the net function f(x_n, w) and the associated error E in the forward direction and store all intermediate values in the net
• compute the partial derivatives of E with respect to all intermediates in the backward direction and add up all parts for the total gradient (see the sketch below)
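A self-contained sketch of this algorithm for a 2-layered MLP with σ as activation function; the matrix shapes, the random data, and the finite-difference check are illustrative assumptions, not taken from the lecture notes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_gradient(X, T, W1, W2):
    """Forward and backward pass for a 2-layered MLP with sigmoid units.
    X: N x d inputs (a constant-1 column is appended), T: N x c targets,
    W1/W2: weight matrices of the two layers.
    Returns E(w) and its gradient with respect to W1 and W2."""
    # forward direction: compute the net function and store intermediate values
    X1 = np.hstack([X, np.ones((len(X), 1))])       # append the constant-1 neuron
    H = sigmoid(X1 @ W1)                            # hidden-layer activations
    H1 = np.hstack([H, np.ones((len(H), 1))])       # constant-1 neuron again
    F = sigmoid(H1 @ W2)                            # net function f(x_n, w)
    E = 0.5 * np.sum((F - T) ** 2)                  # squared error over all n, k

    # backward direction: propagate partial derivatives layer by layer
    dF = (F - T) * F * (1 - F)                      # dE/da at the output (sigma' = s(1-s))
    gW2 = H1.T @ dF                                 # gradient for the second layer
    dH = (dF @ W2[:-1].T) * H * (1 - H)             # back through W2, skipping the bias row
    gW1 = X1.T @ dH                                 # gradient for the first layer
    return E, gW1, gW2

# Tiny usage example with random data and a finite-difference check of one weight.
rng = np.random.default_rng(0)
X, T = rng.random((5, 3)), rng.random((5, 2))
W1, W2 = rng.standard_normal((4, 4)), rng.standard_normal((5, 2))
E, gW1, gW2 = mlp_gradient(X, T, W1, W2)
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
print(gW1[0, 0], (mlp_gradient(X, T, W1p, W2)[0] - E) / eps)   # should be close
```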