1 Definition of our Neural Network
Consider the following network: the input has $D_I$ dimensions, there is one hidden layer of dimension $D_H$, and the output has dimension $D_O$. Given two matrices $\Theta^{(1)} \in \mathbb{R}^{(D_I+1)\times D_H}$ and $\Theta^{(2)} \in \mathbb{R}^{(D_H+1)\times D_O}$, the forward propagation of an input vector $x \in \mathbb{R}^{D_I}$ (a column vector) works in the following way:
$$a^{(1)} = \Theta^{(1)T}\begin{pmatrix}1\\x\end{pmatrix} \in \mathbb{R}^{D_H}, \qquad z = \sigma\left(a^{(1)}\right) \in \mathbb{R}^{D_H},$$
$$a^{(2)} = \Theta^{(2)T}\begin{pmatrix}1\\z\end{pmatrix} \in \mathbb{R}^{D_O}, \qquad y = \sigma\left(a^{(2)}\right) \in \mathbb{R}^{D_O}.$$
For training purposes (and for computational reasons) we will want to propagate multiple input vectors $x_1, x_2, \dots, x_M$ with every $x_m \in \mathbb{R}^{D_I}$ at the same time. We collect them into a big input matrix
$$X = (x_1, x_2, \dots, x_M) \in \mathbb{R}^{D_I \times M}$$
Each column represents a different input sample, each row contains an input feature (for example a pixel in a scanned image). Forward propagation then looks as follows:
$$A^{(1)} = \Theta^{(1)T}\begin{pmatrix}\mathbf{1}\\X\end{pmatrix} \in \mathbb{R}^{D_H\times M}, \qquad Z = \sigma\left(A^{(1)}\right) \in \mathbb{R}^{D_H\times M},$$
$$A^{(2)} = \Theta^{(2)T}\begin{pmatrix}\mathbf{1}\\Z\end{pmatrix} \in \mathbb{R}^{D_O\times M}, \qquad Y = \sigma\left(A^{(2)}\right) \in \mathbb{R}^{D_O\times M},$$
where $\sigma$ is the sigmoid function $\sigma(a) = \frac{1}{1+\exp(-a)}$, applied elementwise.
So in short
$$Y = \sigma\left(\Theta^{(2)T}\begin{pmatrix}\mathbf{1}\\ \sigma\left(\Theta^{(1)T}\begin{pmatrix}\mathbf{1}\\X\end{pmatrix}\right)\end{pmatrix}\right)$$
The ones of course represent rows containing $M$ ones to match $X$ and $Z$. For easier calculation we collect the following submatrix definitions:
$$\Theta^{(1)} = \begin{pmatrix}\theta_0^{(1)T}\\ \vdots\\ \theta_{D_I}^{(1)T}\end{pmatrix}, \quad \text{where } \theta_i^{(1)T} \in \mathbb{R}^{D_H},$$
$$\Theta^{(2)} = \begin{pmatrix}\theta_0^{(2)T}\\ \vdots\\ \theta_{D_H}^{(2)T}\end{pmatrix}, \quad \text{where } \theta_j^{(2)T} \in \mathbb{R}^{D_O},$$
$$X = (X_1, \dots, X_M), \quad X_m = \begin{pmatrix}X_{1,m}\\ \vdots\\ X_{D_I,m}\end{pmatrix}, \qquad Z = (Z_1, \dots, Z_M), \quad Z_m = \begin{pmatrix}Z_{1,m}\\ \vdots\\ Z_{D_H,m}\end{pmatrix},$$
$$Y = (Y_1, \dots, Y_M), \quad Y_m = \begin{pmatrix}Y_{1,m}\\ \vdots\\ Y_{D_O,m}\end{pmatrix},$$
$$A^{(1)} = \left(A_1^{(1)}, \dots, A_M^{(1)}\right), \quad A_m^{(1)} = \begin{pmatrix}A_{1,m}^{(1)}\\ \vdots\\ A_{D_H,m}^{(1)}\end{pmatrix}, \qquad A^{(2)} = \left(A_1^{(2)}, \dots, A_M^{(2)}\right), \quad A_m^{(2)} = \begin{pmatrix}A_{1,m}^{(2)}\\ \vdots\\ A_{D_O,m}^{(2)}\end{pmatrix}.$$
We will stick to the convention that the summing indices are as follows:
$$i = (0,)\,1, \dots, D_I, \qquad j = (0,)\,1, \dots, D_H, \qquad k = 1, \dots, D_O, \qquad m = 1, \dots, M,$$
where the 0th index always denotes the bias variable (so $X_{0,m} = Z_{0,m} = 1$ for all $m = 1, \dots, M$).
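The forward propagation defined above translates almost line for line into NumPy. The following is a minimal sketch, assuming sigmoid activations as in the text; the concrete dimension values and the random weights are placeholder assumptions for illustration:

```python
import numpy as np

def sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a)), applied elementwise
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, Theta1, Theta2):
    """Batched forward propagation.

    X:      (D_I, M)      -- one input sample per column
    Theta1: (D_I + 1, D_H)
    Theta2: (D_H + 1, D_O)
    """
    M = X.shape[1]
    ones = np.ones((1, M))                   # the row of M ones (bias)
    A1 = Theta1.T @ np.vstack([ones, X])     # (D_H, M)
    Z = sigmoid(A1)                          # (D_H, M)
    A2 = Theta2.T @ np.vstack([ones, Z])     # (D_O, M)
    Y = sigmoid(A2)                          # (D_O, M)
    return A1, Z, A2, Y

# Example with arbitrary placeholder dimensions
rng = np.random.default_rng(0)
D_I, D_H, D_O, M = 4, 3, 2, 5
Theta1 = rng.normal(size=(D_I + 1, D_H))
Theta2 = rng.normal(size=(D_H + 1, D_O))
X = rng.normal(size=(D_I, M))
A1, Z, A2, Y = forward(X, Theta1, Theta2)
print(Y.shape)  # (2, 5), i.e. D_O x M
```

Stacking a row of ones on top of $X$ and $Z$ mirrors the bias rows in the formulas, so the shapes come out as $\mathbb{R}^{D_H\times M}$ and $\mathbb{R}^{D_O\times M}$ directly.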
2 Training the Neural Network
Neural networks are supervised machine learning algorithms: we need a training set with the correct results to tune the weights $\Theta^{(r)}$, $r = 1, 2$. The training set is as follows:

$X = (X_1, \dots, X_M)$ is the training set input (for convenience also of length $M$). In our example, each $X_m$ describes the set of pixels in the image to be resolved,

$Y = (Y_1, \dots, Y_M)$ is the training set output (for the currently incorrectly chosen weights $\Theta^{(r)}$), i.e. the neural network's current guess for the digit the image contains,

$L = (L_1, \dots, L_M)$ is the given correct result, i.e. $L_m$ contains information about which digit the image actually shows. $L$ follows the same notation as $Y$, so $L_m$ is a column vector containing $L_{1,m}, \dots, L_{D_O,m}$.
The training is supposed to tune the weights $\Theta^{(r)}$ so that $L \approx Y$, where the approximation is interpreted in terms of the following error function (called the cross-entropy error function):
$$E(\Theta) = -\sum_{m=1}^{M}\sum_{k=1}^{D_O}\Big[L_{k,m}\ln Y_{k,m} + (1 - L_{k,m})\ln(1 - Y_{k,m})\Big]$$
Our objective is to minimize the error function, i.e. find a set of weights $\Theta^{(r)}$ so that $E(\Theta)$ is small, which means that $L \approx Y$. This will be done in terms of a gradient descent algorithm, so we need the gradient of $E$ with respect to $\Theta$.
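To get a concrete feel for the error function, here is a minimal sketch of $E(\Theta)$ as a function of the network output $Y$ and the labels $L$; the numeric values are made-up placeholders:

```python
import numpy as np

def cross_entropy(Y, L):
    # E = -sum over k and m of [ L ln Y + (1 - L) ln(1 - Y) ]
    return -np.sum(L * np.log(Y) + (1 - L) * np.log(1 - Y))

# Tiny example: one sample (M = 1), two outputs (D_O = 2)
Y = np.array([[0.9], [0.1]])   # the network's current guess
L = np.array([[1.0], [0.0]])   # the correct result
print(cross_entropy(Y, L))     # -2 ln 0.9, about 0.2107
```

The error is small here because the guess already agrees with the labels; pushing $Y_{1,1}$ toward $0$ instead would make the first logarithm blow up.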
3 Learning by Backward Propagation
The gradients of $E$ can be obtained in a reversed fashion, i.e. we start from $Y$ and $L$ and work our way back. This is called backward propagation and is done as follows. First notice that
$$\frac{\partial E(\Theta)}{\partial \theta^{(r)}_{l_1,l_2}} = -\sum_{m=1}^{M}\sum_{k=1}^{D_O}\frac{L_{k,m} - Y_{k,m}}{Y_{k,m}(1 - Y_{k,m})}\,\frac{\partial Y_{k,m}}{\partial \theta^{(r)}_{l_1,l_2}} \tag{1}$$
The first step of backward propagation is the last step of forward propagation: consider first
$$Y_{k,m} = \sigma\left(\sum_{j=0}^{D_H}\theta^{(2)}_{j,k} Z_{j,m}\right), \quad \text{where } Z_{0,m} = 1, \tag{2}$$
so
$$\frac{\partial Y_{k,m}}{\partial \theta^{(2)}_{j,k}} = \sigma'\left(\sum_{j=0}^{D_H}\theta^{(2)}_{j,k} Z_{j,m}\right) Z_{j,m} = \sigma'\left(A^{(2)}_{k,m}\right) Z_{j,m}; \tag{3}$$
now we notice that $\sigma'(a) = \sigma(a)\,(1 - \sigma(a))$ and $Y_{k,m} = \sigma\left(A^{(2)}\right)_{k,m} = \sigma\left(A^{(2)}_{k,m}\right)$, hence
$$\sigma'\left(A^{(2)}_{k,m}\right) = Y_{k,m}(1 - Y_{k,m}). \tag{4}$$
Combined, we get
$$\frac{\partial E(\Theta)}{\partial \theta^{(2)}_{j,k}} = -\sum_{m=1}^{M}\frac{L_{k,m} - Y_{k,m}}{Y_{k,m}(1 - Y_{k,m})}\,Y_{k,m}(1 - Y_{k,m})\,Z_{j,m} = -\sum_{m=1}^{M}(L_{k,m} - Y_{k,m})\,Z_{j,m} \tag{5}$$
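The identity $\sigma'(a) = \sigma(a)(1-\sigma(a))$ from (4), which collapsed the fraction above, can be sanity-checked numerically against a central finite-difference quotient; the test point $a = 0.7$ is arbitrary:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, eps = 0.7, 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)  # central difference
analytic = sigmoid(a) * (1 - sigmoid(a))                     # sigma'(a) from (4)
print(abs(numeric - analytic) < 1e-8)  # True
```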
It will be useful to abbreviate
$$\delta^{(3)}_m = Y_m - L_m, \qquad \delta^{(3)}_{k,m} = Y_{k,m} - L_{k,m},$$
hence
$$\frac{\partial E(\Theta)}{\partial \theta^{(2)}_{j,k}} = \sum_{m=1}^{M}\delta^{(3)}_{k,m} Z_{j,m}, \tag{6}$$
or (for compactness fetishists)
$$\frac{\partial E(\Theta)}{\partial \Theta^{(2)}} = \sum_{m=1}^{M}\delta^{(3)}_m Z_m^T$$
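With $\delta^{(3)} = Y - L$, the gradient (6) is a single matrix product over the bias-extended $Z$. Below is a sketch under placeholder dimensions, arranged to match the shape of $\Theta^{(2)}$ (i.e. the transpose of the compact $\sum_m \delta^{(3)}_m Z_m^T$ form), and checked against a finite-difference derivative of $E$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, Theta1, Theta2):
    M = X.shape[1]
    ones = np.ones((1, M))
    Z = sigmoid(Theta1.T @ np.vstack([ones, X]))
    Y = sigmoid(Theta2.T @ np.vstack([ones, Z]))
    return Z, Y

def error(Y, L):
    return -np.sum(L * np.log(Y) + (1 - L) * np.log(1 - Y))

rng = np.random.default_rng(1)
D_I, D_H, D_O, M = 3, 4, 2, 6          # arbitrary placeholder sizes
Theta1 = rng.normal(size=(D_I + 1, D_H))
Theta2 = rng.normal(size=(D_H + 1, D_O))
X = rng.normal(size=(D_I, M))
L = (rng.random((D_O, M)) > 0.5).astype(float)

Z, Y = forward(X, Theta1, Theta2)
delta3 = Y - L                                  # (D_O, M)
Z_ext = np.vstack([np.ones((1, M)), Z])         # prepend Z_{0,m} = 1
grad2 = Z_ext @ delta3.T                        # entry (j,k) = sum_m delta3[k,m] Z[j,m]

# Finite-difference check of one arbitrary entry, (j, k) = (2, 1)
eps = 1e-6
Tp = Theta2.copy(); Tp[2, 1] += eps
Tm = Theta2.copy(); Tm[2, 1] -= eps
num = (error(forward(X, Theta1, Tp)[1], L)
       - error(forward(X, Theta1, Tm)[1], L)) / (2 * eps)
print(abs(grad2[2, 1] - num) < 1e-6)  # True
```

Such a finite-difference comparison is the standard way to catch index mistakes in hand-derived gradients.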
Expanding $Y_{k,m}$ in terms of $\Theta^{(1)}$ gives us
$$Y_{k,m} = \sigma\left(\sum_{j=0}^{D_H}\theta^{(2)}_{j,k} Z_{j,m}\right), \qquad Z_{0,m} = 1,$$
$$Z_{j,m} = \sigma\left(\sum_{i=0}^{D_I}\theta^{(1)}_{i,j} X_{i,m}\right), \qquad X_{0,m} = 1.$$
Taking the derivative with respect to $\Theta^{(1)}$:
$$\begin{aligned}
\frac{\partial Y_{k,m}}{\partial \theta^{(1)}_{i,j}} &= \sigma'\left(A^{(2)}_{k,m}\right)\theta^{(2)}_{j,k}\,\frac{\partial Z_{j,m}}{\partial \theta^{(1)}_{i,j}}\\
&= Y_{k,m}(1 - Y_{k,m})\,\sigma'\left(\sum_{i=0}^{D_I}\theta^{(1)}_{i,j} X_{i,m}\right) X_{i,m}\,\theta^{(2)}_{j,k}\\
&= Y_{k,m}(1 - Y_{k,m})\,\sigma'\left(A^{(1)}_{j,m}\right)\theta^{(2)}_{j,k}\,X_{i,m}
\end{aligned}$$
Inserting this into (1) gives
$$\begin{aligned}
\frac{\partial E(\Theta)}{\partial \theta^{(1)}_{i,j}} &= -\sum_{m=1}^{M} X_{i,m}\,\sigma'\left(A^{(1)}_{j,m}\right)\sum_{k=1}^{D_O}(L_{k,m} - Y_{k,m})\,\theta^{(2)}_{j,k}\\
&= -\sum_{m=1}^{M} X_{i,m}\,\sigma'\left(A^{(1)}_{j,m}\right)\theta^{(2)}_j (L_m - Y_m)\\
&= \sum_{m=1}^{M}\theta^{(2)}_j\,\delta^{(3)}_m\,\sigma'\left(A^{(1)}_{j,m}\right) X_{i,m}
\end{aligned}$$
Hence,
$$\frac{\partial E(\Theta)}{\partial \theta^{(1)}_{i,j}} = \sum_{m=1}^{M}\theta^{(2)}_j\,\delta^{(3)}_m\,\sigma'\left(A^{(1)}_{j,m}\right) X_{i,m} \tag{7}$$
Using some gradient descent algorithm, we can find a local minimum for the cost function, i.e. problem-adapted values for $\Theta$.
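Putting (6) and (7) together with the plain update $\Theta^{(r)} \leftarrow \Theta^{(r)} - \eta\,\partial E/\partial \Theta^{(r)}$ gives a complete training loop. The sketch below uses placeholder dimensions, random data, and an arbitrary learning rate; none of these values come from the text:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, Theta1, Theta2):
    M = X.shape[1]
    ones = np.ones((1, M))
    Z = sigmoid(Theta1.T @ np.vstack([ones, X]))   # (D_H, M)
    Y = sigmoid(Theta2.T @ np.vstack([ones, Z]))   # (D_O, M)
    return Z, Y

def error(Y, L):
    return -np.sum(L * np.log(Y) + (1 - L) * np.log(1 - Y))

rng = np.random.default_rng(2)
D_I, D_H, D_O, M = 3, 5, 2, 20                     # arbitrary placeholder sizes
X = rng.normal(size=(D_I, M))
L = (rng.random((D_O, M)) > 0.5).astype(float)
Theta1 = 0.1 * rng.normal(size=(D_I + 1, D_H))     # small random init
Theta2 = 0.1 * rng.normal(size=(D_H + 1, D_O))

eta = 0.01    # learning rate (arbitrary choice)
errs = []
for _ in range(300):
    Z, Y = forward(X, Theta1, Theta2)
    errs.append(error(Y, L))
    delta3 = Y - L                                  # (D_O, M)
    # Eq (6): dE/dTheta2[j, k] = sum_m delta3[k, m] Z_ext[j, m]
    Z_ext = np.vstack([np.ones((1, M)), Z])
    grad2 = Z_ext @ delta3.T                        # (D_H + 1, D_O)
    # Eq (7): dE/dTheta1[i, j] = sum_m (theta_j^(2) delta3_m) sigma'(A1[j, m]) X_ext[i, m],
    # with theta_j^(2) the non-bias rows of Theta2 and sigma'(A1) = Z (1 - Z)
    delta2 = (Theta2[1:, :] @ delta3) * Z * (1 - Z) # (D_H, M)
    X_ext = np.vstack([np.ones((1, M)), X])
    grad1 = X_ext @ delta2.T                        # (D_I + 1, D_H)
    Theta1 -= eta * grad1
    Theta2 -= eta * grad2

print(errs[0] > errs[-1])  # True: the error decreases over training
```

The factor $\sigma'(A^{(1)}_{j,m}) = Z_{j,m}(1 - Z_{j,m})$ reuses the identity (4) for the hidden layer, so $A^{(1)}$ never needs to be stored explicitly.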