1 Definition of our Neural Network
Consider the following network: the input has $D_I$ dimensions, there is one hidden layer of dimension $D_H$, and the output has dimension $D_O$. Given two matrices $\Theta^{(1)} \in \mathbb{R}^{(D_I+1)\times D_H}$ and $\Theta^{(2)} \in \mathbb{R}^{(D_H+1)\times D_O}$, the forward propagation of an input vector $x \in \mathbb{R}^{D_I}$ (a column vector) works in the following way:
$$a^{(1)} = \Theta^{(1)T}\begin{pmatrix}1\\x\end{pmatrix} \in \mathbb{R}^{D_H}, \qquad z = \sigma\left(a^{(1)}\right) \in \mathbb{R}^{D_H},$$
$$a^{(2)} = \Theta^{(2)T}\begin{pmatrix}1\\z\end{pmatrix} \in \mathbb{R}^{D_O}, \qquad y = \sigma\left(a^{(2)}\right) \in \mathbb{R}^{D_O}.$$
For training purposes (and for computational reasons) we will want to propagate multiple input vectors $x_1, x_2, \dots, x_M$ with every $x_m \in \mathbb{R}^{D_I}$ at the same time. We collect them into a big input matrix
$$X = (x_1, x_2, \dots, x_M) \in \mathbb{R}^{D_I \times M}$$
Each column represents a different input sample, each row contains an input feature (for example a pixel in a scanned image). Forward propagation then looks as follows:
$$A^{(1)} = \Theta^{(1)T}\begin{pmatrix}\mathbf{1}\\X\end{pmatrix} \in \mathbb{R}^{D_H\times M}, \qquad Z = \sigma\left(A^{(1)}\right) \in \mathbb{R}^{D_H\times M},$$
$$A^{(2)} = \Theta^{(2)T}\begin{pmatrix}\mathbf{1}\\Z\end{pmatrix} \in \mathbb{R}^{D_O\times M}, \qquad Y = \sigma\left(A^{(2)}\right) \in \mathbb{R}^{D_O\times M},$$
where $\sigma$ is the sigmoid function $\sigma(a) = \frac{1}{1+\exp(-a)}$, applied elementwise.
So in short
$$Y = \sigma\left(\Theta^{(2)T}\begin{pmatrix}\mathbf{1}\\ \sigma\left(\Theta^{(1)T}\begin{pmatrix}\mathbf{1}\\X\end{pmatrix}\right)\end{pmatrix}\right)$$
The ones of course represent rows containing $M$ ones to match $X$ and $Z$. For easier calculation we collect the following submatrix definitions:
$$\Theta^{(1)} = \begin{pmatrix}\theta_0^{(1)T}\\ \vdots\\ \theta_{D_I}^{(1)T}\end{pmatrix}, \quad \text{where } \theta_i^{(1)T} \in \mathbb{R}^{D_H},$$
$$\Theta^{(2)} = \begin{pmatrix}\theta_0^{(2)T}\\ \vdots\\ \theta_{D_H}^{(2)T}\end{pmatrix}, \quad \text{where } \theta_j^{(2)T} \in \mathbb{R}^{D_O},$$
$$X = (X_1, \dots, X_M), \quad X_m = \begin{pmatrix}X_{1,m}\\ \vdots\\ X_{D_I,m}\end{pmatrix}, \qquad Z = (Z_1, \dots, Z_M), \quad Z_m = \begin{pmatrix}Z_{1,m}\\ \vdots\\ Z_{D_H,m}\end{pmatrix},$$
$$Y = (Y_1, \dots, Y_M), \quad Y_m = \begin{pmatrix}Y_{1,m}\\ \vdots\\ Y_{D_O,m}\end{pmatrix},$$
$$A^{(1)} = \left(A_1^{(1)}, \dots, A_M^{(1)}\right), \quad A_m^{(1)} = \begin{pmatrix}A_{1,m}^{(1)}\\ \vdots\\ A_{D_H,m}^{(1)}\end{pmatrix}, \qquad A^{(2)} = \left(A_1^{(2)}, \dots, A_M^{(2)}\right), \quad A_m^{(2)} = \begin{pmatrix}A_{1,m}^{(2)}\\ \vdots\\ A_{D_O,m}^{(2)}\end{pmatrix}.$$
We will stick to the convention that the summing indices are as follows:
$$i = (0,)\,1, \dots, D_I, \qquad j = (0,)\,1, \dots, D_H, \qquad k = 1, \dots, D_O, \qquad m = 1, \dots, M,$$
where the 0th index always denotes the bias variable (so $X_{0,m} = Z_{0,m} = 1$ for all $m = 1, \dots, M$).
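The forward propagation defined above translates almost line for line into NumPy. The following is a minimal sketch, assuming sigmoid activations as in the text; the concrete dimension values and the random weights are placeholder assumptions for illustration:

```python
import numpy as np

def sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a)), applied elementwise
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, Theta1, Theta2):
    """Batched forward propagation.

    X:      (D_I, M)      -- one input sample per column
    Theta1: (D_I + 1, D_H)
    Theta2: (D_H + 1, D_O)
    """
    M = X.shape[1]
    ones = np.ones((1, M))                   # the row of M ones (bias)
    A1 = Theta1.T @ np.vstack([ones, X])     # (D_H, M)
    Z = sigmoid(A1)                          # (D_H, M)
    A2 = Theta2.T @ np.vstack([ones, Z])     # (D_O, M)
    Y = sigmoid(A2)                          # (D_O, M)
    return A1, Z, A2, Y

# Example with arbitrary placeholder dimensions
rng = np.random.default_rng(0)
D_I, D_H, D_O, M = 4, 3, 2, 5
Theta1 = rng.normal(size=(D_I + 1, D_H))
Theta2 = rng.normal(size=(D_H + 1, D_O))
X = rng.normal(size=(D_I, M))
A1, Z, A2, Y = forward(X, Theta1, Theta2)
print(Y.shape)  # (2, 5), i.e. D_O x M
```

Stacking a row of ones on top of $X$ and $Z$ mirrors the bias rows in the formulas, so the shapes come out as $\mathbb{R}^{D_H\times M}$ and $\mathbb{R}^{D_O\times M}$ directly.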
2 Training the Neural Network
Neural networks are supervised machine learning algorithms: we need a training set with the correct results to tune the weights $\Theta^{(r)}$, $r = 1, 2$. The training set is as follows:

$X = (X_1, \dots, X_M)$ is the training set input (for convenience also of length $M$). In our example, each $X_m$ describes the set of pixels in the image to be resolved,

$Y = (Y_1, \dots, Y_M)$ is the training set output (for the currently incorrectly chosen weights $\Theta^{(r)}$), i.e. the neural network's current guess for the digit the image contains,

$L = (L_1, \dots, L_M)$ is the given correct result, i.e. $L_m$ contains information about which digit the image actually shows. $L$ follows the same notation as $Y$, so $L_m$ is a column vector containing $L_{1,m}, \dots, L_{D_O,m}$.
The training is supposed to tune the weights $\Theta^{(r)}$ so that $L \approx Y$, where the approximation is interpreted in terms of the following error function (called the cross-entropy error function):
$$E(\Theta) = -\sum_{m=1}^{M}\sum_{k=1}^{D_O}\Big[L_{k,m}\ln Y_{k,m} + (1 - L_{k,m})\ln(1 - Y_{k,m})\Big]$$
Our objective is to minimize the error function, i.e. find a set of weights $\Theta^{(r)}$ so that $E(\Theta)$ is small, which means that $L \approx Y$. This will be done in terms of a gradient descent algorithm, so we need the gradient of $E$ with respect to $\Theta$.
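To get a concrete feel for the error function, here is a minimal sketch of $E(\Theta)$ as a function of the network output $Y$ and the labels $L$; the numeric values are made-up placeholders:

```python
import numpy as np

def cross_entropy(Y, L):
    # E = -sum over k and m of [ L ln Y + (1 - L) ln(1 - Y) ]
    return -np.sum(L * np.log(Y) + (1 - L) * np.log(1 - Y))

# Tiny example: one sample (M = 1), two outputs (D_O = 2)
Y = np.array([[0.9], [0.1]])   # the network's current guess
L = np.array([[1.0], [0.0]])   # the correct result
print(cross_entropy(Y, L))     # -2 ln 0.9, about 0.2107
```

The error is small here because the guess already agrees with the labels; pushing $Y_{1,1}$ toward $0$ instead would make the first logarithm blow up.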
3 Learning by Backward Propagation
The gradients of $E$ can be obtained in a reversed fashion, i.e. we start from $Y$ and $L$ and work our way back. This is called backward propagation and is done as follows. First notice that
$$\frac{\partial E(\Theta)}{\partial \theta^{(r)}_{l_1,l_2}} = -\sum_{m=1}^{M}\sum_{k=1}^{D_O}\frac{L_{k,m} - Y_{k,m}}{Y_{k,m}(1 - Y_{k,m})}\,\frac{\partial Y_{k,m}}{\partial \theta^{(r)}_{l_1,l_2}} \tag{1}$$
The first step of backward propagation is the last step of forward propagation: consider first
$$Y_{k,m} = \sigma\left(\sum_{j=0}^{D_H}\theta^{(2)}_{j,k} Z_{j,m}\right), \quad \text{where } Z_{0,m} = 1, \tag{2}$$
so
$$\frac{\partial Y_{k,m}}{\partial \theta^{(2)}_{j,k}} = \sigma'\left(\sum_{j=0}^{D_H}\theta^{(2)}_{j,k} Z_{j,m}\right) Z_{j,m} = \sigma'\left(A^{(2)}_{k,m}\right) Z_{j,m}; \tag{3}$$
now we notice that $\sigma'(a) = \sigma(a)\,(1 - \sigma(a))$ and $Y_{k,m} = \sigma\left(A^{(2)}\right)_{k,m} = \sigma\left(A^{(2)}_{k,m}\right)$, hence
$$\sigma'\left(A^{(2)}_{k,m}\right) = Y_{k,m}(1 - Y_{k,m}). \tag{4}$$
Combined, we get
$$\frac{\partial E(\Theta)}{\partial \theta^{(2)}_{j,k}} = -\sum_{m=1}^{M}\frac{L_{k,m} - Y_{k,m}}{Y_{k,m}(1 - Y_{k,m})}\,Y_{k,m}(1 - Y_{k,m})\,Z_{j,m} = -\sum_{m=1}^{M}(L_{k,m} - Y_{k,m})\,Z_{j,m} \tag{5}$$
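The identity $\sigma'(a) = \sigma(a)(1-\sigma(a))$ from (4), which collapsed the fraction above, can be sanity-checked numerically against a central finite-difference quotient; the test point $a = 0.7$ is arbitrary:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, eps = 0.7, 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)  # central difference
analytic = sigmoid(a) * (1 - sigmoid(a))                     # sigma'(a) from (4)
print(abs(numeric - analytic) < 1e-8)  # True
```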
It will be useful to abbreviate
$$\delta^{(3)}_m = Y_m - L_m, \qquad \delta^{(3)}_{k,m} = Y_{k,m} - L_{k,m},$$
hence
$$\frac{\partial E(\Theta)}{\partial \theta^{(2)}_{j,k}} = \sum_{m=1}^{M}\delta^{(3)}_{k,m} Z_{j,m}, \tag{6}$$
or (for compactness fetishists)
$$\frac{\partial E(\Theta)}{\partial \Theta^{(2)}} = \sum_{m=1}^{M}\delta^{(3)}_m Z_m^T$$
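With $\delta^{(3)} = Y - L$, the gradient (6) is a single matrix product over the bias-extended $Z$. Below is a sketch under placeholder dimensions, arranged to match the shape of $\Theta^{(2)}$ (i.e. the transpose of the compact $\sum_m \delta^{(3)}_m Z_m^T$ form), and checked against a finite-difference derivative of $E$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, Theta1, Theta2):
    M = X.shape[1]
    ones = np.ones((1, M))
    Z = sigmoid(Theta1.T @ np.vstack([ones, X]))
    Y = sigmoid(Theta2.T @ np.vstack([ones, Z]))
    return Z, Y

def error(Y, L):
    return -np.sum(L * np.log(Y) + (1 - L) * np.log(1 - Y))

rng = np.random.default_rng(1)
D_I, D_H, D_O, M = 3, 4, 2, 6          # arbitrary placeholder sizes
Theta1 = rng.normal(size=(D_I + 1, D_H))
Theta2 = rng.normal(size=(D_H + 1, D_O))
X = rng.normal(size=(D_I, M))
L = (rng.random((D_O, M)) > 0.5).astype(float)

Z, Y = forward(X, Theta1, Theta2)
delta3 = Y - L                                  # (D_O, M)
Z_ext = np.vstack([np.ones((1, M)), Z])         # prepend Z_{0,m} = 1
grad2 = Z_ext @ delta3.T                        # entry (j,k) = sum_m delta3[k,m] Z[j,m]

# Finite-difference check of one arbitrary entry, (j, k) = (2, 1)
eps = 1e-6
Tp = Theta2.copy(); Tp[2, 1] += eps
Tm = Theta2.copy(); Tm[2, 1] -= eps
num = (error(forward(X, Theta1, Tp)[1], L)
       - error(forward(X, Theta1, Tm)[1], L)) / (2 * eps)
print(abs(grad2[2, 1] - num) < 1e-6)  # True
```

Such a finite-difference comparison is the standard way to catch index mistakes in hand-derived gradients.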
Expanding $Y_{k,m}$ in terms of $\Theta^{(1)}$ gives us
$$Y_{k,m} = \sigma\left(\sum_{j=0}^{D_H}\theta^{(2)}_{j,k} Z_{j,m}\right), \qquad Z_{0,m} = 1,$$
$$Z_{j,m} = \sigma\left(\sum_{i=0}^{D_I}\theta^{(1)}_{i,j} X_{i,m}\right), \qquad X_{0,m} = 1.$$
Taking the derivative with respect to $\Theta^{(1)}$:
$$\begin{aligned}
\frac{\partial Y_{k,m}}{\partial \theta^{(1)}_{i,j}} &= \sigma'\left(A^{(2)}_{k,m}\right)\theta^{(2)}_{j,k}\,\frac{\partial Z_{j,m}}{\partial \theta^{(1)}_{i,j}}\\
&= Y_{k,m}(1 - Y_{k,m})\,\sigma'\left(\sum_{i=0}^{D_I}\theta^{(1)}_{i,j} X_{i,m}\right) X_{i,m}\,\theta^{(2)}_{j,k}\\
&= Y_{k,m}(1 - Y_{k,m})\,\sigma'\left(A^{(1)}_{j,m}\right)\theta^{(2)}_{j,k}\,X_{i,m}
\end{aligned}$$
Inserting this into (1) gives
$$\begin{aligned}
\frac{\partial E(\Theta)}{\partial \theta^{(1)}_{i,j}} &= -\sum_{m=1}^{M} X_{i,m}\,\sigma'\left(A^{(1)}_{j,m}\right)\sum_{k=1}^{D_O}(L_{k,m} - Y_{k,m})\,\theta^{(2)}_{j,k}\\
&= -\sum_{m=1}^{M} X_{i,m}\,\sigma'\left(A^{(1)}_{j,m}\right)\theta^{(2)}_j (L_m - Y_m)\\
&= \sum_{m=1}^{M}\theta^{(2)}_j\,\delta^{(3)}_m\,\sigma'\left(A^{(1)}_{j,m}\right) X_{i,m}
\end{aligned}$$
Hence,
$$\frac{\partial E(\Theta)}{\partial \theta^{(1)}_{i,j}} = \sum_{m=1}^{M}\theta^{(2)}_j\,\delta^{(3)}_m\,\sigma'\left(A^{(1)}_{j,m}\right) X_{i,m} \tag{7}$$
Using some gradient descent algorithm, we can find a local minimum for the cost function, i.e. problem-adapted values for $\Theta$.
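Putting (6) and (7) together with the plain update $\Theta^{(r)} \leftarrow \Theta^{(r)} - \eta\,\partial E/\partial \Theta^{(r)}$ gives a complete training loop. The sketch below uses placeholder dimensions, random data, and an arbitrary learning rate; none of these values come from the text:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, Theta1, Theta2):
    M = X.shape[1]
    ones = np.ones((1, M))
    Z = sigmoid(Theta1.T @ np.vstack([ones, X]))   # (D_H, M)
    Y = sigmoid(Theta2.T @ np.vstack([ones, Z]))   # (D_O, M)
    return Z, Y

def error(Y, L):
    return -np.sum(L * np.log(Y) + (1 - L) * np.log(1 - Y))

rng = np.random.default_rng(2)
D_I, D_H, D_O, M = 3, 5, 2, 20                     # arbitrary placeholder sizes
X = rng.normal(size=(D_I, M))
L = (rng.random((D_O, M)) > 0.5).astype(float)
Theta1 = 0.1 * rng.normal(size=(D_I + 1, D_H))     # small random init
Theta2 = 0.1 * rng.normal(size=(D_H + 1, D_O))

eta = 0.01    # learning rate (arbitrary choice)
errs = []
for _ in range(300):
    Z, Y = forward(X, Theta1, Theta2)
    errs.append(error(Y, L))
    delta3 = Y - L                                  # (D_O, M)
    # Eq (6): dE/dTheta2[j, k] = sum_m delta3[k, m] Z_ext[j, m]
    Z_ext = np.vstack([np.ones((1, M)), Z])
    grad2 = Z_ext @ delta3.T                        # (D_H + 1, D_O)
    # Eq (7): dE/dTheta1[i, j] = sum_m (theta_j^(2) delta3_m) sigma'(A1[j, m]) X_ext[i, m],
    # with theta_j^(2) the non-bias rows of Theta2 and sigma'(A1) = Z (1 - Z)
    delta2 = (Theta2[1:, :] @ delta3) * Z * (1 - Z) # (D_H, M)
    X_ext = np.vstack([np.ones((1, M)), X])
    grad1 = X_ext @ delta2.T                        # (D_I + 1, D_H)
    Theta1 -= eta * grad1
    Theta2 -= eta * grad2

print(errs[0] > errs[-1])  # True: the error decreases over training
```

The factor $\sigma'(A^{(1)}_{j,m}) = Z_{j,m}(1 - Z_{j,m})$ reuses the identity (4) for the hidden layer, so $A^{(1)}$ never needs to be stored explicitly.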