Academic year: 2022
1 Definition of our Neural Network

Consider the following network: the input consists of $D_I$ dimensions, there is one hidden layer with dimension $D_H$, and the output has dimension $D_O$. Given two matrices $\Theta^{(1)} \in \mathbb{R}^{(D_I+1) \times D_H}$ and $\Theta^{(2)} \in \mathbb{R}^{(D_H+1) \times D_O}$, the forward propagation of an input vector $x \in \mathbb{R}^{D_I}$ (a column vector) works in the following way:

$$a^{(1)} = \Theta^{(1)T} \begin{pmatrix} 1 \\ x \end{pmatrix} \in \mathbb{R}^{D_H}, \qquad z = \sigma\bigl(a^{(1)}\bigr) \in \mathbb{R}^{D_H},$$

$$a^{(2)} = \Theta^{(2)T} \begin{pmatrix} 1 \\ z \end{pmatrix} \in \mathbb{R}^{D_O}, \qquad y = \sigma\bigl(a^{(2)}\bigr) \in \mathbb{R}^{D_O},$$

where $\sigma$ denotes the sigmoid function defined below.
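As a quick illustration, the four steps can be traced in NumPy; the layer sizes and random weights below are made-up placeholders, not values from the text:

```python
import numpy as np

def sigmoid(a):
    """Elementwise logistic function 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D_I, D_H, D_O = 4, 3, 2                      # illustrative dimensions
Theta1 = rng.normal(size=(D_I + 1, D_H))     # Theta^(1) in R^{(D_I+1) x D_H}
Theta2 = rng.normal(size=(D_H + 1, D_O))     # Theta^(2) in R^{(D_H+1) x D_O}

x = rng.normal(size=D_I)                     # input vector in R^{D_I}

a1 = Theta1.T @ np.concatenate(([1.0], x))   # a^(1) in R^{D_H}
z = sigmoid(a1)                              # z in R^{D_H}
a2 = Theta2.T @ np.concatenate(([1.0], z))   # a^(2) in R^{D_O}
y = sigmoid(a2)                              # y in R^{D_O}
```

Prepending the constant $1$ to $x$ and $z$ is what activates the bias rows of each $\Theta^{(r)}$.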

For training purposes (and for computational issues) we will want to propagate multiple input vectors $x_1, x_2, \ldots, x_M$ with every $x_m \in \mathbb{R}^{D_I}$ at the same time. We collect them into a big input matrix

$$X = (x_1, x_2, \ldots, x_M) \in \mathbb{R}^{D_I \times M}.$$

Each column represents a different input sample, each row contains an input feature (for example a pixel in a scanned image). Forward propagation then looks as follows:

$$A^{(1)} = \Theta^{(1)T} \begin{pmatrix} 1 \\ X \end{pmatrix} \in \mathbb{R}^{D_H \times M}, \qquad Z = \sigma\bigl(A^{(1)}\bigr) \in \mathbb{R}^{D_H \times M},$$

$$A^{(2)} = \Theta^{(2)T} \begin{pmatrix} 1 \\ Z \end{pmatrix} \in \mathbb{R}^{D_O \times M}, \qquad Y = \sigma\bigl(A^{(2)}\bigr) \in \mathbb{R}^{D_O \times M},$$

where $\sigma$ is the sigmoid function $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$, applied elementwise.
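A minimal NumPy sketch of this batched forward pass (the sizes, the random weights, and the helper name `add_ones_row` are all invented for illustration):

```python
import numpy as np

def sigmoid(A):
    """Elementwise sigmoid 1 / (1 + exp(-A))."""
    return 1.0 / (1.0 + np.exp(-A))

def add_ones_row(A):
    """Prepend a row of M ones so the bias rows of Theta^(r) apply."""
    return np.vstack([np.ones((1, A.shape[1])), A])

rng = np.random.default_rng(1)
D_I, D_H, D_O, M = 4, 3, 2, 5                # illustrative dimensions
Theta1 = rng.normal(size=(D_I + 1, D_H))
Theta2 = rng.normal(size=(D_H + 1, D_O))

X = rng.normal(size=(D_I, M))                # columns = input samples
Z = sigmoid(Theta1.T @ add_ones_row(X))      # D_H x M
Y = sigmoid(Theta2.T @ add_ones_row(Z))      # D_O x M
```

Each column of `Y` equals the result of propagating the corresponding column of `X` on its own, which is exactly why batching works.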

So in short

$$Y = \sigma\left\{ \Theta^{(2)T} \begin{pmatrix} 1 \\ \sigma\left( \Theta^{(1)T} \begin{pmatrix} 1 \\ X \end{pmatrix} \right) \end{pmatrix} \right\}$$

The ones of course represent rows containing $M$ ones to match $X$ and $Z$. For easier calculation we collect the following submatrix definitions:

$$\Theta^{(1)} = \begin{pmatrix} \theta_0^{(1)T} \\ \vdots \\ \theta_{D_I}^{(1)T} \end{pmatrix}, \text{ where } \theta_i^{(1)T} \in \mathbb{R}^{D_H}, \qquad \Theta^{(2)} = \begin{pmatrix} \theta_0^{(2)T} \\ \vdots \\ \theta_{D_H}^{(2)T} \end{pmatrix}, \text{ where } \theta_j^{(2)T} \in \mathbb{R}^{D_O},$$

$$X = (X_1, \ldots, X_M), \quad X_m = \begin{pmatrix} X_{1,m} \\ \vdots \\ X_{D_I,m} \end{pmatrix}, \qquad Z = (Z_1, \ldots, Z_M), \quad Z_m = \begin{pmatrix} Z_{1,m} \\ \vdots \\ Z_{D_H,m} \end{pmatrix},$$

$$Y = (Y_1, \ldots, Y_M), \quad Y_m = \begin{pmatrix} Y_{1,m} \\ \vdots \\ Y_{D_O,m} \end{pmatrix},$$

$$A^{(1)} = \bigl(A_1^{(1)}, \ldots, A_M^{(1)}\bigr), \quad A_m^{(1)} = \begin{pmatrix} A_{1,m}^{(1)} \\ \vdots \\ A_{D_H,m}^{(1)} \end{pmatrix}, \qquad A^{(2)} = \bigl(A_1^{(2)}, \ldots, A_M^{(2)}\bigr), \quad A_m^{(2)} = \begin{pmatrix} A_{1,m}^{(2)} \\ \vdots \\ A_{D_O,m}^{(2)} \end{pmatrix}$$

We will stick to the convention that the summing indices are as follows:

$$i = (0,)\,1, \ldots, D_I$$
$$j = (0,)\,1, \ldots, D_H$$
$$k = 1, \ldots, D_O$$
$$m = 1, \ldots, M$$

where the 0th index always denotes the bias variable (so $X_{0,m} = Z_{0,m} = 1$ for all $m = 1, \ldots, M$).

2 Training the Neural Network

Neural Networks are supervised machine learning algorithms: we need a training set with the correct results to tune the weights $\Theta^{(r)}, r = 1, 2$. The training set is given by:

$X = (X_1, \ldots, X_M)$, the training set input (and, for convenience, of length $M$). In our example, each $X_m$ describes the set of pixels in the image to be recognized,

$Y = (Y_1, \ldots, Y_M)$, the training set output (for the currently incorrectly chosen weights $\Theta^{(r)}$), i.e. the Neural Network's current guess for the digit the image contains,

$L = (L_1, \ldots, L_M)$, the given correct result, i.e. $L_m$ contains the information which digit the image actually shows. $L$ follows the same notation as $Y$, so $L_m$ is a column vector containing $L_{1,m}, \ldots, L_{D_O,m}$.

The training is supposed to tune the weights $\Theta^{(r)}$ so that $L \approx Y$, where the approximation is interpreted in terms of the following error function (called the cross-entropy error function):

$$E(\Theta) = -\sum_{m=1}^{M} \sum_{k=1}^{D_O} \bigl[ L_{k,m} \ln Y_{k,m} + (1 - L_{k,m}) \ln(1 - Y_{k,m}) \bigr]$$


Our objective is to minimize the error function, i.e. find a set of weights $\Theta^{(r)}$ so that $E(\Theta)$ is small, which means that $L \approx Y$. This will be done in terms of a gradient descent algorithm, so we need the gradient of $E$ with respect to $\Theta$.
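A short sketch of the error function in NumPy; the label and prediction arrays below are invented toy values, chosen only to show that predictions close to the labels score lower:

```python
import numpy as np

def cross_entropy(Y, L):
    """E(Theta) = -sum_{m,k} [ L ln Y + (1 - L) ln(1 - Y) ], for Y, L of shape D_O x M."""
    return -np.sum(L * np.log(Y) + (1.0 - L) * np.log(1.0 - Y))

L = np.array([[1.0, 0.0],
              [0.0, 1.0]])        # correct labels, D_O = M = 2
Y_good = np.array([[0.9, 0.1],
                   [0.1, 0.9]])   # predictions close to L
Y_bad = np.array([[0.5, 0.5],
                  [0.5, 0.5]])    # uninformative predictions

print(cross_entropy(Y_good, L) < cross_entropy(Y_bad, L))  # True
```

Every one of the four entries of `Y_good` contributes $-\ln 0.9 \approx 0.105$, so the good prediction scores about $0.42$ against roughly $2.77$ for the uninformative one.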

3 Learning by Backward Propagation

The gradients of $E$ can be obtained in a reversed fashion, i.e. we start from $Y$ and $L$ and work our way back. This is called Backward Propagation and is done as follows. First notice that

$$\frac{\partial E(\Theta)}{\partial \theta_{l_1,l_2}^{(r)}} = -\sum_{m=1}^{M} \sum_{k=1}^{D_O} \frac{L_{k,m} - Y_{k,m}}{Y_{k,m}(1 - Y_{k,m})} \frac{\partial Y_{k,m}}{\partial \theta_{l_1,l_2}^{(r)}} \tag{1}$$

The first step of Backward Propagation is the last step of Forward Propagation: Consider first

$$Y_{k,m} = \sigma\left\{ \sum_{j=0}^{D_H} \theta_{j,k}^{(2)} Z_{j,m} \right\}, \quad \text{where } Z_{0,m} = 1, \tag{2}$$

so

$$\frac{\partial Y_{k,m}}{\partial \theta_{j,k}^{(2)}} = \sigma'\left\{ \sum_{j=0}^{D_H} \theta_{j,k}^{(2)} Z_{j,m} \right\} Z_{j,m} = \sigma'\bigl(A_{k,m}^{(2)}\bigr)\, Z_{j,m}, \tag{3}$$

now we notice that $\sigma'(a) = \sigma(a)(1 - \sigma(a))$ and $Y_{k,m} = \sigma\bigl(A_{k,m}^{(2)}\bigr)$, hence

$$\sigma'\bigl(A_{k,m}^{(2)}\bigr) = Y_{k,m}(1 - Y_{k,m}). \tag{4}$$

Combined, we get

$$\frac{\partial E(\Theta)}{\partial \theta_{j,k}^{(2)}} = -\sum_{m=1}^{M} \frac{L_{k,m} - Y_{k,m}}{Y_{k,m}(1 - Y_{k,m})}\, Y_{k,m}(1 - Y_{k,m})\, Z_{j,m} = -\sum_{m=1}^{M} (L_{k,m} - Y_{k,m})\, Z_{j,m} \tag{5}$$
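The identity $\sigma'(a) = \sigma(a)(1 - \sigma(a))$ used in this step is easy to verify numerically with a central difference; the probe point and step size below are arbitrary choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, eps = 0.7, 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2.0 * eps)   # finite-difference slope
analytic = sigmoid(a) * (1.0 - sigmoid(a))                      # sigma'(a) per the identity
print(abs(numeric - analytic) < 1e-8)  # True
```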

It will be useful to abbreviate

$$\delta_m^{(3)} = Y_m - L_m, \qquad \delta_{k,m}^{(3)} = Y_{k,m} - L_{k,m},$$

hence

$$\frac{\partial E(\Theta)}{\partial \theta_{j,k}^{(2)}} = \sum_{m=1}^{M} \delta_{k,m}^{(3)}\, Z_{j,m}, \tag{6}$$

or (for compactness fetishists)

$$\frac{\partial E(\Theta)}{\partial \Theta^{(2)}} = \sum_{m=1}^{M} \delta_m^{(3)} Z_m^T$$
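Equation (6) can be checked against a finite difference of $E$; everything in this sketch (sizes, random data, helper names) is an invented toy setup, with the entries arranged so that `grad_Theta2` has the same shape as $\Theta^{(2)}$:

```python
import numpy as np

def sigmoid(A):
    return 1.0 / (1.0 + np.exp(-A))

def add_ones_row(A):
    """Prepend a row of M ones (the bias entry Z_{0,m} = 1)."""
    return np.vstack([np.ones((1, A.shape[1])), A])

rng = np.random.default_rng(2)
D_I, D_H, D_O, M = 4, 3, 2, 5
Theta1 = rng.normal(size=(D_I + 1, D_H))
Theta2 = rng.normal(size=(D_H + 1, D_O))
X = rng.normal(size=(D_I, M))
L = rng.integers(0, 2, size=(D_O, M)).astype(float)   # fake 0/1 labels

Z = sigmoid(Theta1.T @ add_ones_row(X))               # D_H x M
Y = sigmoid(Theta2.T @ add_ones_row(Z))               # D_O x M

delta3 = Y - L                                        # delta^(3), D_O x M
# Entry (j, k) equals sum_m delta3[k, m] * Z_aug[j, m]   (j = 0 is the bias row).
grad_Theta2 = add_ones_row(Z) @ delta3.T              # (D_H+1) x D_O

# Finite-difference check of one entry, with Z held fixed.
def E_of(T2):
    Yp = sigmoid(T2.T @ add_ones_row(Z))
    return -np.sum(L * np.log(Yp) + (1.0 - L) * np.log(1.0 - Yp))

probe = np.zeros_like(Theta2)
probe[1, 0] = 1e-6
numeric = (E_of(Theta2 + probe) - E_of(Theta2 - probe)) / 2e-6
print(abs(numeric - grad_Theta2[1, 0]) < 1e-5)  # True
```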


Expanding $Y_{k,m}$ in terms of $\Theta^{(1)}$ gives us

$$Y_{k,m} = \sigma\left\{ \sum_{j=0}^{D_H} \theta_{j,k}^{(2)} Z_{j,m} \right\}, \qquad Z_{0,m} = 1,$$

$$Z_{j,m} = \sigma\left\{ \sum_{i=0}^{D_I} \theta_{i,j}^{(1)} X_{i,m} \right\}, \qquad X_{0,m} = 1.$$

Differentiating with respect to $\Theta^{(1)}$:

$$\begin{aligned}
\frac{\partial Y_{k,m}}{\partial \theta_{i,j}^{(1)}} &= \sigma'\bigl(A_{k,m}^{(2)}\bigr)\,\theta_{j,k}^{(2)}\,\frac{\partial Z_{j,m}}{\partial \theta_{i,j}^{(1)}} \\
&= Y_{k,m}(1 - Y_{k,m})\,\theta_{j,k}^{(2)}\,\sigma'\left\{ \sum_{i=0}^{D_I} \theta_{i,j}^{(1)} X_{i,m} \right\} X_{i,m} \\
&= Y_{k,m}(1 - Y_{k,m})\,\theta_{j,k}^{(2)}\,\sigma'\bigl(A_{j,m}^{(1)}\bigr)\, X_{i,m}
\end{aligned}$$

$$\begin{aligned}
\frac{\partial E(\Theta)}{\partial \theta_{i,j}^{(1)}} &= -\sum_{m=1}^{M} X_{i,m}\,\sigma'\bigl(A_{j,m}^{(1)}\bigr) \sum_{k=1}^{D_O} (L_{k,m} - Y_{k,m})\,\theta_{j,k}^{(2)} \\
&= -\sum_{m=1}^{M} X_{i,m}\,\sigma'\bigl(A_{j,m}^{(1)}\bigr)\,\theta_j^{(2)T}(L_m - Y_m) \\
&= \sum_{m=1}^{M} \theta_j^{(2)T}\delta_m^{(3)}\,\sigma'\bigl(A_{j,m}^{(1)}\bigr)\, X_{i,m}
\end{aligned}$$

Hence,

$$\frac{\partial E(\Theta)}{\partial \theta_{i,j}^{(1)}} = \sum_{m=1}^{M} \theta_j^{(2)T}\delta_m^{(3)}\,\sigma'\bigl(A_{j,m}^{(1)}\bigr)\, X_{i,m} \tag{7}$$
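In the same spirit, equation (7) collapses to one matrix product once we use $\sigma'\bigl(A^{(1)}\bigr) = Z(1-Z)$ elementwise; the sizes, data, and helper names below are again an invented toy setup:

```python
import numpy as np

def sigmoid(A):
    return 1.0 / (1.0 + np.exp(-A))

def add_ones_row(A):
    """Prepend a row of M ones (the bias entry X_{0,m} = 1)."""
    return np.vstack([np.ones((1, A.shape[1])), A])

rng = np.random.default_rng(3)
D_I, D_H, D_O, M = 4, 3, 2, 5
Theta1 = rng.normal(size=(D_I + 1, D_H))
Theta2 = rng.normal(size=(D_H + 1, D_O))
X = rng.normal(size=(D_I, M))
L = rng.integers(0, 2, size=(D_O, M)).astype(float)

Z = sigmoid(Theta1.T @ add_ones_row(X))
Y = sigmoid(Theta2.T @ add_ones_row(Z))

delta3 = Y - L
# theta_j^(2)T delta_m^(3) for j = 1..D_H: drop the bias row of Theta^(2),
# then multiply by sigma'(A^(1)) = Z * (1 - Z) elementwise.
delta2 = (Theta2[1:, :] @ delta3) * Z * (1.0 - Z)     # D_H x M
grad_Theta1 = add_ones_row(X) @ delta2.T              # (D_I+1) x D_H

# Finite-difference check of one entry.
def E_of(T1):
    Zp = sigmoid(T1.T @ add_ones_row(X))
    Yp = sigmoid(Theta2.T @ add_ones_row(Zp))
    return -np.sum(L * np.log(Yp) + (1.0 - L) * np.log(1.0 - Yp))

probe = np.zeros_like(Theta1)
probe[2, 1] = 1e-6
numeric = (E_of(Theta1 + probe) - E_of(Theta1 - probe)) / 2e-6
print(abs(numeric - grad_Theta1[2, 1]) < 1e-5)  # True
```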

Using some gradient descent algorithm, we can find a local minimum of the cost function, i.e. problem-adapted values for $\Theta$.
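Putting the pieces together, a bare-bones gradient descent loop might look as follows; the data, learning rate, initialization scale, and iteration count are all arbitrary stand-ins, not recommendations from the text:

```python
import numpy as np

def sigmoid(A):
    return 1.0 / (1.0 + np.exp(-A))

def add_ones_row(A):
    return np.vstack([np.ones((1, A.shape[1])), A])

def forward(Theta1, Theta2, X):
    Z = sigmoid(Theta1.T @ add_ones_row(X))
    Y = sigmoid(Theta2.T @ add_ones_row(Z))
    return Z, Y

def cross_entropy(Y, L):
    return -np.sum(L * np.log(Y) + (1.0 - L) * np.log(1.0 - Y))

rng = np.random.default_rng(4)
D_I, D_H, D_O, M = 4, 3, 2, 20
Theta1 = 0.1 * rng.normal(size=(D_I + 1, D_H))   # small random init
Theta2 = 0.1 * rng.normal(size=(D_H + 1, D_O))
X = rng.normal(size=(D_I, M))
L = rng.integers(0, 2, size=(D_O, M)).astype(float)

eta = 0.01                                       # learning rate (arbitrary)
errors = []
for _ in range(200):
    Z, Y = forward(Theta1, Theta2, X)
    errors.append(cross_entropy(Y, L))
    delta3 = Y - L
    delta2 = (Theta2[1:, :] @ delta3) * Z * (1.0 - Z)
    Theta2 -= eta * (add_ones_row(Z) @ delta3.T)   # gradient per eq. (6)
    Theta1 -= eta * (add_ones_row(X) @ delta2.T)   # gradient per eq. (7)

print(errors[-1] < errors[0])
```

Note that `delta2` must be computed from the pre-update $\Theta^{(2)}$, which is why the two weight updates come after both deltas.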

