
DeepLearning on FPGAs

Artificial Neural Networks: Backpropagation and more

Sebastian Buschjäger

Technische Universität Dortmund - Fakultät Informatik - Lehrstuhl 8

November 10, 2016


Recap: Homework

Question: So what's your accuracy?

Question: What about speed?

A remark about notation: In the previous slides I used θ twice with different meanings:

1) As the "bias" parameter for the perceptron
2) As the vector to be optimized by gradient descent

⇒ This is now changed: θ will always be used in a general fashion as the vector to be optimized.

Any questions / remarks / whatsoever?

DeepLearning on FPGAs 2


Recap: Data Mining (1)

Important concepts:

Feature Engineering is key to solve Data Mining tasks
Deep Learning combines learning and Feature Engineering

Data Mining approach:

Specify a model family (perceptron)
Specify an optimization procedure (gradient descent)
Specify a cost / loss function (RMSE or cross-entropy)

Perceptron: A linear classifier $f_b\colon \mathbb{R}^d \to \{0,1\}$ with

$$f_b(\vec{x}) = \begin{cases} +1 & \text{if } \sum_{i=1}^{d} w_i \cdot x_i \ge b \\ 0 & \text{else} \end{cases}$$

DeepLearning on FPGAs 3


Recap: Data Mining (2)

Optimization procedure: Gradient descent

$$\hat{\theta}_{new} = \hat{\theta}_{old} - \alpha \cdot \nabla_{\theta}\, \ell(D, \hat{\theta}_{old})$$

Loss function: RMSE or cross-entropy

$$\ell(D,\hat{\theta}) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - f_{\hat{\theta}}(\vec{x}_i)\right)^2}$$

$$\ell(D,\hat{\theta}) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \ln f_{\hat{\theta}}(\vec{x}_i) + (1-y_i) \ln\left(1-f_{\hat{\theta}}(\vec{x}_i)\right)\right]$$

So far: Training of a single perceptron

Now: Training of a multi-layer perceptron (MLP)

DeepLearning on FPGAs 4
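Both loss functions translate directly into code. Below is a minimal sketch in C (the language this course targets later) computing each loss over N model outputs; the function names and the flat-array interface are illustrative choices of mine, not part of the slides. Compile with -lm.

```c
#include <math.h>
#include <stdio.h>

/* RMSE: sqrt( 1/N * sum_i (y_i - f(x_i))^2 ) */
double rmse_loss(const double *y, const double *f, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double d = y[i] - f[i];
        sum += d * d;
    }
    return sqrt(sum / n);
}

/* Cross-entropy: -1/N * sum_i [ y_i ln f(x_i) + (1 - y_i) ln (1 - f(x_i)) ] */
double cross_entropy_loss(const double *y, const double *f, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += y[i] * log(f[i]) + (1.0 - y[i]) * log(1.0 - f[i]);
    return -sum / n;
}

int main(void) {
    double y[] = {1.0, 0.0, 1.0};   /* labels */
    double f[] = {0.9, 0.2, 0.7};   /* model outputs, strictly in (0,1) */
    printf("RMSE:          %f\n", rmse_loss(y, f, 3));
    printf("Cross-entropy: %f\n", cross_entropy_loss(y, f, 3));
    return 0;
}
```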


MLP: Some Notation (1)

[Figure: an MLP with inputs x_1, ..., x_d; a neuron i in layer l with output f_i^{(l)}; a neuron j in layer l+1 with bias b_j and incoming weight w_{i,j}^{(l+1)}; layer sizes M^{(l)} and M^{(l+1)}; final prediction ŷ]

$w_{i,j}^{(l+1)} \mathrel{\hat=}$ weight from neuron $i$ in layer $l$ to neuron $j$ in layer $l+1$

DeepLearning on FPGAs 5


MLP: Learning

Obviously: We need to learn the weights $w_{i,j}^{(l)}$ and biases $b_j^{(l)}$

So far: We intuitively derived a learning algorithm

Observation: For MLPs we can compare the output layer with our desired output, but what about hidden layers?

Thus: We use gradient descent + "simple" math

Gradient descent:

$$\hat{w}_{new} = \hat{w}_{old} - \alpha \cdot \nabla_{\hat{w}}\, \ell(D, \hat{w})$$

Loss function:

$$\ell(D,\hat{w}) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{f}(\vec{x}_i)\right)^2}$$

DeepLearning on FPGAs 6


MLP: Learning (2)

$$\ell(D,\hat{w}) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{f}(\vec{x}_i)\right)^2}$$

Observation: We need to take the derivative of the loss function

But: The loss function looks complicated

Observation 1: The square root is monotone

Observation 2: The loss function depends on the entire training data set!

Thus: Perform stochastic gradient descent (SGD):

Randomly choose one example $i$ to compute the loss function
Update the parameters as in normal gradient descent
Continue until convergence

Note: For $\alpha \to 0$ it "almost surely" converges

DeepLearning on FPGAs 7
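The SGD recipe above fits in a few lines. The sketch below runs SGD in C for a plain linear model under the squared-error loss introduced on the next slide; the function name, the row-major data layout, and the fixed step count are my own assumptions.

```c
#include <stdlib.h>

/* Stochastic gradient descent for a linear model f(x) = sum_j theta_j x_j:
 * pick a random example, compute the per-example gradient, take a step. */
void sgd(double *theta, int dim, const double *X, const double *y,
         int n, double alpha, int steps) {
    for (int s = 0; s < steps; s++) {
        int i = rand() % n;              /* randomly choose one example */
        const double *x = &X[i * dim];
        double f = 0.0;
        for (int j = 0; j < dim; j++)    /* prediction of the linear model */
            f += theta[j] * x[j];
        double err = y[i] - f;           /* gradient of 1/2 (y - f)^2 w.r.t.
                                            theta_j is -(y - f) * x_j */
        for (int j = 0; j < dim; j++)
            theta[j] -= alpha * (-err * x[j]);
    }
}
```

In practice one would monitor the loss instead of running a fixed number of steps; the fixed loop just keeps the sketch short.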


MLP: Learning (3)

New loss function:

$$\ell(D,\hat{w}) = \frac{1}{2}\left(y_i - \hat{f}(\vec{x}_i)\right)^2$$

$$\nabla_{\hat{w}}\, \ell(D,\hat{w}) = \frac{1}{2} \cdot 2\left(y_i - \hat{f}(\vec{x}_i)\right) \cdot \left(-\frac{\partial \hat{f}(\vec{x}_i)}{\partial \hat{w}}\right) = -\left(y_i - \hat{f}(\vec{x}_i)\right) \frac{\partial \hat{f}(\vec{x}_i)}{\partial \hat{w}}$$

Observation: We need to compute the derivative $\frac{\partial \hat{f}(\vec{x}_i)}{\partial \hat{w}}$

$$\hat{f}(\vec{x}) = \begin{cases} +1 & \text{if } \sum_{i=1}^{d} w_i \cdot x_i + b \ge 0 \\ 0 & \text{else} \end{cases}$$

Observation: $\hat{f}$ is not continuous at 0 (it makes a step)

Thus: It is impossible to derive $\nabla_{\hat{w}}\, \ell(D,\hat{w})$ at 0, because $\hat{f}$ is not differentiable at 0!

DeepLearning on FPGAs 8


MLP: Activation function

Solution: We need to make $\hat{f}$ continuous

Bonus: This seems to be a little closer to real neurons

Bonus 2: We get non-linearity inside the network (more later)

Idea: Use the sigmoid activation function

[Plot: the sigmoid curve, rising smoothly from 0 to 1 around z = 0]

$$\sigma(z) = \frac{1}{1+e^{-\beta \cdot z}}, \quad \beta \in \mathbb{R}_{>0}$$

Note: $\beta$ controls the slope around 0

DeepLearning on FPGAs 9


Sigmoid activation function: Derivative

Given: $\sigma(z) = \frac{1}{1+e^{-\beta \cdot z}},\ \beta \in \mathbb{R}_{>0}$

Derivative:

$$\begin{aligned}
\frac{\partial \sigma(z)}{\partial z} &= \frac{\partial}{\partial z}\left(1+e^{-\beta z}\right)^{-1} = (-1)\left(1+e^{-\beta z}\right)^{-2}(-\beta)\,e^{-\beta z} \\
&= \frac{\beta e^{-\beta z}}{\left(1+e^{-\beta z}\right)^2} = \beta\, \frac{e^{-\beta z}}{1+e^{-\beta z}} \cdot \frac{1}{1+e^{-\beta z}} \\
&= \beta\, \frac{e^{-\beta z}+1-1}{1+e^{-\beta z}} \cdot \frac{1}{1+e^{-\beta z}} = \beta \left(\frac{1+e^{-\beta z}}{1+e^{-\beta z}} - \frac{1}{1+e^{-\beta z}}\right) \frac{1}{1+e^{-\beta z}} \\
&= \beta\,(1-\sigma(z))\,\sigma(z)
\end{aligned}$$

DeepLearning on FPGAs 10
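The closed form β(1−σ(z))σ(z) is exactly why sigmoid networks are cheap to train: once the forward pass has computed s = σ(z), the backward pass needs no further exponential. A small C sketch (the names are mine):

```c
#include <math.h>

#define BETA 1.0   /* slope parameter; later slides fix beta = 1 */

/* sigma(z) = 1 / (1 + exp(-beta * z)) */
double sigmoid(double z) {
    return 1.0 / (1.0 + exp(-BETA * z));
}

/* Derivative beta * (1 - sigma(z)) * sigma(z), expressed in terms of the
 * already-computed activation s = sigma(z): no exponential needed here. */
double sigmoid_deriv(double s) {
    return BETA * (1.0 - s) * s;
}
```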


MLP: Activation function (2)

But: Binary classification assumes $Y = \{0,+1\}$

Thus: Given $L$ layers in total

Internally: We use $f_j^{(l+1)} = \sigma\left(\sum_{i=0}^{M^{(l)}} w_{i,j}^{(l+1)} f_i^{(l)} + b_j^{(l+1)}\right)$

Prediction: Is mapped to 0 or 1:

$$\hat{f}(\vec{x}) = \begin{cases} +1 & \text{if } \sigma\left(\sum_{i=0}^{M^{(L-1)}} w_i^{(L)} f_i^{(L-1)} + b^{(L)}\right) \ge 0.5 \\ 0 & \text{else} \end{cases}$$

Learning with gradient descent:

$$w_{i,j}^{(l)} = w_{i,j}^{(l)} - \alpha \cdot \frac{\partial \ell}{\partial w_{i,j}^{(l)}}, \qquad b_j^{(l)} = b_j^{(l)} - \alpha \cdot \frac{\partial \ell}{\partial b_j^{(l)}}$$

DeepLearning on FPGAs 11
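The internal rule above is one loop per neuron per layer. Here is a C sketch of a single-layer forward pass, assuming β = 1 and a row-major weight array w[i * n_out + j]; all names are illustrative:

```c
#include <math.h>

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* Forward pass of one layer:
 * out[j] = sigma( sum_i w[i][j] * in[i] + b[j] ) */
void layer_forward(const double *in, int n_in,
                   const double *w, const double *b,
                   double *out, int n_out) {
    for (int j = 0; j < n_out; j++) {
        double y = b[j];                   /* start with the bias b_j */
        for (int i = 0; i < n_in; i++)
            y += w[i * n_out + j] * in[i]; /* weighted sum over inputs */
        out[j] = sigmoid(y);               /* activation f_j */
    }
}
```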


MLP: Notation Recap

Note: Too many $l$'s and $\ell$'s: use $E = \ell$ (loss) for easier reading

[Figure: the MLP from before — neuron i in layer l with output f_i^{(l)}, neuron j in layer l+1 with bias b_j and weight w_{i,j}^{(l+1)}, layer sizes M^{(l)} and M^{(l+1)}]

Find: $\frac{\partial E}{\partial w_{i,j}^{(l)}}$ and $\frac{\partial E}{\partial b_j^{(l)}}$

$M^{(l)} \mathrel{\hat=}$ number of neurons in layer $l$

$$y_j^{(l+1)} = \sum_{i=0}^{M^{(l)}} w_{i,j}^{(l+1)} f_i^{(l)} + b_j^{(l+1)}, \qquad f_j^{(l+1)} = \sigma\left(y_j^{(l+1)}\right)$$

$$\sigma(z) = \frac{1}{1+e^{-\beta \cdot z}}, \quad \beta = 1$$

DeepLearning on FPGAs 12


Backpropagation for sigmoid activation / RMSE loss

Gradient step:

$$w_{i,j}^{(l)} = w_{i,j}^{(l)} - \alpha \cdot \delta_j^{(l)} f_i^{(l-1)}, \qquad b_j^{(l)} = b_j^{(l)} - \alpha \cdot \delta_j^{(l)}$$

Recursion:

$$\delta_j^{(l-1)} = \underbrace{f_j^{(l-1)}\left(1-f_j^{(l-1)}\right)}_{\text{derivative of activation function}} \sum_{k=1}^{M^{(l)}} \delta_k^{(l)} w_{j,k}^{(l)}$$

$$\delta_j^{(L)} = \underbrace{-\left(y_i - f_j^{(L)}\right)}_{\text{derivative of loss function}} \underbrace{f_j^{(L)}\left(1-f_j^{(L)}\right)}_{\text{derivative of activation function}}$$

DeepLearning on FPGAs 14
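Transcribed into C, the two rules above become two short loops. A sketch for a scalar target y, with the same row-major weight layout as before (the function names are my own):

```c
/* Output layer: delta_j = -(y - f_j) * f_j * (1 - f_j) */
void output_deltas(const double *f, double y, double *delta, int n) {
    for (int j = 0; j < n; j++)
        delta[j] = -(y - f[j]) * f[j] * (1.0 - f[j]);
}

/* Hidden layer: delta_j = f_j * (1 - f_j) * sum_k delta_next_k * w[j][k],
 * where w connects this layer (index j) to the next layer (index k). */
void hidden_deltas(const double *f, int n,
                   const double *w, const double *delta_next, int n_next,
                   double *delta) {
    for (int j = 0; j < n; j++) {
        double s = 0.0;
        for (int k = 0; k < n_next; k++)
            s += delta_next[k] * w[j * n_next + k];
        delta[j] = f[j] * (1.0 - f[j]) * s;
    }
}
```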


Backpropagation for activation h / loss `

Gradient step:

$$w_{i,j}^{(l)} = w_{i,j}^{(l)} - \alpha \cdot \delta_j^{(l)} f_i^{(l-1)}, \qquad b_j^{(l)} = b_j^{(l)} - \alpha \cdot \delta_j^{(l)}$$

Recursion:

$$\delta_j^{(l-1)} = \frac{\partial h(y_i^{(l-1)})}{\partial y_i^{(l-1)}} \sum_{k=1}^{M^{(l)}} \delta_k^{(l)} w_{j,k}^{(l)}$$

$$\delta_j^{(L)} = \frac{\partial \ell(y_i^{(L)})}{\partial y_i^{(L)}} \cdot \frac{\partial h(y_i^{(L)})}{\partial y_i^{(L)}}$$

DeepLearning on FPGAs 15


Backpropagation: Different notation

Notation: We used scalar notation so far

Fact: The same results can be derived using matrix-vector notation

→ Notation depends on your preferences and background

For us: We want to implement backprop from scratch, so scalar notation is closer to our implementation

But: The literature usually uses matrix-vector notation for compactness:

$$\delta^{(l-1)} = \left(\left(W^{(l)}\right)^T \delta^{(l)}\right) \odot \frac{\partial h(y^{(l-1)})}{\partial y^{(l-1)}}, \qquad \delta^{(L)} = \nabla_{y^{(L)}}\, \ell(y^{(L)}) \odot \frac{\partial h(y^{(L)})}{\partial y^{(L)}}$$

Here $\nabla_{y^{(L)}}$ is the vectorial derivative, and $\odot$ is the Hadamard product / Schur product: element-wise multiplication

DeepLearning on FPGAs 16


Backpropagation: Some implementation ideas

Observation: Backprop is independent of the activation $h$ and the loss $\ell$

Thus: Implement neural networks layer-wise:

Each layer / neuron has an activation function
Each layer / neuron has the derivative of its activation function
Each layer has a weight matrix (either for input or output)
Each layer implements the delta computation
The output layer implements the delta computation with the loss function
Layers are either connected to each other and recursively call backprop, or some "control" function performs backprop

Thus: Arbitrary network architectures can be realised without changing the learning algorithm (one possible C layout is sketched below)

DeepLearning on FPGAs 17
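One way to realise this layer-wise design in C is a struct that stores the activation function and its derivative as function pointers next to the weights, so the same training loop drives any architecture. This is a sketch of the idea, not the course's reference implementation:

```c
typedef struct Layer {
    int n_in, n_out;
    double *w;                       /* n_in x n_out weight matrix */
    double *b;                       /* n_out biases */
    double *f;                       /* n_out activations (forward pass) */
    double *delta;                   /* n_out deltas (backward pass) */
    double (*act)(double);           /* activation function h */
    double (*act_deriv)(double);     /* derivative of h, given f = h(y) */
    struct Layer *next;              /* layers can call each other
                                        recursively during backprop */
} Layer;
```

Swapping sigmoid for ReLu then means reassigning two function pointers; the delta computation and the gradient step stay untouched.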


Network architectures

Question: So what is a good architecture?

Answer: Depends on the problem. Usually, architectures for new problems are published in scientific papers or even as PhD theses.

Some general ideas:

Non-linear activation: A network should contain at least one layer with a non-linear activation function for better learning
Sparse activation: To prevent over-fitting, only a few neurons of the network should be active at the same time
Fast convergence: The loss function / activation function should allow fast convergence in the first few epochs
Feature extraction: Combining multiple layers in deeper networks usually allows (higher-level) feature extraction

DeepLearning on FPGAs 18


Backpropagation: Vanishing gradients

Observation 1: $\sigma(z) = \frac{1}{1+e^{-\beta \cdot z}} \in [0,1]$

Observation 2: $\frac{\partial \sigma(z)}{\partial z} = \sigma(z) \cdot (1-\sigma(z)) \in [0,1]$

Observation 3: Errors are multiplied from the next layer

Thus: The error tends to become very small after a few layers

⇒ The gradient vanishes more and more in each layer

So far: No fundamental solution found, but a few suggestions:

Change the activation function
Exploit different optimization methods
Use more data / carefully adjust stepsizes
Reduce the number of parameters / the depth of the network

DeepLearning on FPGAs 19


New activation function: ReLu

Rectified Linear (ReLu):

[Plot: h(z) = max(0, z), zero for negative z and linear for positive z]

$$h(z) = \begin{cases} z & \text{if } z \ge 0 \\ 0 & \text{else} \end{cases} = \max(0, z)$$

$$\frac{\partial h(z)}{\partial z} = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{else} \end{cases}$$

Note: ReLu is not differentiable at $z = 0$!

But: Usually that is not a problem

Practical: $z = 0$ is pretty rare; just use 0 there. It works well
Mathematical: There exists a subgradient of $h(z)$ at 0

DeepLearning on FPGAs 20
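In C, ReLu and its derivative are one comparison each; following the practical advice above, this sketch simply returns 0 at z = 0:

```c
/* ReLu: h(z) = max(0, z) */
double relu(double z)       { return z >= 0.0 ? z : 0.0; }

/* (Sub)derivative of ReLu; at z == 0 we just use 0, as suggested above. */
double relu_deriv(double z) { return z > 0.0 ? 1.0 : 0.0; }
```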


ReLu (2)

Subgradients: A gradient shows the direction of the steepest descent

⇒ If a function is not differentiable, it has no steepest descent

⇒ There might be multiple (equally) "steepest descents"

For ReLu: We can choose $\left.\frac{\partial h(z)}{\partial z}\right|_{z=0}$ from $[0,1]$

Big Note: Using a subgradient does not guarantee that our loss function decreases! We might change weights for the worse!

Nice properties of ReLu:

Super-easy forward, backward and derivative computation
Either activates or deactivates a neuron (sparsity)
Fewer problems with vanishing gradients, since the error is multiplied by 1 or 0
Still gives the network non-linear activation

DeepLearning on FPGAs 21


Improve convergence for GD: Simple improvements

Gradient descent:

$$\hat{\theta}_{new} = \hat{\theta}_{old} - \alpha \cdot \nabla_{\theta}\, \ell(D, \hat{\theta}_{old})$$

Momentum: Keep the momentum from previous updates

$$\Delta\hat{\theta}_{old} = \alpha_1 \cdot \nabla_{\theta}\, \ell(D, \hat{\theta}_{old}) + \alpha_2\, \Delta\hat{\theta}_{old}, \qquad \hat{\theta}_{new} = \hat{\theta}_{old} - \Delta\hat{\theta}_{old}$$

(Mini-)Batch: Compute derivatives for multiple examples and average the direction (allows parallel computation of the gradient)

$$\hat{\theta}_{new} = \hat{\theta}_{old} - \alpha \cdot \frac{1}{K} \sum_{i=0}^{K} \nabla_{\theta}\, \ell(\vec{x}_i, \hat{\theta}_{old})$$

Note: For mini-batch approaches, convergence is not theoretically guaranteed

DeepLearning on FPGAs 22
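The momentum rule is a two-line change to the plain update. A C sketch, assuming grad already holds the current gradient and dtheta persists across calls (all names are mine):

```c
/* One gradient step with momentum:
 * dtheta = alpha1 * grad + alpha2 * dtheta;  theta -= dtheta */
void momentum_step(double *theta, double *dtheta, const double *grad,
                   int dim, double alpha1, double alpha2) {
    for (int j = 0; j < dim; j++) {
        dtheta[j] = alpha1 * grad[j] + alpha2 * dtheta[j];
        theta[j] -= dtheta[j];
    }
}
```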


Improve convergence: Stepsize

What about the stepsize?

If it's too small, you will learn slowly (→ more data required)
If it's too big, you might miss the optimum (→ bad results)

Thus usually: Small $\alpha = 0.001 - 0.1$ with a lot of data

Note: We can always reuse our data (multiple passes over the dataset)

But: The stepsize is problem-specific, as always!

Practical suggestion: A simple heuristic (sketched in code below):

Try out different stepsizes on a small subsample of the data
Pick the one that most reduces the loss
Use it on the full dataset

Sidenote: Changing the stepsize while training is also possible

DeepLearning on FPGAs 23
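The heuristic fits in a few lines of C. Here train_and_eval is a hypothetical callback that trains briefly on the small subsample with the given stepsize and returns the resulting loss:

```c
/* Try each candidate stepsize and keep the one with the lowest loss. */
double pick_stepsize(double (*train_and_eval)(double alpha)) {
    const double candidates[] = {0.001, 0.01, 0.1};
    int n = sizeof(candidates) / sizeof(candidates[0]);
    double best_alpha = candidates[0];
    double best_loss  = train_and_eval(candidates[0]);
    for (int i = 1; i < n; i++) {
        double loss = train_and_eval(candidates[i]);
        if (loss < best_loss) {
            best_loss  = loss;
            best_alpha = candidates[i];
        }
    }
    return best_alpha;
}
```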


Improve convergence: Loss functions

Recap: $\delta_j^{(L)}$ should be relatively large for faster learning:

$$\delta_j^{(L)} = \frac{\partial \ell(y_i^{(L)})}{\partial y_i^{(L)}} \cdot \frac{\partial h(y_i^{(L)})}{\partial y_i^{(L)}} = \frac{\partial \ell(\hat{y})}{\partial \hat{y}} \cdot \frac{\partial h(\hat{y})}{\partial \hat{y}}$$

Squared error: $\ell(D,\hat{\theta}) = \frac{1}{2}(y-\hat{y})^2 \Rightarrow \frac{\partial \ell}{\partial \hat{y}} = -(y-\hat{y})$

→ $\delta_j^{(L)} = -(y-\hat{y}) \cdot \frac{\partial h(\hat{y})}{\partial \hat{y}}$ is still small if sigmoid is used, because the derivative of the sigmoid tends to be small

Cross-entropy: $\ell(D,\hat{\theta}) = -\left(y \ln(\hat{y}) + (1-y)\ln(1-\hat{y})\right)$

$$\Rightarrow \frac{\partial \ell}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{(1-\hat{y})\,\hat{y}}$$

→ $\delta_j^{(L)} = \frac{\hat{y}-y}{(1-\hat{y})\,\hat{y}} \cdot \frac{\partial h(\hat{y})}{\partial \hat{y}} = \hat{y}-y$: the derivative of the sigmoid cancels against the small sigmoid values

DeepLearning on FPGAs 24


Improve Convergence: Start solution

Where do we start?

In SGD: Start with some $\theta$. SGD will walk us in the right direction

Important: For NNs (specifically for MSE + sigmoid activation) we need a "sane" initialization:

$$\delta_j^{(L)} = -\left(y_i - f_j^{(L)}\right) f_j^{(L)}\left(1-f_j^{(L)}\right) \;\Rightarrow\; \delta_j^{(L)} = 0 \text{ if } f_j^{(L)} = 0 \text{ or } f_j^{(L)} = 1$$

Therefore: Initialize the weights randomly with a Gaussian distribution

$$w_{i,j}^{(l)} \sim \mathcal{N}(0, \varepsilon) \text{ with } \varepsilon = 0.001 - 0.1$$

Bonus: Negative weights are also present

DeepLearning on FPGAs 25
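C's standard library has no Gaussian sampler, so the Box-Muller transform is a common way to implement this initialization. A sketch, reading ε as the (small) standard deviation; the slide does not spell out whether ε means the variance or the standard deviation, so treat the scale as an assumption:

```c
#include <math.h>
#include <stdlib.h>

/* Sample from N(0, eps) via the Box-Muller transform. */
double gaussian(double eps) {
    /* uniform samples in (0, 1); the +1 / +2 shift avoids log(0) */
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return eps * sqrt(-2.0 * log(u1)) * cos(2.0 * 3.14159265358979 * u2);
}

/* Initialize all weights of a layer; negative weights appear naturally. */
void init_weights(double *w, int count, double eps) {
    for (int i = 0; i < count; i++)
        w[i] = gaussian(eps);
}
```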


Summary

Important concepts:

For parameter optimization we define a loss function
For parameter optimization we use gradient descent
Neurons have activation functions to ensure non-linearity and differentiability
Backpropagation is an algorithm to compute the gradient
Non-linear and sparse networks are usually better

Various techniques can be used to improve convergence speed

DeepLearning on FPGAs 26


Homework

Homework until next meeting:

Implement the following network to solve the XOR problem:

[Figure: the proposed network with inputs x_1 and x_2]

Implement backpropagation for this network

Try a simple solution first: Hardcode one activation / one loss function with fixed access to data structures
If you feel comfortable, add new activation / loss functions

Tip 1: Verify that the proposed network uses 9 parameters (see the count below)
Tip 2: Start with $\alpha = 1.0$ and 10000 training examples

Note: We will later use C, so please use C or a C-like language

Question: Can you reduce the number of examples necessary?

DeepLearning on FPGAs 27
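For Tip 1, a quick count, assuming the pictured network is a 2-2-1 MLP (two inputs, a hidden layer with two neurons, one output neuron), which matches the 9-parameter hint:

$$\underbrace{2 \cdot 2}_{\text{input}\to\text{hidden weights}} + \underbrace{2}_{\text{hidden biases}} + \underbrace{2 \cdot 1}_{\text{hidden}\to\text{output weights}} + \underbrace{1}_{\text{output bias}} = 4 + 2 + 2 + 1 = 9$$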
