Word2vec embeddings: CBOW and Skipgram


VL Embeddings, Uni Heidelberg, SS 2019

Outline: Skipgram – Intuition · Gradient Descent · Stochastic Gradient Descent · Backpropagation

Skipgram – Intuition

Window size: 2

Center word at position t: Maus

Predict P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)

Die kleine graue Maus frißt den leckeren Käse

kleine = w_{t−2}, graue = w_{t−1}, Maus = w_t, frißt = w_{t+1}, den = w_{t+2}

Sliding the window, center word at position t: frißt

Predict P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)

Die kleine graue Maus frißt den leckeren Käse

graue = w_{t−2}, Maus = w_{t−1}, frißt = w_t, den = w_{t+1}, leckeren = w_{t+2}

The same probability distribution is used for all context words.

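To make the sliding-window intuition concrete, here is a minimal Python sketch (mine, not from the slides) that enumerates the (center word, context word) pairs produced by a window of size 2 over the example sentence:

```python
# Minimal sketch: enumerate (center, context) training pairs with window size m = 2.
sentence = "Die kleine graue Maus frißt den leckeren Käse".split()
m = 2  # window size

pairs = []
for t, center in enumerate(sentence):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < len(sentence)):
            continue  # skip the center word itself and positions outside the sentence
        pairs.append((center, sentence[t + j]))

# For the center word "Maus" this yields the context words "kleine", "graue", "frißt", "den".
print([ctx for c, ctx in pairs if c == "Maus"])
```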

Skipgram – Objective function

For each position t = 1, ..., T, predict the context words within a window of fixed size m, given the center word w_t.

Likelihood:

L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t; θ)    (1)

What is θ? θ: the vector representations of each word.

Objective function (cost function, loss function): maximise the probability of any context word given the current center word w_t.

The objective function J(θ) is the (average) negative log-likelihood:

J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)    (2)

Minimising the objective function ⇔ maximising predictive accuracy
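As a sanity check on equation (2), here is a small Python sketch (my own; `prob` is a hypothetical stand-in for any model probability P(context | center), for example the softmax defined later) that computes the average negative log-likelihood over one text:

```python
import math

def skipgram_nll(sentence, m, prob):
    """Average negative log-likelihood J(theta) from equation (2).

    sentence: list of tokens, m: window size,
    prob(context, center): assumed model probability P(context | center).
    """
    T = len(sentence)
    total = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):
                continue  # skip j = 0 and positions outside the text
            total += math.log(prob(sentence[t + j], sentence[t]))
    return -total / T
```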


Objective function – Motivation

We want to model the probability distribution over mutually exclusive classes,

measure the difference between the predicted probabilities ŷ and the ground-truth probabilities y,

and, during training, tune the parameters so that this difference is minimised.

Negative log-likelihood

Why is minimising the negative log-likelihood equivalent to maximum likelihood estimation (MLE)?

L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t; θ),    θ_MLE = argmax_θ L(θ; x)

The log allows us to convert a product of factors into a summation of factors (nicer mathematical properties).

argmax_x f(x) is equivalent to argmin_x (−f(x)).

J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)



We can interpret negative log-probability as information content or surprisal

What is the log-likelihood of a model, given an event?

⇒ The negative of the surprisal of the event, given the model:

A model is supported by an event to the extent that the event is unsurprising, given the model.

Cross entropy loss

Negative log-likelihood is the same as cross entropy.

Recap: Entropy

If a discrete random variable X has probability distribution p(x), then the entropy of X is

H(X) = ∑_x p(x) log(1/p(x)) = −∑_x p(x) log p(x)

⇒ the expected number of bits needed to encode X if we use an optimal coding scheme

Cross entropy

⇒ the number of bits needed to encode X if we use a suboptimal coding scheme q(x) instead of p(x)

H(p, q) = ∑_x p(x) log(1/q(x)) = −∑_x p(x) log q(x)


Cross entropy loss and Kullback-Leibler divergence

Cross entropy is always larger than entropy (exception: if p = q).

Kullback-Leibler (KL) divergence: the difference between cross entropy and entropy

KL(p‖q) = ∑_x p(x) log(1/q(x)) − ∑_x p(x) log(1/p(x)) = ∑_x p(x) log(p(x)/q(x))

⇒ the number of extra bits needed when using q(x) instead of p(x) (also known as the relative entropy of p with respect to q)

Cross entropy:

H(p, q) = −∑_{x∈X} p(x) log q(x) = H(p) + KL(p‖q)

Minimising H(p, q) → minimising the KL divergence from q to p
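A tiny numeric illustration (my own example distributions, not from the slides) of entropy, cross entropy, and the KL divergence:

```python
import math

# Entropy H(p), cross entropy H(p, q), and KL(p || q) for two small distributions.
p = [0.5, 0.25, 0.25]   # "true" distribution
q = [0.4, 0.4, 0.2]     # model / coding distribution

entropy = -sum(pi * math.log2(pi) for pi in p)
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
kl = cross_entropy - entropy  # KL(p || q), always >= 0

print(entropy, cross_entropy, kl)  # cross entropy >= entropy, equal only if p == q
```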


Cross-entropy loss (or logistic loss)

Use cross entropy to measure the difference between two distributions p and q.

Use the total cross entropy over all training examples as the loss:

L_cross-entropy(p, q) = −∑_i p_i log(q_i)

For hard classification this reduces to −log(q_t), where q_t is the predicted probability of the correct class.

J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)

Negative log-likelihood = cross entropy


Skipgram – Objective function

We want to minimise the objective function (the cross-entropy loss):

J(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)    (2)

Question: How do we calculate P(w_{t+j} | w_t; θ)?

Answer: We will use two vectors per word w:

v_w when w is a center word

u_w when w is a context word

Then for a center word c and a context word o:

P(o | c) = exp(u_o^⊤ v_c) / ∑_{w∈V} exp(u_w^⊤ v_c)    (3)

Take the dot products between the two word vectors and put them into a softmax.
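A minimal sketch (my own, with assumed toy sizes for the vocabulary and dimensionality; `V` and `U` are hypothetical center and context embedding matrices) of equation (3):

```python
import numpy as np

# P(o | c) = exp(u_o · v_c) / sum_w exp(u_w · v_c), with separate
# center-word vectors V and context-word vectors U.
rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100                 # assumed toy sizes
V = rng.normal(0, 0.1, (vocab_size, dim))   # v_w: rows are center-word vectors
U = rng.normal(0, 0.1, (vocab_size, dim))   # u_w: rows are context-word vectors

def p_context_given_center(o, c):
    """Softmax probability P(o | c) for context index o and center index c."""
    scores = U @ V[c]          # dot product u_w · v_c for every word w
    scores -= scores.max()     # shift for numerical stability before exponentiating
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()
```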


Recap: Dot products

A measure of similarity (well, kind of...)

Bigger if u and v are more similar (if the vectors point in the same direction)

u^⊤ v = u · v = ∑_{i=1}^{n} u_i v_i    (4)

Iterating over w = 1, ..., V: u_w^⊤ v

⇒ work out how similar each word is to v

P(o | c) = exp(u_o^⊤ v_c) / ∑_{w=1}^{V} exp(u_w^⊤ v_c)    (5)

Softmax function

Standard mapping from ℝ^V to a probability distribution:

p_i = exp(x_i) / ∑_{j=1}^{N} exp(x_j)

Exponentiate to make the values positive, then normalise to get a probability distribution.

The softmax function maps arbitrary values x_i to a probability distribution p_i:

"max" because it amplifies the probability of the largest x_i

"soft" because it still assigns some probability to smaller x_i

This gives us a probability estimate p(w_{t−1} | w_t).

Difference Sigmoid Function – Softmax

Sigmoid function:

used for binary classification in logistic regression

the sum of the probabilities is not necessarily 1

used as an activation function

Softmax function:

used for multi-class classification in logistic regression

the sum of the probabilities will be 1
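A quick numeric contrast (my own example scores, not from the slides) between element-wise sigmoids and a softmax over the same values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(sigmoid(scores), sigmoid(scores).sum())  # element-wise sigmoids: sum is generally != 1
print(softmax(scores), softmax(scores).sum())  # softmax: sum is exactly 1
```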

Why two representations for each word?

We create two representations for each word in the corpus:

1. w as a context word
2. w as a center word

Easier to compute → we can optimise vectors separately

Also works better in practice...

Skipgram – Predict the label

The dot product compares the similarity of o and c; a larger dot product means a larger probability.

p(o | c) = exp(u_o^⊤ v_c) / ∑_{w∈V} exp(u_w^⊤ v_c)    (6)

After taking the exponent, normalise over the entire vocabulary.

For training the model, compute this for all words in the corpus:

J(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)


Skipgram – Training the model

Recall: θ represents all model parameters, in one long vector.

For d-dimensional vectors and V-many words:

θ = (v_aas, v_amaranth, ..., v_zoo, u_aas, u_ameise, ..., u_zoo)^⊤ ∈ ℝ^{2dV}    (7)

Remember: every word has two vectors ⇒ 2d

We now optimise the parameters θ.

Generative model: predict the context for a given center word.

We have an objective function:

J(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)

We want to minimise the negative log-likelihood (i.e. maximise the probability we predict).

Probability distribution:

p(o | c) = exp(u_o^⊤ v_c) / ∑_{w∈V} exp(u_w^⊤ v_c)

How do we know how to change the parameters (i.e. the word vectors)?

→ Use the gradient


Minimising the objective function

We want to optimise (maximise or minimise) our objective function.

How do we know how to change the parameters? Use the gradient.

The gradient ∇J(θ) of a function gives the direction of steepest ascent.

Gradient Descent is an algorithm to minimise J(θ).

Gradient Descent – Intuition

Idea:

for the current value of θ, calculate the gradient of J(θ),

then take a small step in the direction of the negative gradient,

repeat.

Find the local minimum for a given cost function:

at each step, GD tells us in which direction to move to lower the cost.

No guarantee that we find the best global solution!

How do we know the direction? Best guess: move in the direction of the slope (gradient) of the cost function.

(Figure omitted: arrows show the gradient of the cost function at different points.)

Gradient of a function: a vector that points in the direction of the steepest ascent.

The gradient is deeply connected to the derivative.

Derivative f′ of a function: a single number that indicates how fast the function is rising when moving in the direction of its gradient.

f′(p): the value of f′ at point p

f′(p) > 0: f is going up

f′(p) < 0: f is going down

f′(p) = 0: f is flat

Gradient-Based Optimisation

Given some function y = f(x) with x, y ∈ ℝ, we want to optimise (maximise or minimise) it by updating x:

min_{x∈ℝ} f(x)

The derivative f′(x) of this function is dy/dx:

it gives the slope of f(x) at point x

⇒ it tells us how to change x to make a small improvement in y:

x_i = x_{i−1} − α f′(x_{i−1}),    α = step size or learning rate

Gradient Descent: reduce f(x) by moving x in small steps with the opposite sign of the derivative.

What if we have functions with multiple inputs?
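A tiny 1-D sketch (my own example function, not from the slides) of this update rule:

```python
# Minimise f(x) = (x - 3)^2, whose derivative is f'(x) = 2 * (x - 3).
def f_prime(x):
    return 2.0 * (x - 3.0)

x = 0.0          # initial value
alpha = 0.1      # step size / learning rate
for _ in range(100):
    x = x - alpha * f_prime(x)   # step against the sign of the derivative

print(x)  # converges towards the minimiser x = 3
```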


Gradient Descent with multiple inputs

We can use partial derivatives ∂f(x)/∂x_i:

∂f(x)/∂x_i measures how f changes as only x_i increases at point x.

Gradient of f, ∇_x f(x):

the vector containing all partial derivatives of f(x)

element i of the gradient is the partial derivative of f with respect to x_i

it gives the direction of steepest ascent

Which direction should we step in to decrease the function?

Gradient descent algorithm:

compute ∇_x f(x)

take a small step in the −∇_x f(x) direction

repeat

Minimise f by applying small updates to x: x′ = x − α ∇_x f(x)
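To illustrate the gradient as the vector of partial derivatives, here is a small sketch (my own example function) that approximates each ∂f/∂x_i with a central finite difference:

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2   # example function with two inputs

def numerical_gradient(f, x, eps=1e-6):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)  # central difference
    return grad

x = np.array([1.0, 2.0])
print(numerical_gradient(f, x))  # approximately [2 * x_0, 6 * x_1] = [2.0, 12.0]
```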


(Figures omitted: critical points in 2D (one input value) and in 3D.)

Gradient Descent with multiple inputs

Update equation (in matrix notation):

θ^new = θ^old − α ∇_θ J(θ)

α = step size or learning rate

Update equation (for a single parameter):

θ_j^new = θ_j^old − α ∂J(θ)/∂θ_j^old

Problem: J(θ) is a function of all windows in the corpus (extremely large!),

so ∇_θ J(θ) is very expensive to compute; a single update would take too long!

Solution: Stochastic Gradient Descent

Repeatedly sample windows and update the parameters after each one.

Stochastic Gradient Descent (SGD)

Goal: find parameters θ that reduce the cost function J(θ)

Algorithm 1: Pseudocode for SGD

1:  Input:
2:    – function f(x; θ)
3:    – training set of inputs x_1, ..., x_n and gold outputs y_1, ..., y_n
4:    – loss function J
5:  while stopping criteria not met do
6:    Sample a training example x_i, y_i
7:    Compute the loss J(f(x_i; θ), y_i)
8:    ∇ ← gradients of J(f(x_i; θ), y_i) w.r.t. θ
9:    Update θ ← θ − α∇
10: end while
11: return θ
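A runnable Python sketch of Algorithm 1 on a toy problem of my own choosing (fitting y = w·x with a squared-error loss; the data and loss are assumptions, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=100)
ys = 2.0 * xs + rng.normal(scale=0.1, size=100)   # gold outputs, true w = 2

w = 0.0          # parameter theta
alpha = 0.05     # learning rate
for step in range(1000):                # stopping criterion: fixed number of steps
    i = rng.integers(len(xs))           # sample a training example (x_i, y_i)
    pred = w * xs[i]                    # f(x_i; theta)
    grad = 2.0 * (pred - ys[i]) * xs[i] # dJ/dw for the squared-error loss
    w = w - alpha * grad                # update theta <- theta - alpha * grad

print(w)  # close to 2.0
```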

Impact of the learning rate α:

α too low → learning proceeds slowly

initial α too low → learning may become stuck with a high cost

An important property of SGD (and of related minibatch or online gradient-based optimisation): the computation time per update does not grow with the number of training examples.

For example, with parameters w_0, ..., w_{20000}:

θ = (w_0, w_1, w_2, ..., w_{19998}, w_{19999}, w_{20000})^⊤

−∇J(θ) = (0.31, 0.03, −1.25, ..., 0.78, −0.37, 0.16)^⊤

w_0 should increase somewhat, w_1 should increase a little, w_2 should decrease a lot, ..., w_{19998} should increase a lot, w_{19999} should decrease somewhat, w_{20000} should increase a little.

Averaged over all the training data, the gradient encodes the relative importance of each weight.


To compute the gradient for an update, make a forward pass through the network to compute the output,

take the output that the network predicts and the output that it should predict,

compute the total cost of the network J(θ),

and propagate the error back through the network ⇒ Backpropagation:

a procedure to compute the gradient of the cost function, i.e. the partial derivatives ∂J(θ)/∂w and ∂J(θ)/∂b of the cost function J(θ) with respect to any weight w or bias b in the network.

How do we have to change the weights and biases in order to change the cost?



Parameter initialisation

Before we start training the network we have to initialise the parameters

Why not use zero as initial values?

Not a good idea, outputs will be the same for all nodes

Instead, use small random numbers, e.g.:

use normally distributed values around zero, N(0, 0.1)

use Xavier initialisation (Glorot and Bengio 2010)

for debugging: use fixed random seeds

Now let’s start the training:

predict labels

compute loss

update parameters
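An initialisation sketch (my own, following the slide's suggestions; the dimension d and vocabulary size V are assumed toy values):

```python
import numpy as np

rng = np.random.default_rng(42)           # fixed random seed for debugging/reproducibility
d, V = 100, 50_000                        # assumed embedding dimension and vocabulary size

V_center = rng.normal(0.0, 0.1, (V, d))   # v_w vectors, drawn from N(0, 0.1) (std 0.1)
U_context = rng.normal(0.0, 0.1, (V, d))  # u_w vectors

# Xavier initialisation (Glorot and Bengio 2010) would instead scale by the layer
# sizes, e.g. limit = np.sqrt(6.0 / (fan_in + fan_out)) with a uniform draw.
```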

Forward pass

Computes the output of the network.

Each node's output depends only on itself and on its incoming edges.

Traverse the nodes and compute the output of each node, given the already computed outputs of its predecessors:

a_j^l = σ(∑_k w_{jk}^l a_k^{l−1} + b_j^l)

In vector terminology:

z^l = w^l a^{l−1} + b^l,    a^l = σ(z^l)

Image taken from http://neuralnetworksanddeeplearning.com/chap2.html
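A forward-pass sketch (my own, mirroring a^l = σ(w^l a^{l−1} + b^l); the layer sizes are an arbitrary toy choice):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

rng = np.random.default_rng(0)
sizes = [4, 3, 2]                     # input, hidden, output sizes (toy choice)
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(0, 0.1, m) for m in sizes[1:]]

def forward(x):
    a = x
    for w, b in zip(weights, biases):
        z = w @ a + b                 # z^l = w^l a^{l-1} + b^l
        a = sigma(z)                  # a^l = sigma(z^l)
    return a

print(forward(np.ones(4)))            # network output for a dummy input
```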


Skipgram – Intuition Gradient Descent Stochastic Gradient Descent Backpropagation

Parameter update for a 1-layer network

After a single forward pass, predict the output ˆy

Compute the cost J (a single scalar value), given the predicted ˆy and the ground truthy

Take the derivative of the cost J w.r.t w andb

Updatew andb by a fraction (learning rate) ofdw and db

(70)

Skipgram – Intuition Gradient Descent Stochastic Gradient Descent Backpropagation

Parameter update for a 1-layer network

Forward pass:

Z =W>X +b ˆ

y=A=σ(Z)

Use the chain rule:

dJ

dW = dAdJdAdZdWdZ

dJ

db = dAdJdAdZdZdb

Updatew andb:

W =W −αdWdJ b=b−αdJdb

Image taken fromhttp://www.adeveloperdiary.com/data-science/machine-learning/

understand-and-implement-the-backpropagation- algorithm- from-scratch-in-python/
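A compact numerical sketch (my own toy data and a binary cross-entropy cost; not the lecture's code) of this forward pass, chain rule, and update:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))          # 3 features, 8 examples (toy data)
y = rng.integers(0, 2, size=(1, 8))  # binary ground-truth labels

W = rng.normal(0, 0.1, size=(3, 1))  # weights
b = np.zeros((1, 1))                 # bias
alpha = 0.5                          # learning rate

for _ in range(500):
    Z = W.T @ X + b                          # forward pass: Z = W^T X + b
    A = 1.0 / (1.0 + np.exp(-Z))             # y_hat = A = sigmoid(Z)
    J = -np.mean(y * np.log(A) + (1 - y) * np.log(1 - A))  # cross-entropy cost
    dZ = (A - y) / y.shape[1]                # dJ/dA * dA/dZ for this cost
    dW = X @ dZ.T                            # dJ/dW
    db = dZ.sum(axis=1, keepdims=True)       # dJ/db
    W, b = W - alpha * dW, b - alpha * db    # update step

print(J)  # final training cost
```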
