Word2vec embeddings: CBOW and Skipgram


VL Embeddings, Uni Heidelberg, SS 2019

Outline: Skipgram – Intuition · Gradient Descent · Stochastic Gradient Descent · Backpropagation

Skipgram – Intuition

Window size: 2

Center word at position t: Maus

Predict P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)

Die kleine graue Maus frißt den leckeren Käse

kleine = w_{t−2}, graue = w_{t−1}, Maus = w_t, frißt = w_{t+1}, den = w_{t+2}

Sliding the window, center word at position t: frißt

Predict P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)

Die kleine graue Maus frißt den leckeren Käse

graue = w_{t−2}, Maus = w_{t−1}, frißt = w_t, den = w_{t+1}, leckeren = w_{t+2}

The same probability distribution is used for all context words.

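To make the sliding-window intuition concrete, here is a minimal Python sketch (mine, not from the slides) that enumerates the (center word, context word) pairs produced by a window of size 2 over the example sentence:

```python
# Minimal sketch: enumerate (center, context) training pairs with window size m = 2.
sentence = "Die kleine graue Maus frißt den leckeren Käse".split()
m = 2  # window size

pairs = []
for t, center in enumerate(sentence):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < len(sentence)):
            continue  # skip the center word itself and positions outside the sentence
        pairs.append((center, sentence[t + j]))

# For the center word "Maus" this yields the context words "kleine", "graue", "frißt", "den".
print([ctx for c, ctx in pairs if c == "Maus"])
```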

Skipgram – Objective function

For each position t = 1, ..., T, predict the context words within a window of fixed size m, given the center word w_t.

Likelihood:

L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t; θ)    (1)

What is θ? θ: the vector representations of each word.

Objective function (cost function, loss function): maximise the probability of any context word given the current center word w_t.

The objective function J(θ) is the (average) negative log-likelihood:

J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)    (2)

Minimising the objective function ⇔ maximising predictive accuracy
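As a sanity check on equation (2), here is a small Python sketch (my own; `prob` is a hypothetical stand-in for any model probability P(context | center), for example the softmax defined later) that computes the average negative log-likelihood over one text:

```python
import math

def skipgram_nll(sentence, m, prob):
    """Average negative log-likelihood J(theta) from equation (2).

    sentence: list of tokens, m: window size,
    prob(context, center): assumed model probability P(context | center).
    """
    T = len(sentence)
    total = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):
                continue  # skip j = 0 and positions outside the text
            total += math.log(prob(sentence[t + j], sentence[t]))
    return -total / T
```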


Objective function – Motivation

We want to model the probability distribution over mutually exclusive classes,

measure the difference between the predicted probabilities ŷ and the ground-truth probabilities y,

and, during training, tune the parameters so that this difference is minimised.

Negative log-likelihood

Why is minimising the negative log-likelihood equivalent to maximum likelihood estimation (MLE)?

L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t; θ),    θ_MLE = argmax_θ L(θ; x)

The log allows us to convert a product of factors into a summation of factors (nicer mathematical properties).

argmax_x f(x) is equivalent to argmin_x (−f(x)).

J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)



We can interpret negative log-probability as information content or surprisal

What is the log-likelihood of a model, given an event?

⇒ The negative of the surprisal of the event, given the model:

A model is supported by an event to the extent that the event is unsurprising, given the model.

Cross entropy loss

Negative log-likelihood is the same as cross entropy.

Recap: Entropy

If a discrete random variable X has probability distribution p(x), then the entropy of X is

H(X) = ∑_x p(x) log(1/p(x)) = −∑_x p(x) log p(x)

⇒ the expected number of bits needed to encode X if we use an optimal coding scheme

Cross entropy

⇒ the number of bits needed to encode X if we use a suboptimal coding scheme q(x) instead of p(x)

H(p, q) = ∑_x p(x) log(1/q(x)) = −∑_x p(x) log q(x)


Cross entropy loss and Kullback-Leibler divergence

Cross entropy is always larger than entropy (exception: if p = q).

Kullback-Leibler (KL) divergence: the difference between cross entropy and entropy

KL(p‖q) = ∑_x p(x) log(1/q(x)) − ∑_x p(x) log(1/p(x)) = ∑_x p(x) log(p(x)/q(x))

⇒ the number of extra bits needed when using q(x) instead of p(x) (also known as the relative entropy of p with respect to q)

Cross entropy:

H(p, q) = −∑_{x∈X} p(x) log q(x) = H(p) + KL(p‖q)

Minimising H(p, q) → minimising the KL divergence from q to p
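A tiny numeric illustration (my own example distributions, not from the slides) of entropy, cross entropy, and the KL divergence:

```python
import math

# Entropy H(p), cross entropy H(p, q), and KL(p || q) for two small distributions.
p = [0.5, 0.25, 0.25]   # "true" distribution
q = [0.4, 0.4, 0.2]     # model / coding distribution

entropy = -sum(pi * math.log2(pi) for pi in p)
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
kl = cross_entropy - entropy  # KL(p || q), always >= 0

print(entropy, cross_entropy, kl)  # cross entropy >= entropy, equal only if p == q
```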


Cross-entropy loss (or logistic loss)

Use cross entropy to measure the difference between two distributions p and q.

Use the total cross entropy over all training examples as the loss:

L_cross-entropy(p, q) = −∑_i p_i log(q_i)

For hard classification this reduces to −log(q_t), where q_t is the predicted probability of the correct class.

J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)

Negative log-likelihood = cross entropy


Skipgram – Objective function

We want to minimise the objective function (the cross-entropy loss):

J(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)    (2)

Question: How do we calculate P(w_{t+j} | w_t; θ)?

Answer: We will use two vectors per word w:

v_w when w is a center word

u_w when w is a context word

Then for a center word c and a context word o:

P(o | c) = exp(u_o^⊤ v_c) / ∑_{w∈V} exp(u_w^⊤ v_c)    (3)

Take the dot products between the two word vectors and put them into a softmax.
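A minimal sketch (my own, with assumed toy sizes for the vocabulary and dimensionality; `V` and `U` are hypothetical center and context embedding matrices) of equation (3):

```python
import numpy as np

# P(o | c) = exp(u_o · v_c) / sum_w exp(u_w · v_c), with separate
# center-word vectors V and context-word vectors U.
rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100                 # assumed toy sizes
V = rng.normal(0, 0.1, (vocab_size, dim))   # v_w: rows are center-word vectors
U = rng.normal(0, 0.1, (vocab_size, dim))   # u_w: rows are context-word vectors

def p_context_given_center(o, c):
    """Softmax probability P(o | c) for context index o and center index c."""
    scores = U @ V[c]          # dot product u_w · v_c for every word w
    scores -= scores.max()     # shift for numerical stability before exponentiating
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()
```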


Recap: Dot products

A measure of similarity (well, kind of...)

Bigger if u and v are more similar (if the vectors point in the same direction)

u^⊤ v = u · v = ∑_{i=1}^{n} u_i v_i    (4)

Iterating over w = 1, ..., V: u_w^⊤ v

⇒ work out how similar each word is to v

P(o | c) = exp(u_o^⊤ v_c) / ∑_{w=1}^{V} exp(u_w^⊤ v_c)    (5)

Softmax function

Standard mapping from ℝ^V to a probability distribution:

p_i = exp(x_i) / ∑_{j=1}^{N} exp(x_j)

Exponentiate to make the values positive, then normalise to get a probability distribution.

The softmax function maps arbitrary values x_i to a probability distribution p_i:

"max" because it amplifies the probability of the largest x_i

"soft" because it still assigns some probability to smaller x_i

This gives us a probability estimate p(w_{t−1} | w_t).

Difference Sigmoid Function – Softmax

Sigmoid function:

used for binary classification in logistic regression

the sum of the probabilities is not necessarily 1

used as an activation function

Softmax function:

used for multi-class classification in logistic regression

the sum of the probabilities will be 1
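A quick numeric contrast (my own example scores, not from the slides) between element-wise sigmoids and a softmax over the same values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(sigmoid(scores), sigmoid(scores).sum())  # element-wise sigmoids: sum is generally != 1
print(softmax(scores), softmax(scores).sum())  # softmax: sum is exactly 1
```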

Why two representations for each word?

We create two representations for each word in the corpus:

1. w as a context word
2. w as a center word

Easier to compute → we can optimise vectors separately

Also works better in practice...

Skipgram – Predict the label

The dot product compares the similarity of o and c; a larger dot product means a larger probability.

p(o | c) = exp(u_o^⊤ v_c) / ∑_{w∈V} exp(u_w^⊤ v_c)    (6)

After taking the exponent, normalise over the entire vocabulary.

For training the model, compute this for all words in the corpus:

J(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)


Skipgram – Training the model

Recall: θ represents all model parameters, in one long vector.

For d-dimensional vectors and V-many words:

θ = (v_aas, v_amaranth, ..., v_zoo, u_aas, u_ameise, ..., u_zoo)^⊤ ∈ ℝ^{2dV}    (7)

Remember: every word has two vectors ⇒ 2d

We now optimise the parameters θ.

Generative model: predict the context for a given center word.

We have an objective function:

J(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)

We want to minimise the negative log-likelihood (i.e. maximise the probability we predict).

Probability distribution:

p(o | c) = exp(u_o^⊤ v_c) / ∑_{w∈V} exp(u_w^⊤ v_c)

How do we know how to change the parameters (i.e. the word vectors)?

→ Use the gradient


Minimising the objective function

We want to optimise (maximise or minimise) our objective function.

How do we know how to change the parameters? Use the gradient.

The gradient ∇J(θ) of a function gives the direction of steepest ascent.

Gradient Descent is an algorithm to minimise J(θ).

Gradient Descent – Intuition

Idea:

for the current value of θ, calculate the gradient of J(θ),

then take a small step in the direction of the negative gradient,

repeat.

Find the local minimum for a given cost function:

at each step, GD tells us in which direction to move to lower the cost.

No guarantee that we find the best global solution!

How do we know the direction? Best guess: move in the direction of the slope (gradient) of the cost function.

(Figure omitted: arrows show the gradient of the cost function at different points.)

Gradient of a function: a vector that points in the direction of the steepest ascent.

The gradient is deeply connected to the derivative.

Derivative f′ of a function: a single number that indicates how fast the function is rising when moving in the direction of its gradient.

f′(p): the value of f′ at point p

f′(p) > 0: f is going up

f′(p) < 0: f is going down

f′(p) = 0: f is flat

Gradient-Based Optimisation

Given some function y = f(x) with x, y ∈ ℝ, we want to optimise (maximise or minimise) it by updating x:

min_{x∈ℝ} f(x)

The derivative f′(x) of this function is dy/dx:

it gives the slope of f(x) at point x

⇒ it tells us how to change x to make a small improvement in y:

x_i = x_{i−1} − α f′(x_{i−1}),    α = step size or learning rate

Gradient Descent: reduce f(x) by moving x in small steps with the opposite sign of the derivative.

What if we have functions with multiple inputs?
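A tiny 1-D sketch (my own example function, not from the slides) of this update rule:

```python
# Minimise f(x) = (x - 3)^2, whose derivative is f'(x) = 2 * (x - 3).
def f_prime(x):
    return 2.0 * (x - 3.0)

x = 0.0          # initial value
alpha = 0.1      # step size / learning rate
for _ in range(100):
    x = x - alpha * f_prime(x)   # step against the sign of the derivative

print(x)  # converges towards the minimiser x = 3
```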


Gradient Descent with multiple inputs

We can use partial derivatives ∂f(x)/∂x_i:

∂f(x)/∂x_i measures how f changes as only x_i increases at point x.

Gradient of f, ∇_x f(x):

the vector containing all partial derivatives of f(x)

element i of the gradient is the partial derivative of f with respect to x_i

it gives the direction of steepest ascent

Which direction should we step in to decrease the function?

Gradient descent algorithm:

compute ∇_x f(x)

take a small step in the −∇_x f(x) direction

repeat

Minimise f by applying small updates to x: x′ = x − α ∇_x f(x)
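To illustrate the gradient as the vector of partial derivatives, here is a small sketch (my own example function) that approximates each ∂f/∂x_i with a central finite difference:

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2   # example function with two inputs

def numerical_gradient(f, x, eps=1e-6):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)  # central difference
    return grad

x = np.array([1.0, 2.0])
print(numerical_gradient(f, x))  # approximately [2 * x_0, 6 * x_1] = [2.0, 12.0]
```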


(Figures omitted: critical points in 2D (one input value) and in 3D.)

Gradient Descent with multiple inputs

Update equation (in matrix notation):

θ^new = θ^old − α ∇_θ J(θ)

α = step size or learning rate

Update equation (for a single parameter):

θ_j^new = θ_j^old − α ∂J(θ)/∂θ_j^old

Problem: J(θ) is a function of all windows in the corpus (extremely large!),

so ∇_θ J(θ) is very expensive to compute; a single update would take too long!

Solution: Stochastic Gradient Descent

Repeatedly sample windows and update the parameters after each one.

Stochastic Gradient Descent (SGD)

Goal: find parameters θ that reduce the cost function J(θ)

Algorithm 1: Pseudocode for SGD

1:  Input:
2:    – function f(x; θ)
3:    – training set of inputs x_1, ..., x_n and gold outputs y_1, ..., y_n
4:    – loss function J
5:  while stopping criteria not met do
6:    Sample a training example x_i, y_i
7:    Compute the loss J(f(x_i; θ), y_i)
8:    ∇ ← gradients of J(f(x_i; θ), y_i) w.r.t. θ
9:    Update θ ← θ − α∇
10: end while
11: return θ
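A runnable Python sketch of Algorithm 1 on a toy problem of my own choosing (fitting y = w·x with a squared-error loss; the data and loss are assumptions, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=100)
ys = 2.0 * xs + rng.normal(scale=0.1, size=100)   # gold outputs, true w = 2

w = 0.0          # parameter theta
alpha = 0.05     # learning rate
for step in range(1000):                # stopping criterion: fixed number of steps
    i = rng.integers(len(xs))           # sample a training example (x_i, y_i)
    pred = w * xs[i]                    # f(x_i; theta)
    grad = 2.0 * (pred - ys[i]) * xs[i] # dJ/dw for the squared-error loss
    w = w - alpha * grad                # update theta <- theta - alpha * grad

print(w)  # close to 2.0
```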

Impact of the learning rate α:

α too low → learning proceeds slowly

initial α too low → learning may become stuck with a high cost

An important property of SGD (and of related minibatch or online gradient-based optimisation): the computation time per update does not grow with the number of training examples.

For example, with parameters w_0, ..., w_{20000}:

θ = (w_0, w_1, w_2, ..., w_{19998}, w_{19999}, w_{20000})^⊤

−∇J(θ) = (0.31, 0.03, −1.25, ..., 0.78, −0.37, 0.16)^⊤

w_0 should increase somewhat, w_1 should increase a little, w_2 should decrease a lot, ..., w_{19998} should increase a lot, w_{19999} should decrease somewhat, w_{20000} should increase a little.

Averaged over all the training data, the gradient encodes the relative importance of each weight.


To compute the gradient for an update, make a forward pass through the network to compute the output,

take the output that the network predicts and the output that it should predict,

compute the total cost of the network J(θ),

and propagate the error back through the network ⇒ Backpropagation:

a procedure to compute the gradient of the cost function, i.e. the partial derivatives ∂J(θ)/∂w and ∂J(θ)/∂b of the cost function J(θ) with respect to any weight w or bias b in the network.

How do we have to change the weights and biases in order to change the cost?



Parameter initialisation

Before we start training the network we have to initialise the parameters

Why not use zero as initial values?

Not a good idea, outputs will be the same for all nodes

Instead, use small random numbers, e.g.:

use normally distributed values around zero, N(0, 0.1)

use Xavier initialisation (Glorot and Bengio 2010)

for debugging: use fixed random seeds

Now let’s start the training:

predict labels

compute loss

update parameters
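An initialisation sketch (my own, following the slide's suggestions; the dimension d and vocabulary size V are assumed toy values):

```python
import numpy as np

rng = np.random.default_rng(42)           # fixed random seed for debugging/reproducibility
d, V = 100, 50_000                        # assumed embedding dimension and vocabulary size

V_center = rng.normal(0.0, 0.1, (V, d))   # v_w vectors, drawn from N(0, 0.1) (std 0.1)
U_context = rng.normal(0.0, 0.1, (V, d))  # u_w vectors

# Xavier initialisation (Glorot and Bengio 2010) would instead scale by the layer
# sizes, e.g. limit = np.sqrt(6.0 / (fan_in + fan_out)) with a uniform draw.
```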

Forward pass

Computes the output of the network.

Each node's output depends only on itself and on its incoming edges.

Traverse the nodes and compute the output of each node, given the already computed outputs of its predecessors:

a_j^l = σ(∑_k w_{jk}^l a_k^{l−1} + b_j^l)

In vector terminology:

z^l = w^l a^{l−1} + b^l,    a^l = σ(z^l)

Image taken from http://neuralnetworksanddeeplearning.com/chap2.html
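A forward-pass sketch (my own, mirroring a^l = σ(w^l a^{l−1} + b^l); the layer sizes are an arbitrary toy choice):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

rng = np.random.default_rng(0)
sizes = [4, 3, 2]                     # input, hidden, output sizes (toy choice)
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(0, 0.1, m) for m in sizes[1:]]

def forward(x):
    a = x
    for w, b in zip(weights, biases):
        z = w @ a + b                 # z^l = w^l a^{l-1} + b^l
        a = sigma(z)                  # a^l = sigma(z^l)
    return a

print(forward(np.ones(4)))            # network output for a dummy input
```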


Skipgram – Intuition Gradient Descent Stochastic Gradient Descent Backpropagation

Parameter update for a 1-layer network

After a single forward pass, predict the output ˆy

Compute the cost J (a single scalar value), given the predicted ˆy and the ground truthy

Take the derivative of the cost J w.r.t w andb

Updatew andb by a fraction (learning rate) ofdw and db

(70)

Skipgram – Intuition Gradient Descent Stochastic Gradient Descent Backpropagation

Parameter update for a 1-layer network

Forward pass:

Z =W>X +b ˆ

y=A=σ(Z)

Use the chain rule:

dJ

dW = dAdJdAdZdWdZ

dJ

db = dAdJdAdZdZdb

Updatew andb:

W =W −αdWdJ b=b−αdJdb

Image taken fromhttp://www.adeveloperdiary.com/data-science/machine-learning/

understand-and-implement-the-backpropagation- algorithm- from-scratch-in-python/
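A compact numerical sketch (my own toy data and a binary cross-entropy cost; not the lecture's code) of this forward pass, chain rule, and update:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))          # 3 features, 8 examples (toy data)
y = rng.integers(0, 2, size=(1, 8))  # binary ground-truth labels

W = rng.normal(0, 0.1, size=(3, 1))  # weights
b = np.zeros((1, 1))                 # bias
alpha = 0.5                          # learning rate

for _ in range(500):
    Z = W.T @ X + b                          # forward pass: Z = W^T X + b
    A = 1.0 / (1.0 + np.exp(-Z))             # y_hat = A = sigmoid(Z)
    J = -np.mean(y * np.log(A) + (1 - y) * np.log(1 - A))  # cross-entropy cost
    dZ = (A - y) / y.shape[1]                # dJ/dA * dA/dZ for this cost
    dW = X @ dZ.T                            # dJ/dW
    db = dZ.sum(axis=1, keepdims=True)       # dJ/db
    W, b = W - alpha * dW, b - alpha * db    # update step

print(J)  # final training cost
```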
