Uncovering divergent linguistic information in word embeddings
VL Embeddings
Uni Heidelberg
SS 2019
Uncovering linguistic information in word embeddings
Artetxe et al. (2018): Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation
• Word embeddings capture more information than we can directly observe
• We can apply linear transformations to pretrained embeddings to adjust their performance on different tasks along the axes of similarity/relatedness and semantics/syntax
[Figure: FastText is good for syntactic analogies, dependency-based embeddings for functional similarities, GloVe for semantic analogies]
Linear transformation of embedding matrix
Artetxe et al. (2018)
• Let X be the word embedding matrix
• Let X_i be the embedding of the i-th word in the vocabulary
• The dot product sim(i,j) = X_i · X_j is a measure of the similarity between the i-th and the j-th word
• Define the similarity matrix M(X) := XX^T so that sim(i,j) = M(X)_ij ⇒ first order similarity
• We can also define a second order similarity measure:
  • first order: How similar are w_i and w_j?
  • second order: How similar are the contexts of w_i and w_j?
• We can even define a third, fourth, or n-th order similarity
Idea: Some higher order similarities might be better at capturing specific aspects of language.
Linear transformation of embedding matrix
Artetxe et al. (2018)

  X =
         cat    -0.19   0.45  -0.40
         dog    -0.28   0.43  -0.39
         blue    0.02  -0.40  -0.39
         red     0.03  -0.22  -0.31
         happy  -0.03  -0.07  -0.19

  X^T =
                  cat    dog   blue    red  happy
                -0.19  -0.28   0.02   0.03  -0.03
                 0.45   0.43  -0.40  -0.22  -0.07
                -0.40  -0.39  -0.39  -0.31  -0.19

  sim(i,j) = X_i · X_j
  M(X) := XX^T so that sim(i,j) = M(X)_ij
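As a sanity check, the slide's toy matrix can be worked through directly in NumPy: the first order similarity matrix is just the Gram matrix XX^T, and each entry is the dot product of two word vectors.

```python
import numpy as np

# The slide's toy embedding matrix X (rows: cat, dog, blue, red, happy)
X = np.array([
    [-0.19,  0.45, -0.40],  # cat
    [-0.28,  0.43, -0.39],  # dog
    [ 0.02, -0.40, -0.39],  # blue
    [ 0.03, -0.22, -0.31],  # red
    [-0.03, -0.07, -0.19],  # happy
])

# First order similarity matrix M(X) := X X^T, so sim(i, j) = M(X)_ij
M = X @ X.T

cat, dog, blue = 0, 1, 2
# An entry of M(X) is exactly the dot product of the two word vectors
assert np.isclose(M[cat, dog], X[cat] @ X[dog])
# Under this measure, cat is more similar to dog than to blue
assert M[cat, dog] > M[cat, blue]
```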
Linear transformation of embedding matrix
Artetxe et al. (2018)
• Define the first order similarity matrix as
  M(X) := XX^T so that sim(i,j) = M(X)_ij
• Define the second order similarity matrix as
  M2(X) := M(M(X)) so that sim2(i,j) = M2(X)_ij
  where M2(X) = XX^TXX^T
• Define the n-th order similarity matrix as
  Mn(X) = (XX^T)^n so that simn(i,j) = Mn(X)_ij
Instead of changing the similarity measure, we can also change the word embeddings themselves through a linear transformation so that they directly capture this second or n-th order similarity
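These definitions translate directly into code: since M(X) is symmetric, M(M(X)) is just the matrix square of XX^T, and the n-th order similarity is its n-th matrix power. A minimal sketch with toy random vectors (not the paper's embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # toy embeddings: 5 words, 3 dimensions

def M(A):
    """First order similarity matrix M(A) := A A^T."""
    return A @ A.T

def Mn(X, n):
    """n-th order similarity matrix Mn(X) = (X X^T)^n."""
    return np.linalg.matrix_power(M(X), n)

# M2(X) := M(M(X)) agrees with (X X^T)^2 because M(X) is symmetric
assert np.allclose(Mn(X, 2), M(M(X)))
```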
Linear transformation of embedding matrix
Artetxe et al. (2018)

Definition: M2(X) := M(M(X))

  M2(X) = M(M(X))
        = M(X) M(X)^T
        = XX^T (XX^T)^T        (AB)^T = B^T A^T
        = XX^T (X^T)^T X^T
        = XX^T XX^T
Linear transformation of embedding matrix
Artetxe et al. (2018)
• Let X^TX = QΛQ^T be the eigendecomposition of X^TX
• Λ is a positive diagonal matrix whose entries are the eigenvalues of X^TX, and Q is an orthogonal matrix with the respective eigenvectors as columns
• Define the linear transformation matrix W := Q√Λ
• Apply W to the original embeddings X ⇒ X' = XW
• M(X') = M2(X) ⇒ the transformed embeddings X' capture the second order similarity as defined for the original embeddings
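This construction is short enough to verify numerically: take the eigendecomposition of X^TX, build W = Q√Λ, and check that the Gram matrix of X' = XW equals the second order similarity of X. A sketch with toy random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # toy embeddings: 5 words, 3 dimensions

# Eigendecomposition of the symmetric PSD matrix X^T X = Q Λ Q^T
lam, Q = np.linalg.eigh(X.T @ X)

# Linear transformation W = Q sqrt(Λ), applied as X' = X W
W = Q @ np.diag(np.sqrt(lam))
X2 = X @ W

# M(X') = X' X'^T = X Q Λ Q^T X^T = X X^T X X^T = M2(X):
# the transformed embeddings capture the second order similarity
M = X @ X.T
assert np.allclose(X2 @ X2.T, M @ M)
```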
Linear transformation of embedding matrix
Artetxe et al. (2018)

  X = U · Σ · Q^T               SVD of X (m×n), with U of size m×m, Σ of size m×n, Q^T of size n×n

  X^TX = (U·Σ·Q^T)^T · (U·Σ·Q^T)
       = (Q^T)^T·Σ^T·U^T · U·Σ·Q^T
       = Q·Σ^T·Σ·Q^T            U is orthonormal, so U^T·U = I
       = Q·Λ·Q^T                Σ is a diagonal matrix with the square roots of the eigenvalues, so Σ^T·Σ = Λ
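So in practice Q and Λ can be obtained from the SVD of X instead of an explicit eigendecomposition: the right singular vectors are the eigenvectors of X^TX, and the squared singular values are its eigenvalues. A small NumPy check of this identity (toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # toy embeddings, m=5 words, n=3 dimensions

# SVD: X = U Σ Q^T
U, s, Qt = np.linalg.svd(X, full_matrices=False)
Q = Qt.T

# X^T X = Q Σ^T Σ Q^T = Q Λ Q^T: the eigenvalues Λ are the squared
# singular values and the columns of Q are the eigenvectors
Lam = np.diag(s ** 2)
assert np.allclose(X.T @ X, Q @ Lam @ Q.T)
```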
Linear transformation of embedding matrix
Artetxe et al. (2018)

  M(X') = X'·X'^T                      first order similarity of X'
        = X·W·(XW)^T                   (AB)^T = B^T A^T
        = X·W·W^T·X^T                  W := Q√Λ
        = X·Q√Λ·(Q√Λ)^T·X^T            (AB)^T = B^T A^T
        = X·Q√Λ·√Λ^T·Q^T·X^T
        = X·Q·Λ·Q^T·X^T                X^TX = QΛQ^T
        = X·X^T·X·X^T = M2(X)          second order similarity of X
Linear transformation of embedding matrix
Artetxe et al. (2018)

More generally
• Define Wα := QΛ^α
  where α is a parameter of the transformation that adjusts it to the desired similarity order:

  first order similarity     α = 0          ⇒ M(XW0) = M(X)
  second order similarity    α = 0.5        ⇒ M(XW0.5) = M2(X)
  n-th order similarity      α = (n−1)/2    ⇒ M(XWα) = Mn(X)
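The whole family of transformations is one function of α. A sketch that builds Wα = QΛ^α and checks the table's cases on toy random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # toy embeddings

lam, Q = np.linalg.eigh(X.T @ X)  # X^T X = Q Λ Q^T

def W(alpha):
    """Transformation matrix W_alpha := Q Λ^alpha."""
    return Q @ np.diag(lam ** alpha)

M = X @ X.T
# alpha = 0: Λ^0 = I, so W_0 = Q is orthogonal and M(X W_0) = M(X)
assert np.allclose((X @ W(0.0)) @ (X @ W(0.0)).T, M)
# alpha = 0.5: M(X W_0.5) = M2(X)
assert np.allclose((X @ W(0.5)) @ (X @ W(0.5)).T, M @ M)
```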
Linear transformation of embedding matrix
Artetxe et al. (2018)

  M(XW0) = M(X·Q·Λ^0)          Λ^0 is the identity matrix
         = M(X·Q)
         = X·Q·(XQ)^T
         = X·Q·Q^T·X^T         Q is orthogonal, so Q·Q^T = I
         = X·X^T
         = M(X)
Linear transformation of embedding matrix
Artetxe et al. (2018)

  M(XW0.5) = M(X·Q√Λ)
           = X·Q√Λ·(X·Q√Λ)^T
           = X·Q√Λ·√Λ^T·Q^T·X^T
           = X·Q√Λ·√Λ·Q^T·X^T       √Λ is diagonal, so √Λ^T = √Λ
           = X·Q·Λ·Q^T·X^T
           = X·X^T·X·X^T
           = M2(X)
Linear transformation of embedding matrix
Artetxe et al. (2018)
• Assuming that the embeddings X capture some second order similarity, it is possible to transform them so that they capture the corresponding first order similarity
• One can easily generalise this to higher order similarities by using smaller (negative) values of α
⇒ The parameter α can be used to either increase or decrease the similarity order that we want our embeddings to capture
⇒ α can be continuous
Linear transformation of different embeddings
Artetxe et al (2018)
Lessons learned for intrinsic and extrinsic evaluation
Artetxe et al. (2018)
• Standard intrinsic evaluation is static and incomplete
⇒ Intrinsic evaluation is not a good predictor of performance in downstream applications
⇒ Systems that use embeddings as features can learn the task-specific optimal balance between the two axes
References
• Mikolov, Yih and Zweig (2013): Linguistic regularities in continuous space word representations. NAACL 2013.
• Faruqui, Tsvetkov, Rastogi and Dyer (2016): Problems With Evaluation of Word Embeddings Using Word Similarity Tasks. The 1st Workshop on Evaluating Vector Space Representations for NLP, Berlin, Germany.
• Artetxe, Labaka, Lopez-Gazpio and Agirre (2018): Uncovering Divergent Linguistic Information in Word Embeddings with Lessons for Intrinsic and Extrinsic Evaluation. CoNLL 2018, Brussels, Belgium.
• Rubenstein and Goodenough (1965): Contextual correlates of synonymy. Communications of the ACM 8(10): 627-633.
• Harris (1954): Distributional structure. Word 10(23): 146-162.
• Bruni, Tran and Baroni (2014): Multimodal Distributional Semantics. Journal of Artificial Intelligence Research 49: 1-47.
• Collobert, Weston, Bottou, Karlen, Kavukcuoglu and Kuksa (2011): Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research 12: 2461-2505.
• Lu, Wang, Bansal, Gimpel and Livescu (2015): Deep multilingual correlation for improved word embeddings. NAACL 2015.
• Rastogi, Van Durme and Arora (2015): Multiview LSA: Representation learning via generalized CCA. NAACL 2015.
• Chiu, Korhonen and Pyysalo (2016): Intrinsic Evaluation of Word Vectors Fails to Predict Extrinsic Performance. ACL 2016.

Data and Code
• Code for Artetxe et al. (2018): https://github.com/artetxem/uncovec
• The MEN dataset: https://staff.fnwi.uva.nl/e.bruni/MEN
• Datasets for word vector evaluation: https://github.com/vecto-ai/word-benchmarks