
2.2.2 Neural Network Components

Neural network architectures are constructed from different building blocks. In this section, we discuss the building blocks that are relevant to NLP and to our tasks.


Token representations

Natural language consists of discrete elements, such as words, which, as previously discussed, can be decomposed into sub-word units. In order to use such discrete elements in a model, we need to find vector representations for the tokens.

A simple vector representation for tokens that has long been in use is the one-hot vector representation. This representation can be formalized as a function $f_{\mathrm{one\text{-}hot}}: \mathcal{V} \mapsto \mathbb{N}_0^{|\mathcal{V}|}$ that maps any token $w$ from the vocabulary $\mathcal{V}$ onto a vector $v$ of the size of the vocabulary, such that only the element in the vector corresponding to the token $w$ is equal to one and the others are equal to zero:

$$v_i = \begin{cases} 1, & \text{if } i = \mathrm{id}(w) \\ 0, & \text{otherwise} \end{cases} \,, \qquad (2.2)$$

where the function $\mathrm{id}: \mathcal{V} \mapsto \mathbb{N}_0$ maps the word $w$ to its unique integer id.
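To make this mapping concrete, the following is a minimal Python sketch of Equation 2.2; the toy vocabulary and the helper name f_one_hot are illustrative and not part of the original text.

```python
import numpy as np

# Illustrative toy vocabulary; id(w) is the position of w in this list.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def f_one_hot(w: str) -> np.ndarray:
    """Map a token w to a |V|-dimensional one-hot vector (Equation 2.2)."""
    v = np.zeros(len(vocab))
    v[word_to_id[w]] = 1.0
    return v

print(f_one_hot("cat"))  # [0. 1. 0. 0. 0.]
```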

Since the one-hot representation does not scale well in practice, and the distance between two words is the same for any pair of words, tokens are usually represented as dense vectors of relatively low dimensionality. The token embedding function is a function $f_{\mathrm{emb}}: \mathcal{V} \mapsto \mathbb{R}^{d}$, where $d$ is some chosen dimension for the token embedding space. Note that such a dense embedding allows the vector representations of different words to be arbitrarily close to each other, which is useful for handling synonyms. In addition, the dimensionality of the vectors can be kept low and does not directly depend on the vocabulary size. The function $f_{\mathrm{emb}}$ can be implemented as:

$$f_{\mathrm{emb}}(w) = A \cdot v_w \,, \qquad (2.3)$$

where $v_w \in \mathbb{N}_0^{|\mathcal{V}|}$ is the one-hot vector for word $w$ and $A \in \mathbb{R}^{d \times |\mathcal{V}|}$ is a trainable matrix containing the vectors for every word in $\mathcal{V}$.
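As a rough sketch of Equation 2.3 (again with an illustrative toy vocabulary and embedding dimension), the dense embedding is a matrix-vector product; because $v_w$ is one-hot, it reduces to selecting a column of $A$, which is what embedding layers in deep learning frameworks implement.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]   # illustrative toy vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}
d = 4                                        # chosen embedding dimension
A = 0.01 * np.random.randn(d, len(vocab))    # trainable matrix of shape d x |V|

def f_emb(w: str) -> np.ndarray:
    """Equation 2.3: multiply A with the one-hot vector of w."""
    v = np.zeros(len(vocab))
    v[word_to_id[w]] = 1.0
    return A @ v

# Multiplying by a one-hot vector amounts to a column lookup:
assert np.allclose(f_emb("cat"), A[:, word_to_id["cat"]])
```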

In the early days of deep learning for NLP, several fundamental works investigated different ways of pre-training word vectors on large unsupervised corpora. Among the most commonly used are Word2Vec [39] and GloVe [40]. These methods provide initializations for word vector parameters that capture synonymy and other relations between words, induced from statistics on a large text corpus (see Section 2.4.1). Essentially, similar words obtain similar embeddings, which improves generalization to unobserved words. While the recent focus in NLP has shifted towards pretrained transformers (see Section 2.4.2), pretrained word embeddings are still used in applications due to their simplicity and efficiency.

Recurrent Neural Networks

Natural languages, as well as formal languages, are sequential in nature, which means that we require models that can process and generalize across different sequence lengths. Recurrent neural networks (RNNs) are a type of neural network suitable for this case. RNNs typically process the sequence one token at a time and use a "state" variable/vector that describes the sequence elements "consumed" so far. Thus, the typical RNN is a parameterized function $f: \mathbb{R}^{d_i} \times \mathbb{R}^{d_h} \mapsto \mathbb{R}^{d_h}$ that computes a new state $h_t \in \mathbb{R}^{d_h}$ from the previous state $h_{t-1} \in \mathbb{R}^{d_h}$ and the current input $x_t \in \mathbb{R}^{d_i}$:

$$h_t = f(h_{t-1}, x_t; \theta) \qquad (2.4)$$


Figure 2.1: Gated Recurrent Unit (GRU)

RNNs usually share their parameters across time steps to improve generalization.

The most basic RNN can be implemented as a single layer with an affine transformation and non-linearity:

β„Žπ‘‘ =tanh(π‘Š β„Žπ‘‘βˆ’

1+π‘ˆ π‘₯

𝑑+𝑏) , (2.5)

whereπ‘Š andπ‘ˆare trainable linear transformations and𝑏is a trainable bias term.

However, it has been shown that such basic RNNs suffer from gradient problems [41–44] due to multiplicative updates and non-linearities in the backpropagation path to an early state. In practice, gated RNNs are used, such as the Long Short-Term Memory (LSTM [42]) unit or the Gated Recurrent Unit (GRU [45]). Both implement additive gated state updates, which prevents the gradients from vanishing or exploding due to linear transformations and repeated non-linearities. The GRU is defined according to the following equations and is illustrated in Figure 2.1:

z𝑑 =𝜎(π‘Š

𝑧x𝑑+π‘ˆ

𝑧hπ‘‘βˆ’1+b𝑧) (2.6)

r𝑑 =𝜎(π‘Š

π‘Ÿx𝑑+π‘ˆ

π‘Ÿhπ‘‘βˆ’1+bπ‘Ÿ) (2.7)

hˆ𝑑 =tanh(π‘Šx𝑑+π‘ˆ(hπ‘‘βˆ’1βŠ™r𝑑) +bβ„Ž) (2.8) h𝑑 =(1βˆ’z𝑑) βˆ—hπ‘‘βˆ’1+z𝑑 βˆ—hˆ𝑑 , (2.9) where theπ‘Š

Γ—andπ‘ˆ

Γ—are trainable matrices and𝑏

Γ—are trainable bias terms.

The current hidden state $\mathbf{h}_t$ at time step $t$ of the RNN (which is also its output at time $t$) is computed by interpolating between the state $\mathbf{h}_{t-1}$ at the previous time step and the candidate state $\hat{\mathbf{h}}_t$ (Equation 2.9), with $\mathbf{z}_t$ the update vector and $\odot$ the element-wise vector product. The update gate $\mathbf{z}_t$ for the interpolation is computed using the current input $\mathbf{x}_t$ and the previous state $\mathbf{h}_{t-1}$ (Equation 2.6), where $W_z$ and $U_z$ are parameter matrices to be learned during training and $\sigma$ is the sigmoid activation function $\sigma(x) = 1/(1 + e^{-x})$, applied element-wise to the vector entries. The current candidate state $\hat{\mathbf{h}}_t$ is computed based on the current input $\mathbf{x}_t$ and the previous state $\mathbf{h}_{t-1}$ (Equation 2.8), where $W$ and $U$ are parameter matrices, $\tanh$ is the hyperbolic tangent activation function, and $\mathbf{r}_t$ is the value of the reset gate, computed as in Equation 2.7 with parameter matrices $W_r$ and $U_r$. A schematic representation of the GRU can be found in Figure 2.1.

The advantage of using gated units such as the GRU or the long short-term memory (LSTM) [42] is their ability to better process longer sequences, which arises from their additive manipulation of the state vector and explicit filtering using gates. In the case of the GRU, the reset gate $\mathbf{r}_t$ determines which parts of the previous state $\mathbf{h}_{t-1}$ are "ignored" in the computation of the candidate state, and the update gate $\mathbf{z}_t$ determines how much of the previous state is "leaked" into the current state $\mathbf{h}_t$. The update gate could decide to forget the previous state altogether or to simply copy the previous state and ignore the current input. Both gates are parameterized (and thus trainable) and their values depend on both the input $\mathbf{x}_t$ and the previous state.
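The GRU equations translate almost line by line into code. The following NumPy sketch uses illustrative dimensions and randomly initialized parameters in place of trained ones:

```python
import numpy as np

d_i, d_h = 8, 16                                   # illustrative dimensions
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def mats():
    return 0.1 * np.random.randn(d_h, d_i), 0.1 * np.random.randn(d_h, d_h), np.zeros(d_h)

(W_z, U_z, b_z), (W_r, U_r, b_r), (W, U, b_h) = mats(), mats(), mats()

def gru_step(h_prev: np.ndarray, x_t: np.ndarray) -> np.ndarray:
    z = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)         # update gate (2.6)
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)         # reset gate (2.7)
    h_cand = np.tanh(W @ x_t + U @ (h_prev * r) + b_h)  # candidate state (2.8)
    return (1.0 - z) * h_prev + z * h_cand              # additive interpolation (2.9)
```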

The LSTM is similar but uses two "state" variables $\mathbf{c}_t$ and $\mathbf{y}_t$, where $\mathbf{c}_t$ is the "cell state" and $\mathbf{y}_t$ is the output state. The LSTM is defined according to the following equations:

i𝑑 =𝜎(π‘Š

𝑖x𝑑+π‘ˆ

𝑖yπ‘‘βˆ’1+b𝑖) (2.10)

f𝑑 =𝜎(π‘Š

𝑓x𝑑 +π‘ˆ

𝑓yπ‘‘βˆ’1+b𝑓) (2.11)

o𝑑 =𝜎(π‘Š

π‘œx𝑑+π‘ˆ

π‘œyπ‘‘βˆ’1+bπ‘œ) (2.12)

cˆ𝑑 =tanh(π‘Š

𝑐x𝑑 +π‘ˆ

𝑐yπ‘‘βˆ’1+b𝑐) (2.13)

c𝑑 =f𝑑 βŠ™cπ‘‘βˆ’1+iπ‘‘βŠ™cˆ𝑑 (2.14)

y𝑑 =oπ‘‘βŠ™tanh(c𝑑) , (2.15)

where theπ‘Š

Γ—andπ‘ˆ

Γ—are trainable matrices andbΓ—are trainable bias terms.

Note that in both, there is a mechanism for the network to "forget" parts of the previous state. Also, in both cases, state updates are additive, which allows the gradient to be backpropagated while being only element-wise multiplied with the forget gate vectors.
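Analogously, a minimal sketch of the LSTM update of Equations 2.10 to 2.15 (again with illustrative dimensions and random parameters), returning the new cell state and output state:

```python
import numpy as np

d_i, d_h = 8, 16                                   # illustrative dimensions
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def mats():
    return 0.1 * np.random.randn(d_h, d_i), 0.1 * np.random.randn(d_h, d_h), np.zeros(d_h)

(W_i, U_i, b_i), (W_f, U_f, b_f) = mats(), mats()
(W_o, U_o, b_o), (W_c, U_c, b_c) = mats(), mats()

def lstm_step(c_prev, y_prev, x_t):
    i = sigmoid(W_i @ x_t + U_i @ y_prev + b_i)       # input gate (2.10)
    f = sigmoid(W_f @ x_t + U_f @ y_prev + b_f)       # forget gate (2.11)
    o = sigmoid(W_o @ x_t + U_o @ y_prev + b_o)       # output gate (2.12)
    c_cand = np.tanh(W_c @ x_t + U_c @ y_prev + b_c)  # candidate cell state (2.13)
    c = f * c_prev + i * c_cand                       # additive cell update (2.14)
    y = o * np.tanh(c)                                # output state (2.15)
    return c, y
```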

Convolutional Neural Networks

While RNNs and Transformers are typically used, convolutional neural networks (CNNs) can also be applied to text. A 1D CNN is parameterized using a tensor $W \in \mathbb{R}^{l \times d \times h}$, where $l$ is the window size or kernel size, $d$ is the input dimension and $h$ the output dimension. The output is computed by sliding the weights $W$ over the length dimension of the input (the input is shaped as $\mathbb{R}^{s \times d}$), multiplying every $l \times d$ window of the input with each of the $h$ slices of size $l \times d$ of $W$ using element-wise multiplication, and summing up. The result is in $\mathbb{R}^{s \times h}$. Additionally, CNNs use non-linearities, where a ReLU is a standard choice, as well as local (and global) pooling.
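The sliding-window computation can be spelled out with explicit loops. The sketch below uses illustrative shapes and assumes "same" zero-padding so that the output keeps the input length $s$; the padding choice is an assumption, not prescribed by the text above.

```python
import numpy as np

s, d, l, h = 10, 8, 3, 16            # sequence length, input dim, kernel size, output dim
X = np.random.randn(s, d)            # input sequence, shape s x d
W = 0.1 * np.random.randn(l, d, h)   # convolution weights, shape l x d x h

# Zero-pad the length dimension so the output also has length s.
X_pad = np.pad(X, ((l // 2, l - 1 - l // 2), (0, 0)))

out = np.zeros((s, h))
for t in range(s):
    window = X_pad[t:t + l]          # l x d window of the input
    for j in range(h):
        # Element-wise product with the j-th l x d slice of W, then sum.
        out[t, j] = np.sum(window * W[:, :, j])

out = np.maximum(out, 0.0)           # ReLU non-linearity
```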

Attention

Originally, attention was proposed as a way to produce a latent alignment for neural machine translation [46, 47]. Since then, it has been a crucial component in many NLP applications and is the primary mechanism of context integration in transformers.

Most generally, given some input sequence $X$ of length $L$ of vectors $\mathbf{x}_i \in \mathbb{R}^{d}$ and a query vector $\mathbf{q} \in \mathbb{R}^{h}$, attention is a function that compares $\mathbf{q}$ to every $\mathbf{x}_i$ and builds a weighted sum of the $\mathbf{x}_i$ that can be trained to contain the most relevant information from $X$ pertaining to the given query $\mathbf{q}$. We can describe the attention mechanism using the following equations:

π‘Žπ‘–=comp(x𝑖,q) (2.16)

𝛼𝑖= 𝑒

π‘Ž 𝑖

Í𝐿

𝑖=0𝑒

π‘Žπ‘– (2.17)

v=

𝐿

βˆ‘οΈ

𝑖=0

𝛼𝑖·x𝑖 , (2.18)

where comp(Β·):R

𝑑×R

β„Žβ†¦β†’Ris a function comparing two vectors.
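A minimal sketch of Equations 2.16 to 2.18, here instantiating comp as a plain dot product (the multiplicative variant introduced below, so $d = h$), with illustrative dimensions:

```python
import numpy as np

L_len, d = 6, 16
X = np.random.randn(L_len, d)        # input vectors x_1 ... x_L
q = np.random.randn(d)               # query vector (d == h for the dot-product variant)

a = X @ q                            # comparison scores a_i       (2.16)
alpha = np.exp(a) / np.exp(a).sum()  # softmax-normalized weights  (2.17)
v = alpha @ X                        # weighted sum of the x_i     (2.18)
```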

One variant of attention uses a feedforward network on the concatenation of $\mathbf{x}_i$ and $\mathbf{q}$ [46]:

$$\mathrm{comp}(\mathbf{x}, \mathbf{q}) = \mathbf{w}^{T} \tanh(W\mathbf{x} + U\mathbf{q}) \qquad (2.19)$$

Another commonly used variant, usually called multiplicative attention, simply uses the dot product as $\mathrm{comp}(\cdot, \cdot)$:

$$\mathrm{comp}(\mathbf{x}, \mathbf{q}) = \mathbf{q}^{T}\mathbf{x} \,. \qquad (2.20)$$

The queryπ‘žand key vectorsπ‘₯can also be projected before the dot product:

comp(x,q) =(π‘Šq)T(π‘ˆx) , (2.21)

whereπ‘Š andπ‘ˆare trainable matrices.