2.2 Neural networks for sequence processing
2.2.2 Neural Network Components
Neural network architectures are constructed from different building blocks. In this section, we discuss the building blocks that are most relevant to NLP and our tasks.
Token representations
Natural language consists of discrete elements, such as words, which, as previously discussed, can be decomposed into sub-word units. In order to use such discrete elements in a model, we need to find vector representations for the tokens.
A simple vector representation for tokens that has long been in use is the one-hot vector representation. This representation can be formalized as a function $f_{\text{onehot}}: V \mapsto \mathbb{N}_0^{|V|}$ that maps any token $w$ from the vocabulary $V$ onto a vector $v$ of the size of the vocabulary, such that only the element in the vector corresponding to the token $w$ is equal to one and the others are equal to zero:
$$v_i = \begin{cases} 1, & \text{if } i = \mathrm{id}(w) \\ 0, & \text{otherwise,} \end{cases} \qquad (2.2)$$
where the function $\mathrm{id}(w): V \mapsto \mathbb{N}_0$ maps the word $w$ to its unique integer id.
Since the one-hot representation does not scale well in practice, and the distance between any two words is equal, tokens are usually represented as dense vectors of relatively low dimensionality. The token embedding function is a function $f_{\text{emb}}: V \mapsto \mathbb{R}^{d}$, where $d$ is some chosen dimension for the token embedding space. Note that such a dense embedding allows the vector representations of different words to be arbitrarily close to each other, which is useful for handling synonyms. In addition, the dimensionality of the vectors can be kept low and does not directly depend on the vocabulary size. The function $f_{\text{emb}}$ can be implemented as:
$$f_{\text{emb}}(w) = A \cdot v_w , \qquad (2.3)$$
where $v_w \in \mathbb{N}_0^{|V|}$ is the one-hot vector for word $w$ and $A \in \mathbb{R}^{d \times |V|}$ is a trainable matrix containing the embedding vectors for every word in $V$.
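As an illustration, Equations 2.2 and 2.3 can be sketched as follows; the toy vocabulary, the dimension $d = 4$, and the random (untrained) matrix `A` are assumptions for the example only:

```python
import numpy as np

# Illustrative sketch of Equations 2.2 and 2.3; vocabulary, dimension d,
# and the (untrained) random matrix A are assumptions for this example.
vocab = ["the", "cat", "sat"]
d = 4                                      # chosen embedding dimension
rng = np.random.default_rng(0)
A = rng.standard_normal((d, len(vocab)))   # trainable d x |V| matrix

def one_hot(w):
    """Equation 2.2: one-hot vector with a single 1 at position id(w)."""
    v = np.zeros(len(vocab))
    v[vocab.index(w)] = 1.0
    return v

def f_emb(w):
    """Equation 2.3: f_emb(w) = A @ v_w."""
    return A @ one_hot(w)

print(f_emb("cat"))  # dense 4-dimensional vector for "cat"
```

In practice, the matrix product reduces to selecting the column of $A$ at index $\mathrm{id}(w)$, which is how embedding lookups are typically implemented in deep learning frameworks.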
In the early days of deep learning for NLP, several foundational works investigated different ways of pre-training word vectors using large unsupervised corpora. Among the most commonly used are Word2Vec [39] and GloVe [40]. These methods provide initializations for word vector parameters that capture synonymy and other relations between words, induced from statistics on a large text corpus (see Section 2.4.1). Essentially, similar words obtain similar embeddings, which improves generalization to unobserved words. While the recent focus in NLP has shifted towards pretrained transformers (see Section 2.4.2), pretrained word embeddings are still used in applications due to their simplicity and efficiency.
Recurrent Neural Networks
Natural languages, as well as formal languages, are sequential in nature, which means that we require models that can process and generalize over different sequence lengths. Recurrent neural networks (RNNs) are a type of neural network suitable for this case. RNNs typically process the sequence one token at a time, and use a "state" variable/vector that describes the sequence elements "consumed" so far. Thus, the typical RNN is a parameterized function $f: \mathbb{R}^{d_x} \times \mathbb{R}^{d_h} \mapsto \mathbb{R}^{d_h}$ that computes a new state $h_t \in \mathbb{R}^{d_h}$ from the previous state $h_{t-1} \in \mathbb{R}^{d_h}$ and the current input $x_t \in \mathbb{R}^{d_x}$:
$$h_t = f(h_{t-1}, x_t; \theta) \qquad (2.4)$$
Chapter 2 Background and Preliminaries
Figure 2.1: Gated Recurrent Unit (GRU). [Diagram: inputs $x_t$ and $h_{t-1}$, weight matrices $W$, $U$, $W_r$, $U_r$, $W_z$, $U_z$, output $h_t$.]
RNNs usually share their parameters across time steps to improve generalization.
The most basic RNN can be implemented as a single layer with an affine transformation and non-linearity:
$$h_t = \tanh(U h_{t-1} + W x_t + b) , \qquad (2.5)$$
where $U$ and $W$ are trainable linear transformations and $b$ is a trainable bias term.
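A minimal sketch of the update in Equation 2.5, with illustrative dimensions and random (untrained) parameters, could look as follows:

```python
import numpy as np

# Sketch of the basic RNN update (Equation 2.5); dimensions and
# random initialization are illustrative assumptions, not from the text.
d_x, d_h = 3, 5
rng = np.random.default_rng(0)
U = rng.standard_normal((d_h, d_h))    # recurrent weights
W = rng.standard_normal((d_h, d_x))    # input weights
b = np.zeros(d_h)                      # bias term

def rnn_step(h_prev, x_t):
    """Equation 2.5: h_t = tanh(U h_{t-1} + W x_t + b)."""
    return np.tanh(U @ h_prev + W @ x_t + b)

# Unroll over a sequence; the same parameters are shared across time steps.
h = np.zeros(d_h)
for x_t in rng.standard_normal((7, d_x)):
    h = rnn_step(h, x_t)
print(h.shape)  # (5,)
```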
However, it has been shown that such basic RNNs suffer from gradient problems [41–44] due to the multiplicative updates and non-linearities in the backpropagation path to an early state. In practice, gated RNNs are used, such as the Long Short-Term Memory (LSTM) [42] unit or the Gated Recurrent Unit (GRU) [45]. Both implement additive gated state updates, which prevents the gradients from vanishing or exploding due to repeated linear transformations and non-linearities. The GRU is defined according to the following equations and is illustrated in Figure 2.1:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \qquad (2.6)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \qquad (2.7)$$
$$\tilde{h}_t = \tanh(W x_t + U(h_{t-1} \circ r_t) + b_h) \qquad (2.8)$$
$$h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t , \qquad (2.9)$$
where the $W_\ast$ and $U_\ast$ are trainable matrices and the $b_\ast$ are trainable bias terms.
The current hidden state $h_t$ at time step $t$ of the RNN (which is also its output at time $t$) is computed by interpolating between the state $h_{t-1}$ at the previous time step and the candidate state $\tilde{h}_t$ (Equation 2.9), with $z_t$ the update vector and $\circ$ the element-wise vector product. The update gate $z_t$ for the interpolation is computed using the current input $x_t$ and the previous state $h_{t-1}$ (Equation 2.6), where $W_z$ and $U_z$ are parameter matrices to be learned during training and $\sigma$ is the sigmoid activation function $\sigma(x) = 1/(1+e^{-x})$, applied element-wise to the vector entries. The current candidate state $\tilde{h}_t$ is computed based on the current input $x_t$ and the previous state $h_{t-1}$ (Equation 2.8), where $W$ and $U$ are parameter matrices, $\tanh$ is the hyperbolic tangent activation function, and $r_t$ is the value of the reset gate, computed as in Equation 2.7 with parameter matrices $W_r$ and $U_r$. A schematic representation of the GRU can be found in Figure 2.1.
The advantage of using gated units such as the GRU or the LSTM [42] is their ability to better process longer sequences, which arises from their additive manipulation of the state vector and explicit filtering using gates. In the case of the GRU, the reset gate $r_t$ determines which parts of the previous state $h_{t-1}$ are "ignored" in the computation of the candidate state, and the update gate $z_t$ determines how much of the previous state is "leaked" into the current state $h_t$. The update gate could decide to forget the previous state altogether, or to simply copy the previous state and ignore the current input. Both gates are parameterized (and thus trainable) and their values depend on both the input $x_t$ and the previous state.
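A single GRU step following Equations 2.6 to 2.9 can be sketched as below; the dimensions and the untrained random parameters are illustrative assumptions:

```python
import numpy as np

# Sketch of one GRU step (Equations 2.6-2.9); shapes and random
# initialization are illustrative assumptions.
d_x, d_h = 3, 4
rng = np.random.default_rng(0)
Wz, Wr, W = (rng.standard_normal((d_h, d_x)) for _ in range(3))
Uz, Ur, U = (rng.standard_normal((d_h, d_h)) for _ in range(3))
bz = br = bh = np.zeros(d_h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t):
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)             # update gate (2.6)
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)             # reset gate (2.7)
    h_tilde = np.tanh(W @ x_t + U @ (h_prev * r) + bh)   # candidate (2.8)
    return (1 - z) * h_prev + z * h_tilde                # additive interpolation (2.9)
```

Note that the new state is a convex combination of the previous state and the candidate, which is the additive update that keeps gradients well-behaved.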
The LSTM is similar, but uses two "state" variables $c_t$ and $y_t$, where $c_t$ is the "cell state" and $y_t$ is the output state. The LSTM is defined according to the following equations:
$$i_t = \sigma(W_i x_t + U_i y_{t-1} + b_i) \qquad (2.10)$$
$$f_t = \sigma(W_f x_t + U_f y_{t-1} + b_f) \qquad (2.11)$$
$$o_t = \sigma(W_o x_t + U_o y_{t-1} + b_o) \qquad (2.12)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c y_{t-1} + b_c) \qquad (2.13)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \qquad (2.14)$$
$$y_t = o_t \circ \tanh(c_t) , \qquad (2.15)$$
where the $W_\ast$ and $U_\ast$ are trainable matrices and the $b_\ast$ are trainable bias terms.
Note that in both architectures, there is a mechanism for the network to "forget" parts of the previous state. Also, in both cases, state updates are additive, which allows the gradient to be backpropagated while only being element-wise multiplied with the forget gate vectors.
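Analogously to the GRU, one LSTM step following Equations 2.10 to 2.15 can be sketched as below, again with illustrative shapes and untrained random parameters:

```python
import numpy as np

# Sketch of one LSTM step (Equations 2.10-2.15); parameter shapes and
# random initialization are illustrative assumptions.
d_x, d_h = 3, 4
rng = np.random.default_rng(0)
Wi, Wf, Wo, Wc = (rng.standard_normal((d_h, d_x)) for _ in range(4))
Ui, Uf, Uo, Uc = (rng.standard_normal((d_h, d_h)) for _ in range(4))
bi = bf = bo = bc = np.zeros(d_h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(c_prev, y_prev, x_t):
    i = sigmoid(Wi @ x_t + Ui @ y_prev + bi)        # input gate   (2.10)
    f = sigmoid(Wf @ x_t + Uf @ y_prev + bf)        # forget gate  (2.11)
    o = sigmoid(Wo @ x_t + Uo @ y_prev + bo)        # output gate  (2.12)
    c_tilde = np.tanh(Wc @ x_t + Uc @ y_prev + bc)  # candidate    (2.13)
    c = f * c_prev + i * c_tilde                    # additive cell update (2.14)
    y = o * np.tanh(c)                              # output state (2.15)
    return c, y
```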
Convolutional Neural Networks
While typically RNNs and transformers are used, convolutional neural networks (CNNs) can also be applied to text. A 1D CNN is parameterized using a tensor $F \in \mathbb{R}^{k \times d \times h}$, where $k$ is the window size or kernel size, $d$ is the input dimension and $h$ the output dimension. The output is computed by sliding the weights $F$ over the length dimension of the input (which is shaped as $\mathbb{R}^{L \times d}$), multiplying every $k \times d$ window from the input with every $h$-slice of size $k \times d$ of $F$ using element-wise multiplication, and summing up. The result is in $\mathbb{R}^{L \times h}$. Additionally, CNNs use non-linearities, where a ReLU is a standard choice, as well as local (and global) pooling.
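The sliding-window computation can be sketched as follows; the shapes, the zero-padding that keeps the output length equal to $L$, and the `einsum` formulation are choices made for this example:

```python
import numpy as np

# Sketch of a 1D convolution over a token sequence; shapes and the
# "same"-length zero-padding are illustrative assumptions.
L, d, h, k = 6, 3, 5, 3              # seq. length, input dim, output dim, kernel size
rng = np.random.default_rng(0)
F = rng.standard_normal((k, d, h))   # kernel tensor
X = rng.standard_normal((L, d))      # input sequence of token vectors

pad = k // 2
Xp = np.pad(X, ((pad, pad), (0, 0)))       # zero-pad the length dimension
# For each position t: multiply the k x d window with each k x d slice
# of F element-wise and sum, yielding an h-dimensional output vector.
out = np.stack([np.einsum("kd,kdh->h", Xp[t:t + k], F) for t in range(L)])
out = np.maximum(out, 0)                   # ReLU non-linearity
print(out.shape)  # (6, 5)
```

Global max-pooling over the length dimension (`out.max(axis=0)`) would then produce a fixed-size representation of the whole sequence.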
Attention
Originally, attention was proposed as a way to produce a latent alignment for neural machine translation [46,47]. Since then, it has been a crucial component in many NLP applications and is the primary mechanism of context integration in transformers.
Most generally, given some input sequence $X$ of length $L$ of vectors $x_i \in \mathbb{R}^{d}$ and a query vector $q \in \mathbb{R}^{h}$, attention is a function that compares $q$ to every $x_i$ and builds a weighted sum of the $x_i$ that can be trained to contain the most relevant information from $X$ pertaining to the given query $q$. We can describe the attention mechanism using the following equations:
$$a_i = \mathrm{comp}(x_i, q) \qquad (2.16)$$
$$\alpha_i = \frac{e^{a_i}}{\sum_{j=0}^{L} e^{a_j}} \qquad (2.17)$$
$$v = \sum_{i=0}^{L} \alpha_i \cdot x_i , \qquad (2.18)$$
where $\mathrm{comp}(\cdot): \mathbb{R}^{d} \times \mathbb{R}^{h} \mapsto \mathbb{R}$ is a function comparing two vectors.
One variant of attention uses a feedforward network on the concatenation of $x_i$ and $q$ [46]:
$$\mathrm{comp}(x, q) = w^T \tanh(W x + U q) \qquad (2.19)$$
Another commonly used variant, usually called multiplicative attention, simply uses the dot product as $\mathrm{comp}(\cdot)$:
$$\mathrm{comp}(x, q) = q^T x . \qquad (2.20)$$
The query $q$ and key vectors $x$ can also be projected before taking the dot product:
$$\mathrm{comp}(x, q) = (W q)^T (U x) , \qquad (2.21)$$
where $W$ and $U$ are trainable matrices.
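The full mechanism, using the projected dot-product comparison of Equation 2.21, can be sketched as follows. All shapes and parameters are illustrative; the softmax subtracts the maximum score for numerical stability, a standard implementation trick not part of the equations above:

```python
import numpy as np

# Sketch of attention (Equations 2.16-2.18) with projected dot-product
# comparison (Equation 2.21); shapes and parameters are illustrative.
L, d, h = 4, 3, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((L, d))      # input sequence x_1..x_L
q = rng.standard_normal(h)           # query vector
W = rng.standard_normal((h, h))      # query projection
U = rng.standard_normal((h, d))      # key projection

a = (U @ X.T).T @ (W @ q)            # scores a_i = (Wq)^T (Ux_i)  (2.16, 2.21)
alpha = np.exp(a - a.max())          # stabilized exponentials
alpha /= alpha.sum()                 # softmax weights             (2.17)
v = alpha @ X                        # weighted sum of the x_i     (2.18)
print(v.shape)  # (3,)
```

Because the weights $\alpha_i$ sum to one, the output $v$ is a convex combination of the input vectors, concentrated on the inputs most similar to the (projected) query.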