Chapter 2 Background and Preliminaries
2.2 Neural networks for sequence processing
2.2.3 Neural Architectures for NLP
the attention mechanism using the following equations:

$e_i = \mathrm{comp}(\mathbf{x}_i, \mathbf{q})$ (2.16)
$\alpha_i = \frac{e^{e_i}}{\sum_{j=0}^{L} e^{e_j}}$ (2.17)
$\mathbf{v} = \sum_{i=0}^{L} \alpha_i \cdot \mathbf{x}_i$ , (2.18)

where $\mathrm{comp}(\cdot) : \mathbb{R}^{d} \times \mathbb{R}^{d} \mapsto \mathbb{R}$ is a function comparing two vectors.
One variant of attention uses a feedforward network on the concatenation of $\mathbf{x}_i$ and $\mathbf{q}$ [46]:

$\mathrm{comp}(\mathbf{x}, \mathbf{q}) = \mathbf{w}^T \tanh(W\mathbf{x} + U\mathbf{q})$ (2.19)

Another commonly used variant, usually called multiplicative attention, simply uses the dot product as comp(·):

$\mathrm{comp}(\mathbf{x}, \mathbf{q}) = \mathbf{q}^T\mathbf{x}$ . (2.20)

The query $\mathbf{q}$ and key vectors $\mathbf{x}_i$ can also be projected before the dot product:

$\mathrm{comp}(\mathbf{x}, \mathbf{q}) = (W\mathbf{q})^T(U\mathbf{x})$ , (2.21)

where $W$ and $U$ are trainable matrices.
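To make the mechanism concrete, Eqs. 2.16-2.18 can be sketched in a few lines of NumPy (a minimal sketch; the names `attend`, `softmax` and `dot_comp` are our own):

```python
import numpy as np

def softmax(e):
    # Numerically stable softmax turning scores e_i into weights alpha_i (Eq. 2.17).
    e = e - e.max()
    w = np.exp(e)
    return w / w.sum()

def attend(X, q, comp):
    # X: (L, d) matrix of input vectors x_i; q: (d,) query vector.
    scores = np.array([comp(x, q) for x in X])  # e_i = comp(x_i, q)    (Eq. 2.16)
    alpha = softmax(scores)                     # attention weights     (Eq. 2.17)
    return alpha @ X                            # v = sum_i alpha_i x_i (Eq. 2.18)

def dot_comp(x, q):
    # Multiplicative attention (Eq. 2.20): the dot product as comp.
    return q @ x
```

With `dot_comp`, a query close to one of the input vectors yields a summary dominated by that vector.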
Here, $W_o \in \mathbb{R}^{N_r \times d}$ and $\mathbf{b}_o \in \mathbb{R}^{N_r}$, in conjunction with the parameters of the encoder network, constitute the trainable parameters of the classification model. In the output layer, the classification model typically turns the score vector into a conditional probability distribution $p(r_i|x)$ over the $N_r$ relations based on a softmax function, given by

$p(r_i|x) = \frac{e^{s_i(x)}}{\sum_{j=1}^{N_r} e^{s_j(x)}}$ , (2.23)

for $i = 1, \ldots, N_r$. Classification is then performed by picking the relation with the highest probability given $x$.
A regression model can be obtained simply by replacing the softmax output layer of the classifier with an MLP whose last layer outputs a single scalar value.
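The classification output layer (Eq. 2.23) and its argmax prediction are small enough to sketch directly (a minimal NumPy sketch; `classify` and `predict` are our own names, and `h` stands for the representation already produced by the encoder):

```python
import numpy as np

def classify(h, W_o, b_o):
    # Scores s(x) = W_o h + b_o, turned into p(r_i | x) by a softmax (Eq. 2.23).
    s = W_o @ h + b_o
    s = s - s.max()  # numerical stability
    p = np.exp(s)
    return p / p.sum()

def predict(h, W_o, b_o):
    # Pick the relation with the highest probability given the input.
    return int(np.argmax(classify(h, W_o, b_o)))
```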
Sequence tagging: A prime example of a sequence tagging task is part-of-speech (POS) tagging, where every word in a given sequence has to be annotated with a label from a predefined set of POS tags (e.g. noun, verb, . . . ). Similarly to sequence classification, we can solve this task using an encoder, but this time we generate a representation vector for every token rather than one vector for the entire sequence. We write the representation vector for the token at position $t$ in the sequence as $\mathbf{h}_t$. It is desirable to use an encoder because we want to take into account the context of the words when building their representations, in order to improve prediction accuracy. Then, similarly to classification, we can use a classifier (similar to Eq. 2.22), but this time at every position rather than just once for the entire sequence. The result is a sequence of tags $y_t$ of the same length as the input sequence, where $y_t$ corresponds to the input token $x_t$.
Sequence generation and translation: In machine translation and semantic parsing, we need to map the input sequence $x = (x_1, x_2, \ldots, x_T)$ ($x_t \in \mathcal{V}_{in}$) to an output sequence (or tree/graph) $y = (y_1, y_2, \ldots, y_L)$ ($y_i \in \mathcal{V}_{out}$). Language modeling and translation are both sequence prediction tasks, with the difference that language modeling usually is not conditioned on some input, while translation takes another sequence as the input. Sequence generation is typically done in a left-to-right fashion, where the next prediction takes into account all previously predicted tokens.6 This factorizes the joint probability of the output sequence $p(y|x; \theta)$ into the product $\prod_{i=1}^{L} p(y_i | y_{<i}, x; \theta)$.
Translation can be accomplished by first encoding the input sequence using some encoder, similar to the sequence classification and tagging settings. We then take the encoding of the entire sequence, as in sequence classification, and feed it to the decoder to condition the decoder on the input. If the decoder uses attention (see below), we also provide the decoder with the encoder's representations for every input position. Unlike the previous tasks, we need a decoder model that can be conditioned on some input vectors and generates a sequence (or another structure). Usually, a decoder model is used that produces the next token in the output sequence at every step. At every step, it takes information about the input, as well as information about what has been decoded so far, and generates the distribution for the next token using a softmax output layer (Eqs. 2.22, 2.23). In greedy decoding, the most probable token from this distribution is simply taken and appended to the output.
Subsequently, the process is repeated until termination. The decoding loop can be terminated using special end-of-sequence (EOS) tokens: when the decoder produces an EOS token, the recurrence is stopped and the sequence produced up until the EOS token is returned.

6 As we shall see in this work, however, this is not the only viable method for decoding sequences (see Section 7).
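The greedy decoding loop with EOS termination described above can be sketched as follows (`decoder_step` is a hypothetical stand-in for the conditioned decoder model, not a function defined in this chapter):

```python
def greedy_decode(encoded, decoder_step, eos_id, max_len=50):
    # Repeatedly pick the most probable next token until EOS (or max_len).
    output = []
    for _ in range(max_len):
        probs = decoder_step(encoded, output)  # distribution over V_out
        token = max(range(len(probs)), key=probs.__getitem__)  # greedy argmax
        if token == eos_id:
            break  # EOS stops the recurrence; return what was produced so far
        output.append(token)
    return output
```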
RNN-based architectures
RNN-based architectures were the most common choice before transformers emerged. RNNs can be used both in encoder-only and sequence-to-sequence settings.
Encoding sequences using RNNs: First, the input sequence of length $L$ of symbols from $\mathcal{V}$ is projected to a sequence $X$ of vectors $\mathbf{x}_t \in \mathbb{R}^{d_{emb}}$ using an embedding layer.
Subsequently, the sequence $X$ is encoded using an RNN layer:

$\mathbf{h}_t = \mathrm{RNN}(\mathbf{x}_t, \mathbf{h}_{t-1})$ . (2.24)

This yields a sequence $H$ of vectors $\mathbf{h}_t \in \mathbb{R}^{d_{RNN}}$. However, since this way of encoding $X$ only takes into account the tokens preceding any token $x_t$, usually bidirectional RNN encoders are used that encode the sequence in parallel in both forward and reverse directions:
$\mathbf{h}^{FWD}_t = \mathrm{RNN}^{FWD}(\mathbf{x}_t, \mathbf{h}^{FWD}_{t-1})$ (2.25)
$\mathbf{h}^{REV}_t = \mathrm{RNN}^{REV}(\mathbf{x}_t, \mathbf{h}^{REV}_{t+1})$ (2.26)
$\mathbf{h}_t = [\mathbf{h}^{FWD}_t, \mathbf{h}^{REV}_t]$ , (2.27)

where $\mathrm{RNN}^{FWD}$ and $\mathrm{RNN}^{REV}$ are two different RNNs, the latter consuming the sequence of vectors $\mathbf{x}_t$ in right-to-left order. The output $\mathbf{h}_t$ is simply the concatenation of the outputs of the two RNNs for a given position in the input.
Such encoders are commonly used for sequence tagging and classification in various NLP tasks.
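Assuming a simple Elman-style cell for RNN(·) (the text leaves the cell abstract), Eqs. 2.25-2.27 can be sketched as follows (a minimal NumPy sketch; all names are our own):

```python
import numpy as np

def rnn_step(x, h, W_x, W_h):
    # One Elman-style RNN step: h_t = tanh(W_x x_t + W_h h_prev).
    return np.tanh(W_x @ x + W_h @ h)

def bi_encode(X, fwd, rev, d_rnn):
    # X: (L, d_emb). Run one RNN left-to-right and another right-to-left,
    # then concatenate their states per position (Eqs. 2.25-2.27).
    L = X.shape[0]
    h_fwd = np.zeros((L, d_rnn))
    h_rev = np.zeros((L, d_rnn))
    h = np.zeros(d_rnn)
    for t in range(L):                 # forward direction
        h = rnn_step(X[t], h, *fwd)
        h_fwd[t] = h
    h = np.zeros(d_rnn)
    for t in reversed(range(L)):       # reverse direction
        h = rnn_step(X[t], h, *rev)
        h_rev[t] = h
    return np.concatenate([h_fwd, h_rev], axis=1)  # h_t = [h_fwd_t, h_rev_t]
```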
Translating sequences using RNNs: Recall that the sequence-to-sequence model only has to learn to predict the correct token at every decoding step to generate the output sequence, i.e. it learns the distribution $p(y_i | y_{<i}, x)$. A simple way to accomplish this using RNNs is to first encode the input sequence (see above), initialize the decoder RNN with the input representation, and run the decoder RNN:
$H = (\mathbf{h}_t)_{t \in [0,T] \cap \mathbb{N}_0} = \mathrm{Encoder}(x)$ (2.28)
$\mathbf{g}_0 = \mathbf{h}_T$ (2.29)
$\mathbf{u}_i = \mathrm{Embed}(y_{i-1})$ (2.30)
$\mathbf{g}_i = \mathrm{RNN}(\mathbf{u}_i, \mathbf{g}_{i-1})$ (2.31)
$p(y_i | y_{<i}, x) = \mathrm{Classifier}(\mathbf{g}_i)$ (2.32)
$\hat{y}_i = \mathrm{argmax}(p(y_i | y_{<i}, x))$ , (2.33)
where Embed(·) is a token embedding layer, RNN(·) is a (multi-layer) RNN, and Classifier(·) is an MLP followed by a softmax that produces the probabilities over the output vocabulary $\mathcal{V}_{out}$.
A big problem with the naive sequence-to-sequence model described in Eqs. 2.28-2.33 is that all the information about the input sequence must be summarized in a single vector ($\mathbf{g}_0 = \mathbf{h}_T$), which introduces a bottleneck. Using attention solves this problem by allowing every $\mathbf{g}_i$ to be directly informed by the most relevant $\mathbf{h}_t$. An attention-based decoder can be described as follows:
$\mathbf{s}_i = \mathrm{Attend}(H, \mathbf{g}_i)$ (2.34)
$\hat{\mathbf{g}}_i = [\mathbf{g}_i, \mathbf{s}_i]$ (2.35)
$p(y_i | y_{<i}, x) = \mathrm{Classifier}(\hat{\mathbf{g}}_i)$ , (2.36)
where Attend(·) is an attention mechanism that takes a sequence of input vectors ($H$ here) and a query vector ($\mathbf{g}_i$ here) and produces a summary of the input vectors, as described above.
Transformers
Transformers [18] are an alternative to RNNs for sequence processing. In contrast to RNNs, transformers allow for a higher degree of parallelism during training and offer better interpretability. Especially with the emergence of a wide variety of pre-trained language models, transformers are now ubiquitous in NLP.
Transformer Encoder Layer: A single transformer layer for an encoder consists of two parts: (1) a self-attention layer and (2) a two-layer MLP. The self-attention layer collects information from neighbouring tokens using a multi-head attention mechanism that attends from every position to every position. With $\mathbf{h}^{(l-1)}_t$ being the outputs of the previous layer for position $t$, the multi-head self-attention mechanism computes $n$ different attention distributions (one for every head) over the $T$ positions. For every head, self-attention is computed as follows:

$\mathbf{q}_t = W_Q \mathbf{h}_t \qquad \mathbf{k}_t = W_K \mathbf{h}_t \qquad \mathbf{v}_t = W_V \mathbf{h}_t$ (2.37)
$e_{t,i} = \frac{\mathbf{q}_t^T \mathbf{k}_i}{\sqrt{d_k}} \qquad \alpha_{t,i} = \frac{e^{e_{t,i}}}{\sum_{j=0}^{T} e^{e_{t,j}}} \qquad \mathbf{s}_t = \sum_{i=0}^{T} \alpha_{t,i} \cdot \mathbf{v}_i$ , (2.38)

where $W_Q$, $W_K$ and $W_V$ are trainable matrices. Note that attention is normalized by the square root of the dimension $d_k$ of the $\mathbf{k}_t$ vectors.
The summaries $\mathbf{s}^{(i)}_t$ for head $i$, as computed above, are concatenated and passed through a linear transformation to get the final self-attention vector $\mathbf{u}_t$:

$\mathbf{u}_t = W_O \bigoplus_{i=0}^{n} \mathbf{s}^{(i)}_t$ , (2.39)

where $W_O$ is a trainable matrix and $\bigoplus$ denotes concatenation.
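Eqs. 2.37-2.39 can be sketched head by head in NumPy (a minimal, unbatched sketch; the shapes and names are our own choices):

```python
import numpy as np

def softmax(E):
    # Row-wise numerically stable softmax.
    E = E - E.max(axis=-1, keepdims=True)
    W = np.exp(E)
    return W / W.sum(axis=-1, keepdims=True)

def multi_head_self_attention(H, W_Q, W_K, W_V, W_O):
    # H: (T, d); W_Q, W_K, W_V: (n, d_k, d), one projection per head;
    # W_O: (d, n*d_k) mixes the concatenated head summaries.
    n, d_k, _ = W_Q.shape
    heads = []
    for i in range(n):
        Q = H @ W_Q[i].T                      # q_t = W_Q h_t        (Eq. 2.37)
        K = H @ W_K[i].T
        V = H @ W_V[i].T
        A = softmax(Q @ K.T / np.sqrt(d_k))   # alpha_{t,i}          (Eq. 2.38)
        heads.append(A @ V)                   # summaries s_t for this head
    S = np.concatenate(heads, axis=1)         # concatenate head summaries
    return S @ W_O.T                          # u_t = W_O [...]      (Eq. 2.39)
```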
The whole transformer layer then becomes:

$(\mathbf{u}_t)_{t \in [0,T] \cap \mathbb{N}_0} = \mathrm{Attention}\big((\mathbf{h}^{(l-1)}_t)_{t \in [0,T] \cap \mathbb{N}_0}\big)$ (2.40)
$\hat{\mathbf{h}}^{(l)}_t = \mathbf{h}^{(l-1)}_t + \mathbf{u}_t$ (2.41)
$\mathbf{v}_t = W_A \, \mathrm{ReLU}(W_B \hat{\mathbf{h}}^{(l)}_t)$ (2.42)
$\mathbf{h}^{(l)}_t = \hat{\mathbf{h}}^{(l)}_t + \mathbf{v}_t$ , (2.43)
where $W_A$ and $W_B$ are trainable matrices. Note that we did not discuss normalization layers and dropout. Different placements of these layers are possible, for example placing layer normalization at the beginning of the residual branch and dropout at the end of the residual branch.
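Putting Eqs. 2.40-2.43 together, the whole layer is the attention summary plus two residual additions around the MLP. A minimal NumPy sketch, omitting layer normalization and dropout exactly as the equations do (all names are our own):

```python
import numpy as np

def transformer_encoder_layer(H, attention, W_A, W_B):
    # H: (T, d) previous-layer outputs; attention: any map (T, d) -> (T, d);
    # W_B: (d_ff, d) and W_A: (d, d_ff) are the MLP's trainable matrices.
    U = attention(H)                          # (Eq. 2.40)
    H1 = H + U                                # first residual   (Eq. 2.41)
    V = np.maximum(0.0, H1 @ W_B.T) @ W_A.T   # W_A ReLU(W_B h)  (Eq. 2.42)
    return H1 + V                             # second residual  (Eq. 2.43)
```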
Transformer Decoder Layer: The decoder should take into account the previously decoded tokens, as well as the encoded input. To do so, two changes are made to the presented transformer encoder layers: (1) a causal self-attention mask is used in self-attention and (2) a cross-attention layer is added.
The causal self-attention mask is necessary to prevent the decoder from attending to future tokens, which are available during training but not at test time. As such, this is more of a practical change.
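A sketch of such a mask, under the common convention of adding it to the attention scores before the softmax (the helper name is our own):

```python
import numpy as np

def causal_mask(T):
    # Additive mask with -inf above the diagonal: after adding it to the
    # scores and applying the softmax, position t cannot attend to positions > t.
    return np.where(np.triu(np.ones((T, T)), k=1) == 1, -np.inf, 0.0)

# Applied as: A = softmax(Q @ K.T / np.sqrt(d_k) + causal_mask(T))
```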
The cross-attention layer enables the decoder to attend to the encoded input. Denoting the encoded input sequence vectors as $\mathbf{x}_j$, cross-attention is implemented in the same way as self-attention, except that the vectors used for computing the key and value vectors are the $\mathbf{x}_j$ vectors. The decoder layer then becomes:
$(\mathbf{u}_t)_{t \in [0,T] \cap \mathbb{N}_0} = \mathrm{Attention}\big((\mathbf{h}^{(l-1)}_t)_{t \in [0,T] \cap \mathbb{N}_0}\big)$ (2.44)
$\hat{\mathbf{h}}^{(l)}_t = \mathbf{h}^{(l-1)}_t + \mathbf{u}_t$ (2.45)
$(\mathbf{z}_t)_{t \in [0,T] \cap \mathbb{N}_0} = \mathrm{Attention}\big((\mathbf{x}_j)_{j \in [0,T_x] \cap \mathbb{N}_0}, (\hat{\mathbf{h}}^{(l)}_t)_{t \in [0,T] \cap \mathbb{N}_0}\big)$ (2.46)
$\tilde{\mathbf{h}}^{(l)}_t = \hat{\mathbf{h}}^{(l)}_t + \mathbf{z}_t$ (2.47)
$\mathbf{v}_t = W_A \, \mathrm{ReLU}(W_B \tilde{\mathbf{h}}^{(l)}_t)$ (2.48)
$\mathbf{h}^{(l)}_t = \tilde{\mathbf{h}}^{(l)}_t + \mathbf{v}_t$ . (2.49)
Position information: The self-attention layers are oblivious to the ordering of the key vectors.
For this reason, it is essential to explicitly add position information to the model. One option is to add non-trainable sinusoid vectors to the token embeddings before feeding them into the transformer layers [18]. Another option is to learn position embeddings and add these to the token embeddings before feeding them into the transformer layers. This is done in BERT [7], for example. These two options use absolute position information. In contrast, Shaw et al. [48] propose a relative position encoding method that is invariant to the absolute position in the sequence.
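The sinusoid option of [18] can be sketched as follows (a minimal NumPy sketch assuming an even embedding dimension; the function name is our own):

```python
import numpy as np

def sinusoid_positions(T, d):
    # Fixed sinusoid position vectors as in [18]: for position t and dimension
    # pair (2i, 2i+1), take sin and cos of t / 10000^(2i/d).
    pos = np.arange(T)[:, None]       # positions, shape (T, 1)
    i = np.arange(0, d, 2)[None, :]   # even dimension indices, shape (1, d/2)
    angles = pos / np.power(10000.0, i / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# These vectors are added to the token embeddings before the first layer.
```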