
The whole transformer layer then becomes:

$$(\mathbf{u}_t)_{t \in [0,T] \cap \mathbb{N}_0} = \mathrm{Attention}\big( (\mathbf{h}_t^{(l-1)})_{t \in [0,T] \cap \mathbb{N}_0} \big) \qquad (2.40)$$

$$\hat{\mathbf{h}}_t^{(l)} = \mathbf{h}_t^{(l-1)} + \mathbf{u}_t \qquad (2.41)$$

$$\mathbf{v}_t = W_A\, \mathrm{ReLU}(W_B \hat{\mathbf{h}}_t^{(l)}) \qquad (2.42)$$

$$\mathbf{h}_t^{(l)} = \hat{\mathbf{h}}_t^{(l)} + \mathbf{v}_t \,, \qquad (2.43)$$

whereπ‘Š

𝐴andπ‘Š

𝐡 are trainable matrices. Note that we did not discuss normalization layers and dropout. Different placements of these layers are possible, for example, placing the layer normalization in the beginning of the residual branch and placing dropout at the end of the residual branch.

Transformer Decoder Layer: The decoder should take into account the previously decoded tokens, as well as the encoded input. To do so, two changes are made to the transformer encoder layer presented above: (1) a causal self-attention mask is used in self-attention and (2) a cross-attention layer is added.

The causal self-attention mask is necessary to prevent the decoder from attending to future tokens, which are available during training but not at test time. As such, this is more of a practical change.

The cross-attention layer enables the decoder to attend to the encoded input. Denoting the encoded input sequence vectors as $\mathbf{x}_i$, cross-attention is implemented in the same way as self-attention, except that the key and value vectors are computed from the $\mathbf{x}_i$ vectors. The decoder layer then becomes:

$$(\mathbf{u}_t)_{t \in [0,T] \cap \mathbb{N}_0} = \mathrm{Attention}\big( (\mathbf{h}_t^{(l-1)})_{t \in [0,T] \cap \mathbb{N}_0} \big) \qquad (2.44)$$

$$\hat{\mathbf{h}}_t^{(l)} = \mathbf{h}_t^{(l-1)} + \mathbf{u}_t \qquad (2.45)$$

$$(\mathbf{z}_t)_{t \in [0,T] \cap \mathbb{N}_0} = \mathrm{Attention}\big( (\mathbf{x}_i)_{i \in [0,T_x] \cap \mathbb{N}_0},\ (\hat{\mathbf{h}}_t^{(l)})_{t \in [0,T] \cap \mathbb{N}_0} \big) \qquad (2.46)$$

$$\tilde{\mathbf{h}}_t^{(l)} = \hat{\mathbf{h}}_t^{(l)} + \mathbf{z}_t \qquad (2.47)$$

$$\mathbf{v}_t = W_A\, \mathrm{ReLU}(W_B \tilde{\mathbf{h}}_t^{(l)}) \qquad (2.48)$$

$$\mathbf{h}_t^{(l)} = \tilde{\mathbf{h}}_t^{(l)} + \mathbf{v}_t \,. \qquad (2.49)$$
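The following is a minimal sketch of such a decoder layer, assuming PyTorch and its built-in nn.MultiheadAttention; the class name and hyperparameters are illustrative, and normalization and dropout are omitted, as in the equations above.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Illustrative decoder layer following Eqs. (2.44)-(2.49)."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.W_B = nn.Linear(d_model, d_ff)   # W_B in Eq. (2.48)
        self.W_A = nn.Linear(d_ff, d_model)   # W_A in Eq. (2.48)

    def forward(self, h, x, causal_mask):
        u, _ = self.self_attn(h, h, h, attn_mask=causal_mask)  # Eq. (2.44)
        h_hat = h + u                                          # Eq. (2.45)
        z, _ = self.cross_attn(h_hat, x, x)                    # Eq. (2.46): keys/values from x
        h_tilde = h_hat + z                                    # Eq. (2.47)
        v = self.W_A(torch.relu(self.W_B(h_tilde)))            # Eq. (2.48)
        return h_tilde + v                                     # Eq. (2.49)

# Causal mask: position t may only attend to positions <= t.
T = 5
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
```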

Position information: The self-attention layers are oblivious to the ordering of the key vectors.

For this reason, it is essential to explicitly add position information to the model. One option is to add non-trainable sinusoidal vectors to the token embeddings before feeding them into the transformer layers [18]. Another option is to learn position embeddings and add these to the token embeddings before feeding them into the transformer layers. This is done in BERT [7], for example. These two options use absolute position information. In contrast to this, Shaw et al. [48] propose a relative position encoding method that is invariant to the absolute position in the sequence.
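For concreteness, here is a small sketch of the sinusoidal variant of [18] (assuming PyTorch and an even model dimension; the function name is hypothetical):

```python
import torch

def sinusoidal_positions(T: int, d_model: int) -> torch.Tensor:
    """Non-trainable sinusoidal position vectors as in [18]:
    PE[t, 2i] = sin(t / 10000^(2i/d)), PE[t, 2i+1] = cos(t / 10000^(2i/d))."""
    t = torch.arange(T, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)    # (d/2,)
    angles = t / torch.pow(10000.0, i / d_model)            # (T, d/2) by broadcasting
    pe = torch.zeros(T, d_model)                            # assumes even d_model
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Added to the token embeddings before the first transformer layer, e.g.:
# h0 = token_embeddings + sinusoidal_positions(T, d_model)
```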

2.3 Training

2.3.1 Maximizing Likelihood

Maximum Likelihood Estimation (MLE) is a training method that maximizes the likelihood function $\mathcal{L}(\theta \mid \mathcal{D})$, where $\theta$ are the parameters and $\mathcal{D}$ the data. For discrete data, the likelihood of the parameters given the data is equal to the probability of the data given the parameters, $\mathcal{L}(\theta \mid \mathcal{D}) = p(\mathcal{D} \mid \theta)$, with the difference that for the likelihood the parameters are varied, while for the probability the data are varied. Essentially, MLE finds the parameters of a given model such that the probability of the observed data under this model is maximized:

πœƒβˆ—=argmaxπœƒβˆˆΞ˜L (πœƒ|D)=argmaxπœƒβˆˆΞ˜π‘(D |πœƒ) . (2.50) In the case of discriminative models, which map some inputπ‘₯(e.g. a sentence) to an output𝑦(a e.g. a sentiment label), MLE can be written as follows:

πœƒβˆ—=argmaxπœƒβˆˆΞ˜E(π‘₯ , 𝑦)∼DL (πœƒ|π‘₯ , 𝑦) =argmaxπœƒβˆˆΞ˜E(π‘₯ , 𝑦)∼D𝑝(𝑦|π‘₯;πœƒ) . (2.51) Consider a simple multi-class classification network. In that case, 𝑝(𝑦|π‘₯;πœƒ) is the probability distribution over all possible classes produced by the model for a given input π‘₯ and parameters πœƒ. Maximizing the likelihood function is then maximizing the value of 𝑝(𝑦|π‘₯;πœƒ) for the correct class. In practice, the log-likelihood is maximized due to better numerical stability. Maximizing the log-likelihood corresponds to minimizing the cross-entropy between 𝑝(𝑦|π‘₯;πœƒ) and the β€œtrue”

distribution where 100% probability is concentrated in the correct class. Minimizing the cross-entropy is also equivalent to minimizing the Kullback-Leibler (KL) divergence between the two distributions.
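To make this equivalence concrete, here is a small numerical check (a sketch assuming PyTorch; the tensors and shapes are made up for illustration): the negative log-likelihood of the correct classes equals the cross-entropy against the one-hot target distribution.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)            # model outputs for 4 examples, 10 classes
targets = torch.tensor([3, 1, 7, 0])   # correct class indices

# Negative log-likelihood of the correct classes under p(y|x; theta) ...
log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs[torch.arange(4), targets].mean()

# ... equals the cross-entropy against the one-hot "true" distribution.
ce = F.cross_entropy(logits, targets)
assert torch.allclose(nll, ce)
```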

Training classifiers

Given a data set of $N$ pairs of NLQs and single-fact-based formal queries $\mathcal{D} = \{ (q^{(i)}, f^{(i)}) \}_{i=1}^{N}$ (here, $f^{(i)}$ is an entity-relation tuple: $f^{(i)} = (e^{(i)}, r^{(i)})$), a relation classification model can be trained by maximizing the log-likelihood of the model parameters $\theta$, which is given by

$$\sum_{i=1}^{N} \log p_{\theta}(r^{(i)} \mid q^{(i)}) \,, \qquad (2.52)$$

where $r^{(i)}$ is the predicate used in the formal query $f^{(i)}$.
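A corresponding training step could look as follows; this is a sketch assuming PyTorch, where RelationClassifier, the abstract encoder, and train_step are hypothetical names rather than components of a specific system. Minimizing the cross-entropy over the batch maximizes the log-likelihood of Eq. (2.52).

```python
import torch.nn as nn
import torch.nn.functional as F

class RelationClassifier(nn.Module):
    """Scores all candidate relations r given an encoded question q."""
    def __init__(self, encoder: nn.Module, d_model: int, n_relations: int):
        super().__init__()
        self.encoder = encoder                     # maps token ids to one vector per question
        self.out = nn.Linear(d_model, n_relations)

    def forward(self, q):
        return self.out(self.encoder(q))           # unnormalized scores over relations

def train_step(model, optimizer, q_batch, r_batch):
    # Cross-entropy on the batch = negative mean log p_theta(r | q), cf. Eq. (2.52).
    loss = F.cross_entropy(model(q_batch), r_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```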

Training taggers

A common type of task in NLP requires labeling every token in the input sequence. Such tagging models have many applications, such as POS tagging and NER, but can also be used in semantic parsing.

Training a tagger using MLE is equivalent to maximizing $p(y \mid x)$ for $(x, y)$ pairs from the given data $\mathcal{D}$, where $x$ is a sequence of words of length $L$ and $y$ is a sequence of tags of the same length.

The most basic method for training the tagger assumes independence between the different tags and factorizes $p(y \mid x)$ as follows:

$$p(y \mid x) = \prod_{i=1}^{L} p(y_i \mid x) \,. \qquad (2.53)$$

Relations between tags can be taken into account by using a conditional random field (CRF) on top of the encoder.
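Under the independence assumption of Eq. (2.53), the training loss reduces to a per-position cross-entropy, as in this sketch (assuming PyTorch; the encoder that produces the per-token scores is left abstract, and the function name is hypothetical):

```python
import torch
import torch.nn.functional as F

def tagging_loss(tag_scores: torch.Tensor, tags: torch.Tensor) -> torch.Tensor:
    """tag_scores: (batch, L, n_tags) per-token scores from some encoder;
    tags: (batch, L) gold tag indices. Eq. (2.53) makes the loss a sum
    (here: mean) of independent per-position cross-entropies."""
    batch, L, n_tags = tag_scores.shape
    return F.cross_entropy(tag_scores.reshape(batch * L, n_tags),
                           tags.reshape(batch * L))
```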

Training sequence generators

When the output structure is a sequence, the joint probability of the output sequence under a given model can be decomposed as the product of probabilities of each token in the sequence given all preceding tokens:

𝑝(𝑦|π‘₯) =

𝐿

Γ–

𝑖=0

𝑝(𝑦

𝑖|𝑦

<𝑖, π‘₯)= 𝑝(𝑦

1|π‘₯)𝑝(𝑦

2|𝑦

1, π‘₯)𝑝(𝑦

3|𝑦

2, 𝑦

1, π‘₯). . . 𝑝(𝑦

𝐿|𝑦

< 𝐿, π‘₯) , (2.54) where𝐿is the number of elements in𝑦.

Similarly to training a tagger, we maximize $p(y \mid x)$ for $(x, y)$ pairs from the given data $\mathcal{D}$. Different from training a tagger, the input sequence $x$ and the predicted output $y$ may have different lengths. The simplest way to train a sequence generator is using teacher forcing, where we feed the ground truth $y$ as input to the decoder (in the $y_{<i}$ variable) during training. Note that we cannot easily use the model's own predictions during training, since it is not always possible to determine what should happen when the generated token differs from the ground-truth token. However, in some cases it may be possible to define dynamic oracles that can generate the best action sequences for all possible states the decoded structure might be in.7 A possible disadvantage of teacher forcing is that the model is only exposed to sequences given in the data during training, while during generation entirely new combinations may be produced, on which the network might be more likely to fail.
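A minimal teacher-forcing step might look as follows; this is a sketch assuming PyTorch and a hypothetical model(x, decoder_input) that returns per-position vocabulary scores.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(model, x, y, bos_id: int):
    """x: encoder input; y: (batch, L) gold output token ids."""
    batch, L = y.shape
    bos = torch.full((batch, 1), bos_id, dtype=torch.long)
    # The decoder sees the gold tokens shifted right (the y_{<i} variable),
    # never its own predictions.
    decoder_input = torch.cat([bos, y[:, :-1]], dim=1)
    scores = model(x, decoder_input)                 # (batch, L, vocab)
    return F.cross_entropy(scores.reshape(-1, scores.size(-1)), y.reshape(-1))
```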

Note that while this left-to-right autoregressive model is standard in NLP for machine translation, semantic parsing, and other sequence-generating tasks, other factorizations and independence assumptions in the modeling of the joint probability $p(y \mid x)$ are possible [23].

2.3.2 Stochastic Gradient Descent

Deep neural networks are trained by minimizing a loss w.r.t. the parameters using Stochastic Gradient Descent (SGD). SGD-based training iterates over the training data, processing a subset of the examples at a time and updating the parameter values at every such step. The subset of the training data used in an update step is called a batch, and a whole pass over the training data is called an epoch.

SGD minimizes the loss by computing the gradient of the loss w.r.t. the parameters and taking a step against the gradient (in the direction that would lower the loss value) in every update step.
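A minimal sketch of this update loop (assuming PyTorch, with a made-up regression loss and random batches purely for illustration):

```python
import torch

w = torch.randn(10, requires_grad=True)      # parameters
optimizer = torch.optim.SGD([w], lr=0.1)

# One epoch over (here: random) batches.
for x, y in [(torch.randn(32, 10), torch.randn(32)) for _ in range(5)]:
    loss = ((x @ w - y) ** 2).mean()         # loss on this batch
    optimizer.zero_grad()
    loss.backward()                          # gradient of the loss w.r.t. w
    optimizer.step()                         # step against the gradient
```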

Computation of the gradient is achieved using backpropagation, which can be seen as a special case of reverse-mode auto-differentiation applied to neural networks. Where symbolic differentiation

7 Imagine, for example, a decoder that, in addition to generating tokens, can also produce actions to erase its previously generated token. In this case, an oracle can be constructed that trains the decoder to use erase actions when the generated tokens deviate from the ground truth.

2.4 Pretraining and transfer learning