
a certain sentence. The examples from the Biomedical domain show that the language is very specific with many technical terms, which may pose challenges to statistical models.

4.2 Attention-based Models

Since uncertainty is often expressed by specific trigger patterns (hedge cues, see Section 4.1.1), such as “may have”, “probably”, “hardly” or “has been suggested”, we investigate attention-based models for uncertainty detection. In particular, we train CNNs and RNNs with gated recurrent units (GRU) and integrate different attention layers.

Although attention layers have been quite successful in NLP (see Section 2.2.3 or Section 4.7), the design space of architectures with attention layers has not been fully explored.

Therefore, we investigate different dimensions of attention in this thesis and develop novel ways to calculate attention weights and integrate them into neural networks. In particular, we make a first attempt to systematize the design space of attention and investigate three dimensions of this space: weighted vs. unweighted selection, sequence-agnostic vs. sequence-preserving selection, and internal vs. external attention. Our models are motivated by the characteristics of the uncertainty detection task. However, we believe that they can be useful for other NLP tasks as well since, for example, weighting can increase the flexibility and expressivity of a neural network, and external resources can add information which can be essential for good performance. Moreover, word order is often critical for meaning and can, therefore, be an important feature for different NLP tasks.

4.2.1 Overview of Model

Figure 4.1 depicts a high-level view of our model. The input sentence is tokenized using Stanford CoreNLP (Manning et al., 2014) and each token is represented by a word embedding. As before, we use word embeddings trained on Wikipedia (see Section 2.2.3). The resulting word embedding matrix is then either processed by a CNN+3-max-pooling layer or by a bidirectional gated RNN (RNN-GRU) with gradient clipping to avoid exploding gradients. Then, attention is applied, either on the word embedding matrix or on the CNN/RNN-GRU output (indicated by dashed lines in Figure 4.1; cf. Section 4.2.2). Since k-max pooling (CNN) and recurrent hidden layers with gates (RNN-GRU) have strengths complementary to attention, we experiment with concatenating the attention information a ∈ R^A to the neural sentence representation r ∈ R^m, which is either the CNN pooling result or the last hidden state of the RNN-GRU. The final hidden layer then has the following form:

h = \tanh(W_{ah} a + W_{rh} r + b)    (4.1)

with W_{ah} ∈ R^{H×A} and W_{rh} ∈ R^{H×m} being weight matrices and b ∈ R^H being the bias vector learned during training.
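As a minimal sketch of Equation 4.1, the following PyTorch module (an illustrative reimplementation, not the original thesis code; all layer sizes are hypothetical) combines the attention vector a with the CNN/RNN-GRU sentence representation r:

```python
import torch
import torch.nn as nn

class AttentionCombination(nn.Module):
    """Final hidden layer of Eq. 4.1: h = tanh(W_ah a + W_rh r + b).

    Combines the attention output a (dimension A) with the CNN pooling result
    or the last RNN-GRU hidden state r (dimension m). Dimension names follow
    the text; the concrete sizes used below are illustrative only.
    """

    def __init__(self, attention_dim: int, repr_dim: int, hidden_dim: int):
        super().__init__()
        self.W_ah = nn.Linear(attention_dim, hidden_dim, bias=False)
        self.W_rh = nn.Linear(repr_dim, hidden_dim, bias=True)  # carries the bias b

    def forward(self, a: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # a: (batch, A), r: (batch, m) -> h: (batch, H)
        return torch.tanh(self.W_ah(a) + self.W_rh(r))


# Illustrative usage with hypothetical sizes (A=100, m=200, H=50):
layer = AttentionCombination(attention_dim=100, repr_dim=200, hidden_dim=50)
h = layer(torch.randn(8, 100), torch.randn(8, 200))  # shape (8, 50)
```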



Figure 4.1: Network overview: combination of attention and CNN/RNN-GRU output. For details on attention, see Figure 4.2.

4.2.2 Dimensions of Attention Design Space

In this section, we identify three dimensions of the design space of attention. Furthermore, we propose new architectures along those dimensions.

Weighted vs. Unweighted Selection

Pooling, as widely applied in CNN models, can be seen as unweighted selection: it extracts the average or maximum values without applying any weights to them. In contrast, attention can be thought of as a weighted selection mechanism: The input elements are weighted by the attention weights, which allows the model to focus on a few highly weighted elements and ignore other elements, which have received weights close to zero. The advantage of weighted selection is that the model learns to decide based on the input how many values it should select. In contrast, pooling either selects all values (average pooling) or k values (k-max pooling). Thus, if we apply pooling to uncertainty detection and there are more than k uncertainty cues in a sentence, k-max pooling is not able to focus on all of them.
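To make the contrast concrete, the following small sketch shows the two unweighted selection operations mentioned above (illustrative PyTorch code, not taken from the thesis):

```python
import torch

def average_pooling(x: torch.Tensor) -> torch.Tensor:
    """Unweighted selection over all positions. x: (seq_len, dim) -> (dim,)"""
    return x.mean(dim=0)

def k_max_pooling(x: torch.Tensor, k: int) -> torch.Tensor:
    """Unweighted selection of exactly k values per dimension, no matter how
    many uncertainty cues the sentence actually contains.
    (The original k-max pooling of Kalchbrenner et al. (2014) additionally
    keeps the selected values in their input order; omitted here for brevity.)
    x: (seq_len, dim) -> (k, dim)"""
    values, _ = torch.topk(x, k, dim=0)
    return values
```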

Sequence-agnostic vs. Sequence-preserving Selection

Attention is generally implemented as a weighted average of input vectors:

a = \sum_i a_i    (4.2)

with a_i ∈ R^A being the weighted input vectors: a_i = α_i · X_i^⊤ (see Equation 2.21). As in Equation 2.21, α_i denotes the attention weight and X_i is the vector which should be weighted by attention. We call this sum an average because the α_i are normalized to sum to 1 (see Equation 2.22) and the standard term for this is “weighted average”. In this work, we introduce a variant of this: k-max average attention:

a = \sum_{\mathrm{rank}_j(\alpha_j) \leq k} a_j    (4.3)

where rank_j(α_j) is the rank of α_j in the list of attention weights α in descending order.

This type of averaging may be more robust than the normal average over all elements because elements with low weights (which may just be noise) will be ignored. Note that the weights α are still normalized to sum to 1 for the whole sequence, not just for the k selected values.
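A minimal sketch of standard attention averaging (Equation 4.2) and the proposed k-max average attention (Equation 4.3) could look as follows (illustrative PyTorch code; the attention weights α are assumed to be given, e.g., computed as in Equation 2.22):

```python
import torch

def attention_average(X: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Standard weighted average (Eq. 4.2): a = sum_i alpha_i * X_i.
    X: (seq_len, A) focus vectors, alpha: (seq_len,) normalized weights."""
    weighted = alpha.unsqueeze(1) * X          # a_i = alpha_i * X_i
    return weighted.sum(dim=0)                 # (A,)

def k_max_average_attention(X: torch.Tensor, alpha: torch.Tensor,
                            k: int) -> torch.Tensor:
    """k-max average attention (Eq. 4.3): sum only the a_j whose attention
    weight is among the k largest. alpha stays normalized over the whole
    sequence; positions with low (possibly noisy) weights are ignored."""
    weighted = alpha.unsqueeze(1) * X
    _, top_idx = torch.topk(alpha, k)          # positions of the k largest weights
    return weighted[top_idx].sum(dim=0)        # (A,)
```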

Taking an average of input values implies that all ordering information from the input is lost and cannot be recovered by the next layer. Average pooling or max pooling is also sequence-agnostic. However, we argue that order information is needed for some NLP classification tasks. An example is when negation or uncertainty scopes need to be considered. For uncertainty detection, for instance, the order of the input can help distinguish phrases like “it is not uncertain that X is Basque” and “it is uncertain that X is not Basque”. For pooling, there exists a variant which is (at least partly) sequence-preserving: k-max pooling (Kalchbrenner et al., 2014) (see Section 2.2.3). It outputs the subsequence with the highest elements of the input sequence in their original order.

Similarly, we propose two sequence-preserving ways of attention. The first one is k-max sequence:

a = [\, a_j \mid \mathrm{rank}_j(\alpha_j) \leq k \,]    (4.4)

where [a_j | P(a_j)] denotes the subsequence of sequence A = [a_1, ..., a_J] from which members not satisfying predicate P have been removed. Note that the resulting sequence a ∈ R^{A×k} is in the original order of the input, i.e., not sorted by value. The second sequence-preserving attention method is k-max pooling. It ranks each dimension of the vectors individually, thus the resulting values can stem from different input positions.

This is the same as standard k-max pooling in CNNs except that each vector element in a_j has been weighted (by its attention weight α_j), whereas in standard k-max pooling it is considered as is. Below, we also refer to k-max sequence as “per-pos” (since it selects all values of the k positions with the highest attention weights) and to k-max pooling as “per-dim” (since it selects the k largest weighted values per dimension) to distinguish it from the k-max pooling done by the CNN. Note that CNNs and RNNs capture some information about the sequence in their outputs as well: The length of the sequences CNNs are able to handle is limited to the width of their filters. RNNs can theoretically capture information of longer sequences as well; however, all information gets conflated in their hidden state, which has a limited storage capacity. In contrast, sequence-preserving attention explicitly stores the most relevant inputs in their order of appearance.
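The two sequence-preserving variants can be sketched as follows (again illustrative PyTorch code with given attention weights; “per-pos” corresponds to k-max sequence in Equation 4.4, “per-dim” to the weighted k-max pooling described above):

```python
import torch

def k_max_sequence(X: torch.Tensor, alpha: torch.Tensor, k: int) -> torch.Tensor:
    """'per-pos' (Eq. 4.4): keep the weighted vectors a_j of the k positions
    with the highest attention weights, in their original input order.
    X: (seq_len, A), alpha: (seq_len,) -> (k, A)"""
    weighted = alpha.unsqueeze(1) * X          # a_j = alpha_j * X_j
    _, top_idx = torch.topk(alpha, k)          # positions of the k largest weights
    top_idx, _ = torch.sort(top_idx)           # restore original word order
    return weighted[top_idx]

def k_max_pooling_weighted(X: torch.Tensor, alpha: torch.Tensor,
                           k: int) -> torch.Tensor:
    """'per-dim': rank each dimension of the weighted vectors individually,
    so the selected values may stem from different input positions.
    X: (seq_len, A), alpha: (seq_len,) -> (k, A)"""
    weighted = alpha.unsqueeze(1) * X
    _, idx = torch.topk(weighted, k, dim=0)    # per-dimension top-k positions
    idx, _ = torch.sort(idx, dim=0)            # keep original order per dimension
    return torch.gather(weighted, 0, idx)
```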

Internal vs. External Attention

When designing the attention layer, we distinguish between focus and source of attention. The focus of attention is that layer of the network which is reweighted by attention weights, corresponding to X in Equation 2.21. We investigate two options in this thesis: focus on the input, i.e., the matrix of word vectors, and focus on the hidden representation, i.e., the output of the convolutional layer of the CNN or the hidden layers of the RNN-GRU. For focus on the input, we first apply tanh to the word vectors to improve results (see Figure 4.2).

The source of attention is the information source used to compute the attention weights, corresponding to the input of e in Equation 2.22.

Note that source and focus of attention are slightly related to keys and values used in memory networks (Miller et al., 2016): Similar to sources, keys are used to calculate weights. Similar to focuses, values are weighted and passed to the next layer of the network.

However, while there is only one key for each memory cell, our conceptualization allows several sources. Moreover, we do not group specific sources and focuses into pairs like key-value pairs; instead, sources and focuses are independent of each other. Finally, there is no need to store focus vectors in a memory.

For attention in the literature, both focus and source are based only on information internally available to the network (through input or hidden layers). We call this internal attention.² This is also formalized by Equation 2.23 and Equation 2.24 with X being the internal information used for focus and source of attention. Attention in machine translation or question answering, for example, computes attention weights by comparing two vectors, namely the source sentence to the previously translated target representation or the document to the question representation (see Equation 2.25). In this case, the source of attention (X and c) contains more information than the focus (X); however, all of this information is given by the intermediate hidden or output representations (previous translations) or the task input (question). We, therefore, also categorize this as internal attention.

In this work, we investigate increasing the scope of the source beyond the input and, thus, making the attention mechanism more powerful. We refer to an attention layer as external attention if its source includes an external resource. For uncertainty detection, for instance, it can be beneficial to give the model a lexicon of seed cue words or phrases. Thus, we provide the network with additional information to bring to bear on identifying and summarizing features. This can simplify the training process by guiding the model towards recognizing uncertainty cues. The specific external-attention scoring function we use for uncertainty detection is similar to the one used, e.g., in machine translation (see Equation 2.25) with c := C_j being an additional input from an external source. It is parameterized by U_1 ∈ R^{d×A}, U_2 ∈ R^{d×E} and v ∈ R^d and defined as follows:

e(X_i, C) = \sum_j v^{\top} \tanh(U_1 X_i^{\top} + U_2 C_j)    (4.5)

where C_j ∈ R^E is a vector representing a cue phrase j of the training set. We compute C_j as the average of the embeddings of the constituting words of j.
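The following sketch shows one possible implementation of this scoring function together with the normalization of Equation 2.22 (illustrative PyTorch code; dimension names A, E and d follow the text, the concrete sizes in the usage example are hypothetical):

```python
import torch
import torch.nn as nn

class ExternalAttentionScorer(nn.Module):
    """Scoring function of Eq. 4.5: e(X_i, C) = sum_j v^T tanh(U1 X_i^T + U2 C_j).

    X holds the focus vectors (e.g. word embeddings), C the cue-phrase
    embeddings from an external lexicon (each C_j is the average embedding
    of the words of a cue phrase)."""

    def __init__(self, focus_dim: int, cue_dim: int, hidden_dim: int):
        super().__init__()
        self.U1 = nn.Linear(focus_dim, hidden_dim, bias=False)   # U1 in R^{d x A}
        self.U2 = nn.Linear(cue_dim, hidden_dim, bias=False)     # U2 in R^{d x E}
        self.v = nn.Linear(hidden_dim, 1, bias=False)             # v in R^d

    def forward(self, X: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
        # X: (seq_len, A), C: (num_cues, E) -> attention weights alpha: (seq_len,)
        # Compare every input position to every cue phrase, then sum over cues.
        scores = self.v(torch.tanh(self.U1(X).unsqueeze(1) + self.U2(C).unsqueeze(0)))
        e = scores.squeeze(-1).sum(dim=1)       # (seq_len,)
        return torch.softmax(e, dim=0)          # normalization as in Eq. 2.22


# Illustrative usage: 20 tokens with 100-dim embeddings, 50 cue phrases, d = 60.
scorer = ExternalAttentionScorer(focus_dim=100, cue_dim=100, hidden_dim=60)
alpha = scorer(torch.randn(20, 100), torch.randn(50, 100))  # (20,), sums to 1
```

The resulting weights α_i can then be combined with any of the selection mechanisms above (weighted average, k-max average, per-pos, or per-dim).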

² Gates, e.g., the weighting of h_{t−1} in Equation 2.17, can also be viewed as internal attention mechanisms.



Figure 4.2: Internal attention on (1) input and (2) hidden representation. External attention on (3) input and (4) hidden representation. For the whole network structure, see Figure 4.1.

Thus, the attention layer consists of a feed-forward hidden layer of size d, which compares the input word X_i to each cue vector C_j and then sums the results. The weights U_1, U_2 and v are learned during training. When applying this function e in Equation 2.22, the attention weights α_i show how important each input X_i is for uncertainty detection given the external knowledge about possible cue phrases C_j. The underlying intuition is similar to attention for machine translation, which learns alignments between source and target sentences. Instead, we learn “alignments” between the input and external cue phrases.

Since we use embeddings to represent words and cues, uncertainty-indicating phrases that did not occur in training but are similar to training cue phrases can also be recognized.

To apply external attention with the proposed scoring function e to another task, only a set of vectors C_j needs to be created which can be used to determine the relevance of the input embeddings to the task at hand.

Note that external attention is not the only way of incorporating prior knowledge into a neural network model. (An alternative could be the extension of the input word embeddings with flags indicating whether the word is part of a cue phrase or not.) It has rather been designed to improve and speed up the training of the attention module of a neural network. Typically, the attention module is fully unsupervised. In contrast, the cue vectors of external attention provide additional signals to guide the training of the network in the desired direction.

Figure 4.2 shows the four settings arising from our distinction between source and focus of attention: (1) internal attention on the input, (2) internal attention on the hidden representation, (3) external attention on the input, and (4) external attention on the hidden representation. A schematic view of internal and external attention as defined via source and focus is given in Figure 4.3.



Figure 4.3: Schemes of focus and source. Left: internal attention; right: external attention.

domain        train    test
Wikipedia     11,111   9,634
Biomedical    14,541   5,003

Table 4.2: Statistics of CoNLL2010 hedge cue detection datasets.