

4.5 Analysis


Figure 4.4: External attention weight heat map for the example sentence "Observations of the photosphere of 47 Ursae Majoris suggested that the periodicity could not be explained by stellar activity, making the planet interpretation more likely."

Figure 4.5: Pooling vs. internal vs. external attention for the example sentence "Alternatively, the island was sometimes known as Brazil, and so might represent the same island as the Brazil off the west coast of Ireland."

Pooling vs. Internal vs. External Attention

For a qualitative analysis of the different selection mechanisms (pooling, internal attention, external attention), we randomly pick sentences from the Wikipedia test set and extract which parts of the input they select. For internal and external attention, this is straightforward since we can directly plot the attention weights α_i. For pooling, we calculate the relative frequency with which a value from an n-gram centered around a specific word is picked.
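To illustrate, the following minimal sketch plots such attention weights as a heat map over the tokens of a sentence. It assumes the weights have already been extracted from the model into a NumPy array; the token list and the random placeholder weights are purely illustrative, not output of our models.

    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder: one attention weight alpha_i per input token,
    # as produced by a softmax over the attention scores.
    tokens = ["Alternatively", ",", "the", "island", "was", "sometimes",
              "known", "as", "Brazil"]
    alphas = np.random.dirichlet(np.ones(len(tokens)))  # sums to 1

    fig, ax = plt.subplots(figsize=(8, 1.5))
    ax.imshow(alphas[np.newaxis, :], aspect="auto", cmap="Blues")
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=45, ha="right")
    ax.set_yticks([])
    fig.tight_layout()
    plt.show()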

In particular, we divide the absolute frequency by the total number of pooling results (for k-max pooling, this is k times the number of convolutional filters). Figure 4.5 shows the results of the three mechanisms for an example sentence; we provide figures for more sentences in the appendix (Section B.2). The observable patterns are similar across all randomly picked sentences: Pooling forwards information from different parts all over the sentence. It has minor peaks at relevant n-grams (e.g., “was sometimes known as” or “so might represent”) but also at non-relevant parts (e.g., “Alternatively” or “the same island”); there is no clear focus on uncertainty cues. Internal attention is more focused on the words that are relevant for uncertainty detection. External attention, finally, has the clearest focus since its training is guided by prior knowledge of cue phrases.
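The frequency computation for pooling can be sketched in a few lines, assuming the positions of the k-max pooled values can be read off the network; the array layout and names are our own illustration, not the exact implementation.

    import numpy as np

    def pooling_frequencies(pool_indices, seq_len):
        """Relative selection frequency per word position.

        pool_indices: (num_filters, k) array; entry (f, j) is the position
        whose n-gram produced the j-th k-max pooled value of filter f.
        """
        counts = np.bincount(pool_indices.flatten(), minlength=seq_len)
        # normalize by the total number of pooling results: k * num_filters
        return counts / pool_indices.size

    # illustrative usage: 100 filters, 3-max pooling, 20-word sentence
    idx = np.random.randint(0, 20, size=(100, 3))
    freq = pooling_frequencies(idx, seq_len=20)  # entries sum to 1.0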

4.5.2 Analysis of CNN vs. RNN-GRU

The behavior of the CNN and the RNN-GRU differs between the two domains, Wikipedia and Biomedical. While the results of the CNN and the RNN-GRU are comparable on Biomedical, the CNN clearly outperforms the RNN-GRU on Wikipedia. Table 4.8 compares characteristics of the two datasets that might affect model performance.

Although the Wikipedia dataset has a richer vocabulary (almost twice as many different words as the Biomedical dataset), it is better covered by our word embeddings, probably because they have also been trained on Wikipedia. Thus, the average number of out-of-vocabulary (OOV) words per sentence is lower. Also, the sentences are shorter on average.

                                 Wikipedia   Biomedical
    average sentence length [8]      21           27
    size of vocabulary             45,100       25,300
    average #OOVs per sentence       4.5          6.5

Table 4.8: Differences between the Wikipedia and Biomedical datasets.
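Statistics of this kind can be computed with a short helper; the following sketch assumes tokenized sentences and the embedding vocabulary given as a set (function and variable names are hypothetical):

    def corpus_stats(sentences, embedding_vocab):
        """sentences: list of token lists; embedding_vocab: set of words
        covered by the pre-trained word embeddings."""
        n = len(sentences)
        avg_len = sum(len(s) for s in sentences) / n
        vocab_size = len({w for s in sentences for w in s})
        avg_oov = sum(1 for s in sentences for w in s
                      if w not in embedding_vocab) / n
        return avg_len, vocab_size, avg_oov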

Figure 4.6: F1 results for different sentence lengths (x-axis: length of sentence, 0 to 200; y-axis: F1 score; curves: CNN and GRU).


All of these characteristics can influence model performance, especially because the two models process sentences differently: While the RNN-GRU merges all information into a single vector, the CNN extracts the most important phrases and ignores the rest. In the following paragraphs, we analyze the behavior of the two models with respect to sentence length, number of OOVs, and precision and recall scores.
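This architectural difference can be made concrete with a minimal PyTorch sketch; the dimensions and the single-filter-width setup are illustrative, not the configuration used in our experiments.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 30, 100)  # (batch, sentence length, embedding dim)

    # RNN-GRU: the final hidden state merges the whole sentence into one vector
    gru = nn.GRU(input_size=100, hidden_size=128, batch_first=True)
    _, h_n = gru(x)
    sentence_vec_rnn = h_n.squeeze(0)            # (1, 128)

    # CNN: convolution over n-grams, then max pooling keeps only the
    # strongest phrase per filter and ignores the rest of the sentence
    conv = nn.Conv1d(in_channels=100, out_channels=128,
                     kernel_size=3, padding=1)
    feats = torch.relu(conv(x.transpose(1, 2)))  # (1, 128, 30)
    sentence_vec_cnn, _ = feats.max(dim=2)       # (1, 128)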

Sentence Lengths

Figure 4.6 shows the F1 scores on Wikipedia of the CNN and the RNN-GRU with external attention for different sentence lengths. For a better overview, we discretize the lengths into intervals of 10, i.e., index 50 on the x-axis includes the scores for all sentences of length l ∈ [50, 60). Most sentences (96.2%) have length l < 50; only 0.1% of the sentences have length l > 100. The CNN outperforms the RNN consistently across sentence lengths, with larger differences for longer sentences.
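The bucketed evaluation can be sketched as follows, assuming aligned lists of sentences, gold labels, and predictions; scikit-learn's f1_score is our choice of helper, not necessarily the one used in the experiments. With key set to an OOV counter instead of len, the same function reproduces the analysis of the next paragraph.

    from collections import defaultdict
    from sklearn.metrics import f1_score

    def f1_by_bucket(sentences, gold, pred, key=len, width=10):
        """F1 per bucket, e.g., all sentences of length l in [50, 60)."""
        buckets = defaultdict(list)
        for i, sent in enumerate(sentences):
            buckets[(key(sent) // width) * width].append(i)
        return {b: f1_score([gold[i] for i in idx], [pred[i] for i in idx])
                for b, idx in sorted(buckets.items())}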

Number of Out-of-Vocabulary Words

Figure 4.7 shows a similar plot for F1 scores depending on the number of OOVs per sentence. Again, the CNN consistently outperforms the RNN-GRU, independent of the number of OOVs. This indicates that the uncertainty detection task is more challenging for the RNN-GRU in general, regardless of the number of out-of-vocabulary words.

[8] After tokenization with Stanford CoreNLP.


Figure 4.7: F1 results for different numbers of OOVs per sentence (x-axis: number of OOVs, 0 to 140; y-axis: F1 score; curves: CNN and GRU).

    model                            P      R
    CNN                             52.5   85.1
    CNN + external attention        58.6   78.3
    RNN-GRU                         75.2   49.6
    RNN-GRU + external attention    76.3   52.0

Table 4.9: Precision (P) and recall (R) scores of CNN and RNN-GRU on Wikipedia.

Precision and Recall

So far, we have only compared F1 scores of the different models. In this paragraph, we investigate precision and recall of the CNN and the RNN-GRU. Table 4.9 provides the values for four models on the Wikipedia dataset: CNN and RNN-GRU with and without external attention.

The scores reveal an important difference between the two models: The CNN models make high-recall predictions, while the predictions of the RNN-GRU models have higher precision. This suggests that the RNN-GRU predicts uncertainty more reluctantly than the CNN. The same analysis on Biomedical shows that the precision and recall values are almost the same for both models, which might explain why the performance of the models is similar on Biomedical but different on Wikipedia. A reason for this difference might be the distinct characteristics of the two corpora shown in Table 4.8: The larger vocabulary of the Wikipedia dataset, for example, might pose more challenges to the RNN than to the CNN in identifying uncertain sentences. As a result, its recall is considerably lower on the Wikipedia dataset.

Note that F1 scores can be optimized by tuning the prediction thresholds for the different classes. However, since the classification task is binary, we do not tune thresholds here but simply decide for the uncertain or certain class depending on which output score is higher.
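A sketch of this decision rule, next to the threshold-based alternative we do not use; the score layout and names are illustrative.

    import numpy as np

    def predict_argmax(scores):
        """scores: (num_sentences, 2) array, columns [certain, uncertain].
        Pick whichever class has the higher output score (our setting)."""
        return scores.argmax(axis=1)

    def predict_threshold(scores, t=0.5):
        """Alternative: tune a threshold t (e.g., on a development set)
        on the uncertain score to trade precision against recall."""
        return (scores[:, 1] >= t).astype(int)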