
Chapter 7. End-to-End Modeling Approaches

have longer input sequences and output graphs, a much larger vocabulary on both sides, a larger variety of mentions for the same concept, more input tokens that are neither a concept nor a relation mention, and a noisier relationship between input and output. The next section carries out experiments on data that exhibits more of these properties.

7.5. End-to-End Experiments

| Approach | Educ-Train METEOR (Pr / Re / F1) | Educ-Train ROUGE (Pr / Re / F1) | Educ-Test METEOR (Pr / Re / F1) | Educ-Test ROUGE (Pr / Re / F1) |
|---|---|---|---|---|
| *Pipeline* | | | | |
| Corpus Baseline | – | – | 15.12 / 19.49 / **17.00** | 6.03 / 17.98 / 8.91 |
| Improved Pipeline | – | – | 15.14 / 17.34 / 16.12 | 9.37 / 11.93 / **10.38** |
| *Sequence transduction* | | | | |
| DMSyn-Triples-Tuned | 21.56 / 9.40 / 13.09 | 14.80 / 1.98 / 3.22 | 19.04 / 9.74 / **12.65** | 13.47 / 2.18 / **3.62** |
| EducSyn-Triples-Tuned | 31.93 / 19.38 / 24.00 | 36.50 / 12.36 / 18.29 | 9.72 / 7.17 / 8.17 | 2.56 / 0.78 / 1.16 |
| EducSyn-Triples-Trans | 37.14 / 24.68 / **29.23** | 41.99 / 19.34 / **26.14** | 11.62 / 8.46 / 9.74 | 2.66 / 0.82 / 1.21 |
| EducSyn-Triples-Fixed | 29.34 / 18.62 / 22.41 | 33.09 / 11.75 / 16.75 | 11.16 / 7.75 / 8.99 | 3.37 / 1.12 / 1.62 |
| EducSyn-Concepts-Tuned | 20.44 / 18.67 / 19.48 | 26.50 / 16.79 / 20.50 | 8.78 / 7.88 / 8.25 | 1.86 / 1.00 / 1.29 |
| *Sequence-to-graph* | | | | |
| DMSyn-Tuned | 14.84 / 10.39 / 11.94 | 12.89 / 3.10 / 4.53 | 16.53 / 11.15 / **12.96** | 11.98 / 2.45 / **3.70** |
| EducSyn-Tuned | 26.79 / 18.06 / 21.48 | 28.73 / 14.88 / 19.18 | 10.11 / 7.75 / 8.71 | 1.45 / 1.00 / 1.14 |
| EducSyn-Tuned-Beam | 25.27 / 21.18 / **23.00** | 25.31 / 21.53 / **23.11** | 9.49 / 8.17 / 8.72 | 1.31 / 1.35 / 1.30 |
| EducSyn-Trans | 28.27 / 17.86 / 21.72 | 39.10 / 12.86 / 19.10 | 13.75 / 8.66 / 10.52 | 4.57 / 1.55 / 2.29 |
| EducSyn-Trans-Beam | 24.83 / 20.73 / 22.49 | 28.70 / 18.99 / 22.66 | 12.00 / 0.92 / 10.37 | 3.04 / 1.88 / 2.31 |

Table 7.4: End-to-end results on the train and test sections of Educ for pipeline and neural approaches. Each cell gives Pr / Re / F1. Boldface highlights the best F1-score per group. Sequence transduction results are the same as reported in Table 7.2 and Table 7.3 and are repeated for easier comparison.
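As a quick sanity check on the table, recall that the reported F1-scores are the harmonic mean of precision (Pr) and recall (Re). The small helper below (our own illustration, not part of the evaluation code) reproduces the F1 of the DMSyn-Triples-Tuned train METEOR row from its Pr and Re values:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, F1 = 2*Pr*Re / (Pr + Re)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# DMSyn-Triples-Tuned, Educ-Train METEOR: Pr 21.56, Re 9.40 -> F1 13.09
print(round(f1(21.56, 9.40), 2))  # -> 13.09
```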


[Figure 7.4: two predicted summary concept maps]
(a) Sequence transduction (DMSyn-Triples-Tuned)
(b) Sequence-to-graph (DMSyn-Tuned)

Figure 7.4: Summary concept maps predicted for the test topic “students loans without credit history”. Refer to the maps in Figure 6.2 (pipeline) and Figure 4.5 (reference) for comparison. Predicted maps are shown completely. Concepts in green (semantically) match a reference concept.

of the sequence-to-graph model similarly. As a consequence, this experimental comparison also makes it hard to draw any conclusions about the superiority of either the sequence transduction or the sequence-to-graph model. For both, the version trained on the broader DMSyn data transfers best to the test data, with only marginally different scores (not significant). We expect the difference between the two architectures to become more pronounced when they are trained on broader and higher-quality data.

Graph Decoding A main difference between the sequence transduction and the sequence-to-graph model is how concept maps are predicted. Due to the ILP decoding, the graphs predicted by the sequence-to-graph model are guaranteed to be connected, and thus no disconnected parts need to be discarded. As a result, the graphs predicted by the model trained on DMSyn have on average 8.6 nodes and 13.7 edges, while the corresponding sequence transduction model produces smaller graphs with only 5.1 nodes and 4.3 edges. The models trained on EducSyn predict graphs of around 11 nodes with both architectures. Here, using the variant of the decoding ILP that chooses among the top-10 beam hypotheses for each node label lets the sequence-to-graph model predict even bigger graphs, with more than 17 nodes on average. However, we observed that this setup often introduces several nodes with very similar labels, i.e. concepts that are redundant and not grouped correctly. Given that the scores of these bigger graphs are also not substantially better on Educ-Test, using the most probable label seems preferable. In general, both architectures struggle to predict graphs as large as the reference maps in the current setup.
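To illustrate why the sequence transduction model ends up with smaller maps, consider that it emits a flat list of (concept, relation, concept) triples that need not form a connected graph, so disconnected parts have to be discarded, e.g. by keeping only the largest connected component. The sketch below is our own minimal illustration of that post-processing step (function and variable names are ours, not the thesis code):

```python
from collections import defaultdict, deque

def largest_connected_component(triples):
    """Keep only the triples whose source concept lies in the largest
    connected component of the (undirected) concept graph."""
    adjacency = defaultdict(set)
    for source, _, target in triples:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen, components = set(), []
    for node in adjacency:
        if node in seen:
            continue
        component, queue = {node}, deque([node])
        seen.add(node)
        while queue:  # breadth-first traversal of one component
            for neighbor in adjacency[queue.popleft()]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)

    biggest = max(components, key=len, default=set())
    return [t for t in triples if t[0] in biggest]

# Hypothetical predicted triples; the third one is disconnected
# from the rest and would be dropped.
triples = [
    ("loans", "do not require", "credit checks"),
    ("students", "can apply for", "loans"),
    ("banks", "issue", "credit cards"),
]
print(largest_connected_component(triples))
```

The ILP decoding of the sequence-to-graph model avoids this pruning entirely by enforcing connectivity as a constraint during decoding, which explains the larger graphs it produces.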


[Figure 7.5: two predicted summary concept maps]
(a) Sequence transduction (DMSyn-Triples-Tuned)
(b) Sequence-to-graph (DMSyn-Tuned)

Figure 7.5: Summary concept maps predicted for the test topic “parents dealing with their kids being cyber-bullied”. Map (b) is shown only partially; the complete predicted concept map has 13 concepts and 12 relations. Concepts in green (semantically) match a reference concept.

7.5.3 Qualitative Analysis

To better understand the challenges that our neural models face and to determine whether there is any qualitative difference between the concept maps produced by the two alternative architectures, we also manually looked at the concept maps predicted on the test set topics. Figure 7.4 and Figure 7.5 show summary concept maps that have been predicted by models trained on the DMSyn dataset with the two architectures.

Comparison of Architectures The manual inspection confirmed the aforementioned difference that maps predicted by the sequence-to-graph model tend to be bigger. However, the sizes for both architectures vary considerably across topics; for instance, the map in Figure 7.4 (b) has only three concepts although it is produced by the sequence-to-graph network. For other topics, the sequence transduction model predicted maps with only one concept. A problem specific to the sequence-to-graph model is that it sometimes uses a single relation label for many relations of the map. In Figure 7.5 (b), three relations (seven in the full map) use the label are. We observed that such behavior occurs when the model does not clearly address specific memory cells, resulting in several cells with similar contents and correspondingly similar labels after decoding. Apart from these observations, however, we did not find any consistent qualitative differences indicating that one of the two architectures is clearly superior.

Extraction and Grouping The analysis further revealed several issues that occur with both architectures but are unique to the neural models, explaining to some extent the lower quality of the maps compared to those created by the pipeline. Repetitions in concept labels, such as credit credit loans loans (Figure 7.4 (a)), and propositions that are rather meaningless (the kids - are getting bullied for - the ones, Figure 7.5 (a)) or not asserted by the input text (parent - have killed - bullying, Figure 7.5 (b)) are part of the predicted summary maps. Since the models can freely generate concept and relation labels token by token, the chance that meaningless or factually wrong information is introduced is higher than in fully extractive approaches. Another issue is duplicate concepts, as in Figure 7.5 (a), where the child, your child and the kids all refer to children, which can be attributed to the low quality of the synthetic training data. Since the examples were created with the pipeline approach, which does not always group concept mentions correctly (see Section 6.5), it is very difficult, if not impossible, for the neural model to learn to group mentions correctly from these noisy examples.
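The duplicate-concept problem can be made concrete with a toy grouping step. The sketch below (our own illustration, much simpler than the pipeline's actual grouping) merges mentions that share the same tokens after dropping determiners and possessives: "the child" and "your child" collapse into one group, but "the kids" remains separate because nothing maps "kids" to "child", which is exactly the kind of incomplete grouping that ends up in the synthetic training labels.

```python
# Hypothetical, simplified mention grouping; not the thesis pipeline.
DETERMINERS = {"the", "a", "an", "your", "his", "her", "their", "its"}

def group_mentions(mentions):
    """Group mentions by their label after stripping determiners."""
    groups = {}
    for mention in mentions:
        tokens = [t for t in mention.lower().split() if t not in DETERMINERS]
        key = " ".join(tokens)
        groups.setdefault(key, []).append(mention)
    return groups

print(group_mentions(["the child", "your child", "the kids"]))
# -> {'child': ['the child', 'your child'], 'kids': ['the kids']}
```

A neural model trained on labels produced by such a grouper inherits its mistakes: when the training data treats "the kids" and "your child" as distinct concepts, the model has no signal from which to learn the correct merge.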

Summarization With regard to content selection, the predicted summary concept maps show encouraging results. As the examples show, all included concepts are typically related to the topic, most of them are very central concepts, and a large fraction indeed matches the reference map. However, as mentioned earlier, the predicted concept maps tend to be smaller than the reference maps, and it is thus unclear how well content would be selected for bigger maps. Models trained on EducSyn show a different behavior (not shown in the figures): they tend to include similar concepts in all predicted maps, independent of the input text, and the maps are therefore not very useful. This is a manifestation of the topic shift problem discussed earlier, which makes it difficult for the neural models to predict the topic-specific gold concepts of the test topics, as these were not seen at training time.