
6.5 Full Pipeline Experiments

6.5.3 Results

Overall Results Tables 6.9 and 6.10 show METEOR and ROUGE scores for our pipeline and all baselines and ablations on both datasets. Compared to the corpus baseline, our model improves performance in terms of ROUGE on both datasets and according to METEOR on Wiki. These task-level results show that the newly proposed components for CM-MDS, which we already demonstrated to be effective in isolation in the previous sections, also form an effective pipeline for the full task of CM-MDS.

However, the lower METEOR scores for our pipeline on Educ contradict that general observation. We looked into these results in detail and found that the high scores of the baseline are due to heavy overgeneration during relation extraction, which introduces many rather meaningless relations into the concept map. Since METEOR scores can be improved by incorrect relations as long as they hold between a correct pair of concepts, leading to a partial match of the proposition, that undesired overgeneration behavior is rewarded by the metric. Hence, the baseline only obtains its higher scores by sacrificing the quality of the propositions, introducing many rather uninformative ones.
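For illustration, the following minimal Python sketch uses a simple unigram-overlap F1 as a stand-in for METEOR's alignment-based matching (the real metric additionally uses stemming, synonymy and a fragmentation penalty). The propositions are invented examples; they only show how a wrong relation between two correct concepts still earns a high partial match.

# Simplified illustration of the partial-match effect: unigram-overlap F1 as
# a stand-in for METEOR's alignment-based matching (the real metric also uses
# stemming, synonymy and a fragmentation penalty). Propositions are invented.

def overlap_f1(candidate: str, reference: str) -> float:
    """Unigram F1 between a candidate and a reference proposition."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    matches = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if matches == 0:
        return 0.0
    precision, recall = matches / len(cand), matches / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "students with disabilities receive special education services"
correct = "students with disabilities receive special education services"
wrong_relation = "students with disabilities mention special education services"

print(overlap_f1(correct, reference))         # 1.00: exact match
print(overlap_f1(wrong_relation, reference))  # 0.86: wrong relation, same concepts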


Educ                      METEOR                        ROUGE
                     Pr     Re     F1     p        Pr     Re     F1     p

Corpus Baseline      15.12  19.49  17.00  .1937    6.03   17.98  8.91   .1156
Improved Pipeline    15.14  17.34  16.12           9.37   11.93  10.38

Ablations
  Grouping
    Lemma            13.93  15.42  14.57  .0023    8.21   8.59   8.25   .0017
    CoreNLP          14.14  15.21  14.54  .0077    7.99   6.78   7.26   .0095
  Scoring
    PageRank         11.78  16.21  13.61  .0005    7.14   11.66  8.66   .0052
    Frequency        11.89  16.12  13.65  .0002    7.33   12.09  8.97   .0124
    CF-IDF           12.48  16.44  14.15  .0002    7.68   12.08  9.25   .0235
  Selection
    Heuristic        15.29  17.46  16.26  .6250    9.38   11.88  10.38  .9688

Table 6.9: End-to-end results on the Educ test set for our pipeline and several baselines. P-values are computed with a permutation test that compares F1-scores against our pipeline.


As suggested in Section 3.5.2, we carried out an additional human evaluation of the two systems to also assess aspects beyond the content-oriented automatic metrics, including the differences in quality discussed above. Following our proposed evaluation protocol, the concept maps generated by both approaches for each test topic were shown to crowdworkers on Amazon Mechanical Turk, who were asked for their preference. We collected five preferences for each of the 15 pairs of maps. Table 6.11 presents the results, showing that our concept maps tend to have more meaningful and topic-focused propositions and are in particular more grammatical and less redundant than those generated by the baseline.
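Assuming each reported share is simply the fraction of the 75 individual judgments (five workers times 15 pairs) that favor a system, the aggregation amounts to counting votes. The sketch below uses made-up vote counts that reproduce two rows of Table 6.11; the actual raw judgments are not reproduced here.

from collections import Counter

# Made-up vote counts (five workers x 15 pairs = 75 judgments per dimension),
# chosen so that the resulting shares match two rows of Table 6.11; the real
# raw judgments are not reproduced here.
judgments = {
    "Grammaticality": ["pipeline"] * 52 + ["baseline"] * 23,
    "Non-Redundancy": ["pipeline"] * 59 + ["baseline"] * 16,
}

for dimension, votes in judgments.items():
    counts = Counter(votes)
    shares = {system: round(n / len(votes), 2) for system, n in counts.items()}
    print(dimension, shares)
# Grammaticality {'pipeline': 0.69, 'baseline': 0.31}
# Non-Redundancy {'pipeline': 0.79, 'baseline': 0.21}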

Concept Grouping To analyze the contribution of our concept grouping approach to the overall result, we compare our pipeline's performance against the Lemma and CoreNLP ablations in Tables 6.9 and 6.10. Both alternatives cause a drop in both metrics on Educ and Wiki, showing that our approach is important for the pipeline's performance. The alternative methods are more conservative, merging far fewer concept mentions than necessary, but at the same time tend to lump too many different mentions together, in particular the CoreNLP variant. In contrast, our model can make many more merges by relying on different types of semantic similarity and at the same time manages to avoid lumping effects by relying on the global partitioning approach.
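The lumping effect and the benefit of a global view can be illustrated with a small toy example. The sketch below contrasts naive transitive merging with a brute-force, correlation-clustering-style objective as a stand-in for the global partitioning used in the pipeline; the mention strings, pairwise similarities and the 0.5 threshold are invented placeholders, not the output of our actual mention-pair classifier.

from itertools import combinations

# Toy contrast between transitive merging and a global partitioning objective.
# Mentions, pairwise similarities and the 0.5 threshold are invented
# placeholders, not the output of our actual mention-pair classifier.
mentions = ["students", "pupils", "student loans"]
sim = {("students", "pupils"): 0.9,          # clearly the same concept
       ("pupils", "student loans"): 0.6,     # borderline, above threshold
       ("students", "student loans"): 0.1}   # clearly different

# Transitive merging: students~pupils and pupils~"student loans" both exceed
# the threshold, so following the chain lumps all three mentions together.

# Global partitioning instead searches for the partition that agrees best
# with ALL pairwise scores (brute force, correlation-clustering style).
def partitions(items):
    if not items:
        yield []
        return
    head, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] + [head]] + part[i + 1:]
        yield part + [[head]]

def objective(part):
    cluster_of = {m: i for i, cl in enumerate(part) for m in cl}
    return sum((sim[(a, b)] - 0.5) if cluster_of[a] == cluster_of[b]
               else (0.5 - sim[(a, b)])
               for a, b in combinations(mentions, 2))

print(max(partitions(mentions), key=objective))
# [['student loans'], ['pupils', 'students']] -- no lumping of "student loans"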


Wiki                      METEOR                        ROUGE
                     Pr     Re     F1     p        Pr     Re     F1     p

Corpus Baseline      14.30  23.11  17.46  .5024    6.77   23.18  10.20  .2610
Improved Pipeline    19.57  18.98  19.18           17.00  10.69  12.91

Ablations
  Grouping
    Lemma            18.32  17.24  17.59  .2050    13.99  9.53   11.07  .3270
    CoreNLP          16.81  16.63  16.59  .1084    13.09  9.16   10.29  .1554
  Scoring
    PageRank         13.27  14.13  13.62  .0062    8.35   6.17   7.01   .0097
    Frequency        13.44  13.79  13.55  .0071    8.57   7.16   7.61   .0205
    CF-IDF           14.63  14.92  14.72  .0189    10.50  7.91   8.87   .0450
  Selection
    Heuristic        18.22  17.80  17.94  .3008    14.73  9.74   11.51  .3594

Table 6.10: End-to-end results on the Wiki test set for our pipeline and several baselines. P-values are computed with a permutation test that compares F1-scores against our pipeline.


Importance Estimation The contribution of our supervised scoring model based on Ranking SVMs can be seen in Tables 6.9 and 6.10 when comparing it to the unsupervised ablations PageRank, Frequency and CF-IDF. Note that all variants use the same concepts and relations as input and the same ILP-based subgraph selection. Our model clearly outperforms all alternatives. Looking into the learned weights for our set of features, we observed that the most helpful features are frequencies, in particular document frequency and CF-IDF, and topic relatedness as well as PageRank scores. The fact that these highest-weighted features coincide with the metrics used by the ablations, which we chose based on previous work, shows that we indeed compare our model against the most competitive unsupervised alternatives. To identify unimportant concepts, i.e. to assign low scores, the model makes use of concreteness values and the length of a concept's label.
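As a rough illustration, a Ranking SVM can be trained with the standard pairwise transform: a linear SVM is fit on feature-vector differences of concept pairs whose gold importance differs, and the learned weight vector then scores individual concepts. The sketch below follows this recipe with scikit-learn on random placeholder data; the six feature columns are meant to mirror the features discussed above, but the exact feature set and training setup of our model differ.

import numpy as np
from sklearn.svm import LinearSVC

# Sketch of a Ranking SVM via the standard pairwise transform. The data is
# random placeholder material; the six feature columns are meant to mirror
# features such as document frequency, CF-IDF, PageRank, topic relatedness,
# concreteness and label length, not our actual feature extraction.
rng = np.random.default_rng(0)
X = rng.random((100, 6))   # one 6-dimensional feature vector per concept
gold = rng.random(100)     # gold importance scores from the training data

# Build difference vectors for all concept pairs with distinct gold scores;
# the binary label encodes which concept of the pair is more important.
pairs, labels = [], []
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        if gold[i] != gold[j]:
            pairs.append(X[i] - X[j])
            labels.append(1 if gold[i] > gold[j] else -1)

ranker = LinearSVC(C=1.0).fit(np.array(pairs), np.array(labels))

# The learned weight vector scores individual concepts: higher = more important.
scores = X @ ranker.coef_.ravel()
print(ranker.coef_)  # feature weights, analogous to the analysis above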

Concept Map Construction To analyze the effectiveness of our proposed ILP subgraph selection approach, we compare the pipeline against the Heuristic ablation. That comparison is essentially equivalent to the experiment carried out in Section 6.4.2, with the only difference that predicted importance scores are used instead of gold scores.


Dimension        Corpus Baseline  Improved Pipeline

Meaning               0.44              0.56
Grammaticality        0.31              0.69
Focus                 0.44              0.56
Non-Redundancy        0.21              0.79

Table 6.11: Human preference judgments between concept maps generated on Educ.

However, in this end-to-end evaluation, the effectiveness of the ILP approach is less pronounced: while scores on Wiki are higher for our approach, the difference on Educ is only marginal and even slightly favors the heuristic. Since the ILP is guaranteed to find an optimal solution for the optimization problem, these results seem counter-intuitive.

The reason for this observation lies in errors in the preceding importance estimation step:

The optimal subgraph according to the estimated scores is, in this case, not the best with regard to the reference summary concept maps, which explains the slightly higher METEOR scores of the heuristic selection on Educ. Despite this behavior on Educ, the ILP is still the superior approach. Looking into the detailed optimization results, we observed that the heuristic found the optimal subgraph for only 35% of the test topics; in the remaining cases, it selected a subgraph whose objective value is on average 0.63% (Educ) and 1.30% (Wiki) lower. To ensure that the optimal subgraphs selected by the ILP also lead to better results in the final evaluation, one would need to improve the importance scoring model.
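For reference, the following sketch shows the skeleton of such a subgraph-selection ILP in PuLP: binary variables select concepts and relations, the objective sums the predicted concept scores, and a relation may only be kept if both of its concepts are selected. The scores, relations and size limit are invented, and the additional constraints of our full model, such as ensuring a connected map, are omitted for brevity.

import pulp

# Skeleton of a subgraph-selection ILP in PuLP. Concept scores, relations and
# the size limit are invented; the additional constraints of our full model
# (e.g. ensuring a connected, well-formed map) are omitted for brevity.
concept_scores = {"school": 0.9, "teacher": 0.7, "funding": 0.4, "exam": 0.2}
relations = [("school", "teacher"), ("school", "funding"), ("teacher", "exam")]
max_concepts = 3

prob = pulp.LpProblem("subgraph_selection", pulp.LpMaximize)
c = pulp.LpVariable.dicts("concept", list(concept_scores), cat="Binary")
r = pulp.LpVariable.dicts("relation", range(len(relations)), cat="Binary")

# Objective: summed predicted importance of the selected concepts, plus a tiny
# bonus for each kept relation so that admissible relations are included.
prob += (pulp.lpSum(concept_scores[k] * c[k] for k in concept_scores)
         + 0.01 * pulp.lpSum(r.values()))
prob += pulp.lpSum(c.values()) <= max_concepts  # size limit of the map
for idx, (a, b) in enumerate(relations):
    prob += r[idx] <= c[a]  # a relation may only be kept if
    prob += r[idx] <= c[b]  # both of its concepts are selected

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([k for k in concept_scores if c[k].value() == 1])
# ['school', 'teacher', 'funding'] for these scores; 'exam' is dropped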