

regularly used in conjunction with automatic methods. For MDS, one such method is the pyramid method (Nenkova and Passonneau, 2004, Nenkova et al., 2007), which scores summaries after manually matching content units against the reference summary. Other approaches show generated summaries to humans, who are then asked to score them or to express a preference between two alternative summaries. Recent examples applying this evaluation include the work of Celikyilmaz et al. (2018), asking for overall summary quality, and Paulus et al. (2018), asking specifically about readability. The DUC conference (Dang, 2005) developed a catalog of dimensions such as grammaticality, non-redundancy, coherence or responsiveness according to which such manual quality judgments are made.

Finally, as none of the aforementioned methods can determine whether generated summaries are actually useful, evaluations of type (3) assess that. Typically, these are user studies in which subjects perform a task for which summaries are supposed to be helpful.

Different groups of subjects receive summaries produced by different approaches and the subjects’ performance in the task is compared across groups. While only this type of evaluation can show whether automatic methods are ultimately useful in practice, it also has challenges: Compared to other evaluations, type (3) requires much more effort, is not easily reproducible and needs a careful study design to avoid confounds. Examples for MDS are Maña-López et al. (2004), McKeown et al. (2005) and Roussinov and Chen (2001).

3.5.2 Proposed Evaluation Methods for CM-MDS

Given the advantages and limitations of each of the evaluation types, we propose to use a combination of all of them for CM-MDS as well. In fact, examples of all three categories have previously been used to assess concept map mining techniques. Villalon (2012) and Aguiar et al. (2016) use automatic comparisons against references (type 1), Kowata et al. (2010), Qasim et al. (2013) and Zubrinic et al. (2015) let experts judge the quality of concept maps (type 2) and Rajaraman and Tan (2002) and Valerio et al. (2012) evaluate them in task-based user studies (type 3). The main problem limiting the comparability of this existing work is that the exact procedures and data differ from work to work.

We hope to reduce that problem in the future by proposing standardized evaluation procedures for CM-MDS. In the following, we outline these procedures for evaluations of type (1) and (2). When results of these evaluations are published, subsequent work can simply run the same evaluations following the description here to compare their approach.

For evaluations of type (3), comparisons can typically only be made within one experiment, and we therefore do not give specific recommendations here.

3.5.2.1 Automatic Metrics

In Section 2.2.2, we mentioned the study of McClure et al. (1999), which analyzed different manual scoring methods for concept maps in the context of student testing. They compared holistically scoring a map, scoring its structure and scoring each proposition independently, all with a master concept map as the reference. The latter method was found to be most reliable and showed the highest correlation with the true assessment of the student maps (obtained through a more laborious manual analysis). Inspired by these findings, we also designed proposition-level metrics to score concept maps.

Given input documents $D$, a reference summary concept map $G_R = (C_R, R_R)$ and a system-generated summary concept map $G_S = (C_S, R_S)$, we create sets of propositions

$$P_x = \{\, l(\mathit{source}(r_i)) \; l(r_i) \; l(\mathit{target}(r_i)) \mid r_i \in R_x \,\} \qquad (3.3)$$

where $x$ indicates either the reference or system map and the functions $\mathit{source}$ and $\mathit{target}$ assign a relation to its source and target concepts. To score a generated concept map, we then calculate the overlap between the sets $P_S$ and $P_R$.
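To make the construction of Equation 3.3 concrete, the following Python sketch builds such a proposition set from a small concept map. The Concept and Relation classes, their field names and the example labels are illustrative assumptions, not the data structures used in this thesis.

from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    label: str

@dataclass(frozen=True)
class Relation:
    source: Concept   # source(r)
    label: str        # l(r)
    target: Concept   # target(r)

def proposition_set(relations):
    # P_x of Eq. 3.3: concatenate source label, relation label and target label.
    return {f"{r.source.label} {r.label} {r.target.label}" for r in relations}

# Hypothetical example: yields {"coral reefs are threatened by bleaching"}
r = Relation(Concept("coral reefs"), "are threatened by", Concept("bleaching"))
P_S = proposition_set([r])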

For evaluation purposes, we distinguish between two different task settings: one where the size of the summary concept map is restricted by both $\mathcal{L}_C$ and $\mathcal{L}_R$ and a second where only $\mathcal{L}_C$ is given. In the first setting, similar to the traditional MDS setting, the summary size is bounded and most systems will try to fully use that budget. It is therefore sufficient to compute only recall when comparing $P_S$ and $P_R$. In the second setting, while the number of concepts is still constrained, the number of relations and thus the size of $P_S$ can differ.

By reporting both precision and recall, a fair comparison between systems can be made in this setting. In fact, we think that the second setting is more interesting, as it gives models the flexibility to focus on either high precision or high recall.[20]

METEOR Our first metric is based on METEOR (Denkowski and Lavie, 2014), which has the advantage that it takes synonyms and paraphrases into account[21] and does not rely solely on lexical matches (as is the case for ROUGE). For each pair of propositions $p_s \in P_S$ and $p_r \in P_R$ we use METEOR[22] to calculate a match score $\mathit{meteor}(p_s, p_r) \in [0, 1]$. Then, precision and recall per map are computed as:

$$Pr = \frac{1}{|P_S|} \sum_{p \in P_S} \max\{\, \mathit{meteor}(p, p_r) \mid p_r \in P_R \,\} \qquad (3.4)$$

$$Re = \frac{1}{|P_R|} \sum_{p \in P_R} \max\{\, \mathit{meteor}(p, p_s) \mid p_s \in P_S \,\} \qquad (3.5)$$

[20] Note that the size restrictions $\mathcal{L}_C$ and $\mathcal{L}_R$ only limit the size of the graph, but not the size of its content. A possible adversarial attack against the metric would be to create a concept map with very long labels, in the extreme case containing all of $D$. The recall would be 100%. Obviously, such a concept map is not a summary and its labels would not be concise. This illustrates that automatic evaluation metrics typically have limitations; a comprehensive evaluation should therefore also rely on manual inspection. Note also that this attack does not work in the second setting, where the high recall would come with low precision.

[21] As determined using WordNet (Miller et al., 1990) and PPDB (Pavlick et al., 2015).

[22] We use METEOR version 1.5 with default settings.


The F1-score is the equally weighted harmonic mean of precision and recall. Scores per map are macro-averaged over all instances of a dataset.
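As a rough illustration of Equations 3.4 and 3.5 and the derived F1-score, the following sketch computes the per-map scores and the macro-average over a dataset. The function meteor_score is a placeholder for an external call to METEOR 1.5 and is merely assumed to return a value in $[0, 1]$; the function names are our own.

def map_scores(P_S, P_R, meteor_score):
    # Precision (Eq. 3.4): best reference match for each system proposition.
    pr = sum(max(meteor_score(p, p_r) for p_r in P_R) for p in P_S) / len(P_S)
    # Recall (Eq. 3.5): best system match for each reference proposition.
    re = sum(max(meteor_score(p_s, p) for p_s in P_S) for p in P_R) / len(P_R)
    # Equally weighted harmonic mean of precision and recall.
    f1 = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f1

def macro_average(per_map_scores):
    # Macro-average the per-map (pr, re, f1) triples over all dataset instances.
    n = len(per_map_scores)
    return tuple(sum(scores[i] for scores in per_map_scores) / n for i in range(3))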

ROUGE As a second metric, we compute ROUGE (Lin, 2004). We concatenate all propositions of a map into a single string, yielding $s_S$ and $s_R$, and separate the propositions with a dot to ensure that no bigrams span across propositions, which makes the metric independent of how propositions are ordered. We run ROUGE 1.5.5[23] with $s_S$ as the peer summary and $s_R$ as a single model summary to compute the ROUGE-2 score, the most commonly used variant of the ROUGE metric family.
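A minimal sketch of this preprocessing step is given below; the dot-joined strings would then be handed to ROUGE 1.5.5 as peer and model summary with the parameters listed in footnote [23]. The example propositions are invented for illustration.

def concatenate_propositions(propositions):
    # Join propositions with a dot so that no bigram spans two propositions,
    # making the ROUGE-2 score independent of proposition order.
    return " . ".join(sorted(propositions))

s_S = concatenate_propositions({"coral reefs are threatened by bleaching",
                                "bleaching is caused by warming"})
s_R = concatenate_propositions({"warming causes bleaching",
                                "bleaching threatens coral reefs"})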

Significance Testing An important step when using automatic metrics is to determine whether a difference in the average scores that two methods A and B achieve on a dataset is meaningful or only due to chance. The smaller the difference is, the more relevant this question becomes. Statistical tests can be used to determine whether a difference is in fact significant. Following suggestions for other NLP tasks, we propose to rely on sampling-based tests, which, in contrast to commonly used parametric tests like Student’s t-test, do not make any assumptions about the distribution of the evaluation scores or the evaluation data (Dror et al., 2018). Since such assumptions do not hold for many NLP metrics, such as precision, recall and F-scores, using parametric tests is problematic (Yeh, 2000, Dror et al., 2018).

We propose to use a permutation or randomization test (Noreen, 1989), as also suggested for machine translation by Riezler and Maxwell (2005). Let A and B be two alternative methods with evaluation scores $E_A = (a_1, a_2, \dots, a_n)$ and $E_B = (b_1, b_2, \dots, b_n)$ over $n$ instances of an evaluation dataset. The average performance difference is then

$$d(E_A, E_B) = \left|\, \frac{1}{n} \sum_{a_i \in E_A} a_i - \frac{1}{n} \sum_{b_i \in E_B} b_i \,\right| \qquad (3.6)$$

For the randomization test, we create $N$ samples by swapping the evaluation scores of A and B at each position with probability 0.5, yielding $N$ new pairs of score lists such as $E_A' = (b_1, a_2, b_3, \dots)$ and $E_B' = (a_1, b_2, a_3, \dots)$. The p-value is then defined as

$$p = \frac{1 + \#\{\text{samples with } d(E_A', E_B') \geq d(E_A, E_B)\}}{1 + N}. \qquad (3.7)$$

If $p$ is sufficiently small, we reject the null hypothesis that there is no difference between A and B and conclude that A and B differ significantly. The permutation test, in contrast, checks all $2^n$ possible ways of swapping scores between $E_A$ and $E_B$ rather than just drawing $N$ samples out of them. While being more accurate, it can be prohibitively expensive to compute for large $n$, such that in practice one mostly relies on the approximate randomization test.

[23] We run ROUGE with parameters -n 2 -x -m -c 95 -r 1000 -f A -p 0.5 -t 0 -d -a


In this thesis, we make use of the permutation test when possible and otherwise rely on the randomization test. We use thresholds of 0.01 and 0.05 to determine significance, but also report the actual $p$-values for greater transparency.
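The following Python sketch implements the approximate randomization test of Equations 3.6 and 3.7 under the assumption that scores_a and scores_b hold the paired per-instance scores of methods A and B; the function names, the default sample count and the fixed seed are our own choices, not part of the thesis setup.

import random

def mean_difference(scores_a, scores_b):
    # d(E_A, E_B) of Eq. 3.6.
    n = len(scores_a)
    return abs(sum(scores_a) / n - sum(scores_b) / n)

def approximate_randomization_test(scores_a, scores_b, num_samples=10000, seed=0):
    rng = random.Random(seed)
    observed = mean_difference(scores_a, scores_b)
    exceeding = 0
    for _ in range(num_samples):
        swapped_a, swapped_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Swap the two scores at this position with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        if mean_difference(swapped_a, swapped_b) >= observed:
            exceeding += 1
    # p-value of Eq. 3.7.
    return (1 + exceeding) / (1 + num_samples)

The exact permutation test would instead enumerate all $2^n$ swap patterns, which is only feasible for small $n$.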

3.5.2.2 Manual Quality Dimensions

For manual evaluations of type (2), we experimented with several setups and recommend using pairwise comparisons of different summary concept maps rather than assessing their quality in isolation. Most recent works on MDS, for instance Celikyilmaz et al. (2018), use this pairwise approach, as it makes it easier to discover differences between summaries.

A difficulty that arises is the layout of the concept maps. Different ways of presenting the concept map graph can greatly influence a rater’s perception of the map’s quality, and since graph layout is a non-trivial problem, it is difficult to automatically create layouts for different maps that are “equally good”. As the goal is to evaluate the output of CM-MDS approaches, we do not want the layout quality to influence the evaluation. As a solution, we propose not to present the concept maps in graphical form at all, but instead as lists of propositions.

Such lists can be easily shown side by side and allow a fair comparison of the content. In addition, when showing propositions, even raters unfamiliar with concept maps can easily perform the evaluation, as a proposition is essentially just a short sentence.

Given two summary concept maps, we show their propositions side by side and ask raters to pick their preferred list according to the following dimensions:

Meaning The sentences should be understandable and express meaningful facts.

Grammaticality The sentences should be grammatical, without missing or unrelated fragments or other grammar-related problems that make them difficult to understand.

Non-Redundancy There should be no unnecessary repetition within the list, neither of whole sentences nor of partial facts.

Focus The sentences should be focused; sentences should only contain information that is related to the given topic description.

The dimensions are inspired by the (more fine-grained) list used during the DUC competitions (Dang, 2005). The first two dimensions assess the quality of the extraction subtasks by checking whether concept and relation labels form meaningful and grammatical propositions. The third dimension focuses on non-redundancy, assessing whether the grouping subtasks were handled successfully. Finally, the fourth dimension evaluates the summarization aspect, i.e. which content has been selected for the summary concept map.

We note that the automatic evaluation described before mainly focuses on whether a summary contains the same content as a reference summary, whereas the procedure described here captures more aspects of summary quality. However, the evaluation of content selection is more limited in this setup: Since raters have no access to the full source documents or a reference summary, but see only a topic description, they can only assess how much of a map’s content is relevant, but do not know whether there is other, more relevant content. Nevertheless, we believe that if both evaluation techniques are used in combination, such shortcomings are compensated for, as the two techniques complement each other very well.