
Chapter 4. Creation of Benchmark Corpora

2002 (Al-Sabahi et al., 2018), at 19% on DUC 2002 and 10% on DUC 2003 for MDS (Peyrard and Eckle-Kohler, 2016) and, measured in ROUGE-2 F1-score, at 19% on CNN/DailyMail and 31% on the New York Times corpus (Celikyilmaz et al., 2018), which are both also SDS corpora. A comparison across datasets, as already illustrated by the performance differences between the different traditional summarization corpora, is not very meaningful. A comparison of our baseline results against these state-of-the-art performances is therefore of only very limited value, as not just the data, but also the task and evaluation protocols differ. Nevertheless, it at least indicates that the task represented by our dataset is neither trivially simple nor extremely hard, but in the range of existing summarization work.
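The ROUGE-2 F1 figures cited above measure bigram overlap between a system summary and a reference. A minimal sketch of the metric, omitting the stemming, stopword handling, and multi-reference aggregation of the official ROUGE toolkit:

```python
from collections import Counter

def rouge_2_f1(candidate, reference):
    """Simplified ROUGE-2 F1: bigram overlap between a candidate
    summary and a single reference, both given as whitespace-tokenized
    strings. The official ROUGE toolkit additionally stems and can
    aggregate over multiple references."""
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))
    cand = bigrams(candidate.split())
    ref = bigrams(reference.split())
    overlap = sum((cand & ref).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

An identical candidate and reference yield 1.0; summaries without any shared bigram yield 0.0.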

A detailed analysis of the single pipeline steps revealed that there is room for improvement in all of them. First, we observed that only around 76% of the gold concepts are covered by the mentions extracted in step 1 and grouped into concepts in step 2, showing that better extraction methods are needed. For relations, the recall is considerably lower.

After estimating the importance of concepts (step 5), the top 25 concepts contain only 17% of the gold concepts. Hence, content selection is a major challenge, stemming from the large cluster sizes in the corpus. The propagation of errors along the pipeline further contributes to low performance. Overall, the baseline experiments confirm that the task is complex and cannot be solved with simple techniques.
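The 17% coverage figure corresponds to a recall-at-k measure over the ranked concepts. A minimal sketch of such a measure, assuming concepts have already been normalized to comparable strings (the matching procedure used in the actual evaluation may be more lenient):

```python
def concept_recall_at_k(ranked_concepts, gold_concepts, k=25):
    """Fraction of gold concepts that appear among the top-k concepts
    after importance ranking (step 5 of the pipeline). Assumes both
    inputs contain concepts normalized to comparable strings; this is
    an illustrative simplification, not the thesis's exact matcher."""
    top_k = set(ranked_concepts[:k])
    gold = set(gold_concepts)
    return len(top_k & gold) / len(gold)
```

With large clusters, the candidate pool is far bigger than k, which is why content selection dominates the error budget here.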

4.4. Chapter Summary

Dataset   Pairs   Concept Map                                      Source
                  Concepts    Tokens    Relations   Tokens         Documents   Tokens
Educ        30    25.0±0.0    3.2±0.5   25.2±1.3    3.2±0.5        40.5±6.8    97880.0
Biology    183     6.9±4.0    1.2±0.4    3.5±3.0    1.9±1.2         1.0±0.0     2620.9
Wiki        38    11.3±5.2    1.9±0.4   13.8±8.4    5.0±1.2        14.6±3.1    27065.6
ACL        255    10.9±5.5    1.9±0.9          –          –         1.0±0.0     4987.5

Table 4.7: Corpus statistics for all benchmark corpora used in the thesis. Values for Biology, Wiki and ACL are the same as in Table 4.2 and are repeated for easy comparison to Educ.

As the second direction, we developed a new corpus creation method that effectively combines automatic preprocessing, scalable crowdsourcing and high-quality expert annotations. Its crucial component is a novel crowdsourcing scheme called low-context importance annotation. In contrast to traditional approaches, it allows us to determine important elements in a large document set without requiring annotators to read all documents, making it feasible to crowdsource the task and overcome quality issues observed in previous crowdworking attempts. We showed that the approach creates reliable data for our summarization scenario and, when tested on traditional summarization corpora, creates annotations that are similar to those obtained by earlier data collection efforts. Using this new corpus creation method, we can avoid the high effort for annotators, which allows us to scale to document sets that are 15 times larger than in traditional summarization corpora.

We created a new corpus, Educ, with 30 topics, each with around 40 source documents on educational topics and a summarizing concept map that is the consensus of many workers.

Table 4.7 summarizes the corpora created in this chapter. All of them will be used throughout the remainder of this thesis. While Educ, offering the highest-quality annotations, will be the main evaluation corpus, we will use the automatically created Biology and ACL corpora for concept and relation extraction experiments in Section 5.2 and the Wiki corpus as a second dataset for the task-level experiments carried out in Chapter 6.


Chapter 5

Concept and Relation Extraction

In this chapter, we will focus on the CM-MDS subtasks of concept and relation mention extraction. Using the datasets introduced in the previous chapter, we will present a series of experiments that, for the first time, directly compare different extraction approaches proposed in previous work. Moreover, we will introduce the idea of using predicate-argument analysis for concept and relation mention extraction and include such methods in the experimental comparison. Finally, as most work on concept maps in the past has focused on the English language, we will dedicate the second part of the chapter to studying how such extraction methods can be ported to other languages.

5.1 Motivation and Challenges

In Section 3.3.1 and Section 3.3.3, we outlined the challenges of concept and relation mention extraction: the variety of expressions that can be used to mention concepts or relations in text, the need to find a good trade-off between precision and recall when trying to cover that variety and the desire to design extraction methods that work well across different types of text. A range of extraction techniques has been proposed for this subtask, which we reviewed in Section 2.3.1.1 and Section 2.3.1.2. They all use a syntactic representation of the input text — in the form of part-of-speech tags, constituency parse trees or dependency parses — and extract sequences of tokens that follow certain patterns. Corresponding sets of patterns have been hand-designed to cover the targeted mentions.
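Such hand-designed syntactic patterns can be made concrete with a toy extractor. The sketch below matches the common (adjective)* (noun)+ pattern over part-of-speech tags; it is a deliberate simplification of the richer pattern sets used in prior concept-map-mining work:

```python
def extract_np_mentions(tagged_tokens):
    """Pattern-based concept mention extraction over POS tags:
    greedily matches (adjective)* (noun)+, a typical hand-designed
    pattern in prior work. Input: list of (token, Penn Treebank tag)
    pairs, e.g. from any off-the-shelf POS tagger."""
    mentions, i = [], 0
    while i < len(tagged_tokens):
        j = i
        while j < len(tagged_tokens) and tagged_tokens[j][1].startswith("JJ"):
            j += 1  # consume optional adjectives
        k = j
        while k < len(tagged_tokens) and tagged_tokens[k][1].startswith("NN"):
            k += 1  # consume one or more nouns
        if k > j:  # pattern matched: at least one noun head
            mentions.append(" ".join(tok for tok, _ in tagged_tokens[i:k]))
            i = k
        else:
            i += 1
    return mentions
```

Note that on the running example the pattern yields "Herbal supplements", "symptoms" and "ADHD" as separate mentions: it cannot join "the symptoms of ADHD" across the preposition, which illustrates how each additional construction requires additional patterns.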

We also mentioned before that a weakness of previous work is the lack of comparative experiments, leaving it unclear which of the proposed extraction methods perform best. As the first contribution of this chapter, we therefore carry out such a comparison by reimplementing representative approaches from previous work and evaluating them on the new corpora introduced in the previous chapter. This will provide valuable insights into the performance of the different approaches and open issues that still have to be addressed.


As a second contribution, we propose to design mention extraction methods based on predicate-argument structures instead of dependency representations. To illustrate this idea, consider the following example and its syntax as given in a dependency tree:

Herbal supplements can reduce the symptoms of ADHD .

[Dependency tree: root(reduce); nsubj(reduce, supplements); amod(supplements, Herbal); aux(reduce, can); dobj(reduce, symptoms); det(symptoms, the); prep(symptoms, of); pobj(of, ADHD); punct(reduce, .)]

To extract the concept mentions "Herbal supplements" and "the symptoms of ADHD", patterns extracting nsubj- and dobj-dependencies can be used to find the relevant spans of tokens in the dependency representation. However, these patterns cannot extract anything from the passive variant of the sentence because the tokens now have other grammatical functions:

The symptoms of ADHD can be reduced by herbal supplements .

[Dependency tree: root(reduced); nsubjpass(reduced, symptoms); det(symptoms, The); prep(symptoms, of); pobj(of, ADHD); aux(reduced, can); auxpass(reduced, be); prep(reduced, by); pobj(by, supplements); amod(supplements, herbal); punct(reduced, .)]

Additional patterns would be necessary to identify the same concept mentions in the pas-sive sentence. Due to the variety of natural language, such pattern-based approaches are either limited in coverage or require a very large and carefully designed set of patterns to cover every possible way in which propositions of concepts and relations can be expressed.
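The active/passive contrast can be demonstrated with a small sketch. The dependency labels are hard-coded here for illustration; in practice they would come from a parser such as spaCy or the Stanford parser:

```python
def extract_by_dep_pattern(tokens, deps):
    """Extract concept candidates via nsubj-/dobj-dependency patterns.
    `deps` maps a dependent token index to its dependency label, as a
    parser would provide; only the labels matter for this sketch."""
    return [tokens[i] for i, label in deps.items() if label in ("nsubj", "dobj")]

# Active voice: both concept heads are found via nsubj and dobj.
active = ["Herbal", "supplements", "can", "reduce", "the", "symptoms", "of", "ADHD"]
active_deps = {1: "nsubj", 5: "dobj"}  # heads of the two noun phrases

# Passive voice: the same content words now carry nsubjpass/pobj
# labels, so the nsubj/dobj patterns match nothing at all.
passive = ["The", "symptoms", "of", "ADHD", "can", "be", "reduced", "by",
           "herbal", "supplements"]
passive_deps = {1: "nsubjpass", 9: "pobj"}
```

Running the extractor on the active parse returns both concept heads, while the passive parse yields an empty result, exactly the coverage gap described above.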

To eliminate the high effort associated with the pattern definition, we propose to utilize predicate-argument structures instead of more fine-grained representations such as dependency trees because the former already abstract away from many syntactic variations. Continuing the example, a representation that simply marks "reduce" as a predicate and "herbal supplements" and "the symptoms of ADHD" as its arguments would be desirable. Being independent of a specific realization, it would be an appropriate representation of both the active and passive syntactic variant of the example, requiring no separate handling of the cases.

Using such a unified representation based on predicates and arguments, mention extraction approaches no longer need to carefully define large sets of patterns, but can instead make use of existing predicate-argument analysis tools.
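A minimal sketch of such a unified representation follows. The Proposition class and its field names are illustrative, not the output format of any particular semantic role labeler:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposition:
    """Voice-independent predicate-argument structure, a simplified
    stand-in for the output of a semantic role labeler."""
    predicate: str
    arg0: str  # agent-like argument
    arg1: str  # patient-like argument

def to_concept_relation(p):
    """Derive a concept-map triple: one rule covers all realizations."""
    return (p.arg0, p.predicate, p.arg1)

# Both surface variants of the example map to the same structure,
# so no separate handling of active and passive voice is needed.
active_prop = Proposition("reduce", "herbal supplements", "the symptoms of ADHD")
passive_prop = Proposition("reduce", "herbal supplements", "the symptoms of ADHD")
```

The single rule in to_concept_relation replaces the growing inventory of voice- and construction-specific dependency patterns.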

Finally, as the third contribution of this chapter, we address the language dependency of concept and relation extraction methods. Most previous work on concept map mining focused on text in English and designed extraction patterns specifically for the syntax of English. Unfortunately, the predicate-argument analysis tools we propose as an alternative are also mainly focused on English. To gain more insight into how difficult and costly it is to also create predicate-argument analysis tools for other languages, we present a case study of porting an existing system to German. We discuss the different challenges of such