
Chapter 4. Creation of Benchmark Corpora

Figure 4.1: Summary concept map from Biology on the topic “atom”. [Figure not reproduced: a concept map with concepts such as atom, molecule, element, matter, isotope, neutron, electron, proton, ion, and atomic number, linked by relations such as “is composed of”, “belongs to”, “constituent unit of”, “is called”, and “has”.]

Figure 4.2: Summary concept map from Wiki on the “British contribution to the Manhattan Project”. [Figure not reproduced: a concept map with concepts such as Manhattan Project, Tube Alloys, MAUD report, Quebec agreement, Winston Churchill, Franklin Roosevelt, Rudolf Peierls, Otto Frisch, Klaus Fuchs, University of Birmingham, atomic bomb, uranium-235, critical mass, Soviet, and World War II, linked by clause-like relations such as “assessed the chances of developing”, “fled to England to work on”, and “emphatically concluded that”.]

4.2. Automatic Corpus Creation

4.2.2 Using Existing Concept Annotations

As the starting point for Wiki, we use a recently published MDS corpus created by Zopf et al. (2016b). They took the introductory sections of featured Wikipedia articles, which tend to be good summaries of the topic due to the Wikipedia guidelines for featured articles, and matched them with web pages that described different aspects covered by the summary in detail. Their corpus consists of 91 pairs of documents 𝐷 and a textual summary 𝑆.

For our corpus, we make use of the fact that these summaries 𝑆, being Wikipedia articles, contain many links to other Wikipedia pages. For each 𝑆, we create a set of concepts 𝐶 by collecting the names of the linked articles and the main article itself. Since they are linked in the summary, they tend to be important concepts for the topic. Further, we run an existing OIE system26 over the source documents 𝐷 to extract binary propositions. To construct a summary concept map 𝐺 for a document set 𝐷, we then iterate over all extractions and identify those that mention a concept out of 𝐶 in both of their arguments. We apply a range of rules to filter out spurious matches, e.g. concept mentions that are just a small part of a very long argument or extractions containing unresolved pronouns. If an extraction passes all tests, it is added to a set 𝑅, forming the concept map 𝐺 = (𝐶, 𝑅).
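The matching step can be sketched as follows. This is a simplified illustration, not the exact pipeline: `extractions` is assumed to be a list of (argument 1, relation, argument 2) triples from the OIE system, and the two filtering rules shown (a length-ratio threshold against concepts that are only a small part of a long argument, and a pronoun check) stand in for the fuller rule set mentioned above; all function names are ours.

```python
PRONOUNS = {"it", "they", "he", "she", "this", "these"}


def match_concept(argument, concepts, max_ratio=3.0):
    """Return a concept from C mentioned in the argument, or None.

    Simplified rule: reject matches where the concept is only a
    small part of a very long argument (length-ratio threshold).
    """
    arg_lower = argument.lower()
    for concept in concepts:
        if concept.lower() in arg_lower and len(argument) <= max_ratio * len(concept):
            return concept
    return None


def build_concept_map(extractions, concepts):
    """Collect the relation set R from extractions whose both
    arguments mention a concept in C, forming G = (C, R)."""
    relations = set()
    for arg1, rel, arg2 in extractions:
        # Simplified rule: drop extractions with unresolved pronouns.
        if (set(arg1.lower().split()) | set(arg2.lower().split())) & PRONOUNS:
            continue
        c1 = match_concept(arg1, concepts)
        c2 = match_concept(arg2, concepts)
        if c1 and c2 and c1 != c2:
            relations.add((c1, rel, c2))
    return concepts, relations
```

For example, the triple ("the atom", "is composed of", "electrons") would yield the relation (atom, is composed of, electron) if both "atom" and "electron" are in 𝐶, while ("it", "has", "a nucleus") would be discarded for its unresolved pronoun.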

Since the map 𝐺 has been created based on the set 𝐶 derived from 𝑆, it covers content similar to the summary and is thus an adequate summary for 𝐷. To ensure that it is also connected, as required by Definition 5, we reduce the obtained graph to its largest connected component. Finally, we remove all pairs where the resulting concept map has fewer than 7 concepts. After these steps, Wiki consists of 38 pairs of documents and a summary concept map. One of them is shown in Figure 4.2.
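The connectivity reduction amounts to a standard largest-connected-component computation, treating relation edges as undirected. A minimal sketch (function and variable names are ours):

```python
from collections import defaultdict, deque


def largest_connected_component(concepts, relations):
    """Reduce a concept map (C, R) to its largest connected component.

    Relations are (concept1, label, concept2) triples; edges are
    treated as undirected for the purpose of connectivity.
    """
    neighbors = defaultdict(set)
    for c1, _, c2 in relations:
        neighbors[c1].add(c2)
        neighbors[c2].add(c1)

    seen, best = set(), set()
    for start in concepts:
        if start in seen:
            continue
        # Breadth-first search to collect one component.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node] - component:
                component.add(nxt)
                queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component

    kept = {r for r in relations if r[0] in best and r[2] in best}
    return best, kept
```

A pair would then be discarded if the returned component has fewer than 7 concepts.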

For the third dataset, ACL, we use the ACL RD-TEC 2.0 corpus (QasemiZadeh and Schumann, 2016). It consists of 300 abstracts taken from papers in the ACL Anthology in which two annotators marked concepts. As abstracts are good summaries of a paper, these concepts tend to be the central concepts discussed in the papers. We use Apache Tika27 to extract the full texts, excluding the abstracts, from the PDF version of the corresponding papers. We filter out papers where the extraction fails. These texts are then paired with the annotated concepts as the gold concepts. We obtain 255 pairs. Note that we cannot use this corpus to evaluate relation extraction, as such annotations are not available.
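A sketch of this preprocessing step, under assumptions: the `tika` Python bindings (tika-python, whose `parser.from_file` returns a dict with a `content` field) are one possible way to call Apache Tika, and the exact-match abstract removal shown is a simplification; the real pipeline may need fuzzier matching against PDF extraction noise. Function names are ours.

```python
def extract_body(pdf_path, abstract):
    """Extract full text from a PDF, excluding the known abstract.

    Returns None if extraction fails, so the paper can be filtered out.
    Assumes the tika-python package (and a Java runtime) is available.
    """
    from tika import parser  # third-party binding for Apache Tika

    parsed = parser.from_file(pdf_path)
    text = parsed.get("content")
    if not text or not text.strip():
        return None  # extraction failed; drop this paper
    return strip_abstract(text, abstract)


def strip_abstract(text, abstract):
    """Remove the first occurrence of the abstract from the text.

    Simplified exact-match removal for illustration only.
    """
    return text.replace(abstract, "", 1)
```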

4.2.3 Comparison and Limitations

All three datasets, compared in Table 4.2, could be created mostly automatically with minimal manual effort, circumventing the challenges of manual annotation that were discussed at the beginning of this chapter. And, compared to most of the existing datasets presented in

26OpenIE4 (Mausam, 2016), a state-of-the-art system according to Stanovsky and Dagan (2016b).

27https://tika.apache.org/


                        Concept Map                              Source
Dataset   Pairs   Concepts   Tokens    Relations   Tokens    Documents   Tokens

Biology   183     6.9±4.0    1.2±0.4   3.5±3.0     1.9±1.2   1.0±0.0     2620.9
Wiki      38      11.3±5.2   1.9±0.4   13.8±8.4    5.0±1.2   14.6±3.1    27065.6
ACL       255     10.9±5.5   1.9±0.9   –           –         1.0±0.0     4987.5

Table 4.2: Corpus statistics for automatically created benchmark corpora. All values are averages over pairs with their standard deviation indicated by±. ACL does not contain relations.28

Table 4.1, all three are substantially bigger. But Table 4.2 also reveals that the datasets do not yet satisfy all requirements.

Both Biology and ACL only provide summaries for single documents, whereas we want to have summaries for document sets. In addition, ACL does not have real concept maps, but only concepts, and can thus only be used to evaluate concept mention extraction, not the full task. Biology, while having relations, provides only very small concept maps, with especially few relations. Given the average number of relations (3.5) and concepts (6.9) per map, one can also easily see that the graphs are disconnected: a connected graph over 𝑛 concepts requires at least 𝑛 − 1 relations, i.e. on average at least 5.9 rather than 3.5.

Wiki comes closest to our requirements because it provides multi-document summaries in the form of bigger, connected concept maps. Its main weakness is the relations, which have been obtained fully automatically. Since no annotator was involved, we have no guarantee that they express relationships in the same way a human would. The greater length of their labels compared to Biology (5.0 vs. 1.9 tokens) indicates that they follow a different style. The example in Figure 4.2 also reveals that some relations are rather complex clauses.

During evaluations, this dataset might also unfairly favor CM-MDS approaches that use similar OIE-based techniques for relation extraction.

In light of these limitations, we explore other techniques to create datasets of higher quality with reasonable effort in the next part of this chapter. That being said, we want to emphasize that the automatically created datasets can still be of use in experiments where their limitations are less relevant or when they are taken into account in interpreting quantitative results. In this thesis, we will use the Biology and ACL datasets to evaluate concept and relation extraction approaches in Section 5.2 and the Wiki dataset as a second corpus to evaluate pipelines for the full task in Chapter 6.

28The values reported here differ slightly from those in Falke and Gurevych (2017c, Table 1), where statistics for the test split rather than the whole dataset are shown.