
5.3 Predicate-Argument Analysis for German


Category Size Description

Directly applicable 38% Only replacement of POS tags and dependency types with their German equivalents; a one-to-one mapping exists.

Small changes 35% No one-to-one mapping exists, but alternative conditions, e.g. using POS tags instead of dependencies or vice versa, can easily be found.

Substantial changes 27% Rules cannot be used; complex alternative rules or resources have to be created instead.

Table 5.4: Analysis of the portability of PropS rules from English to German.

Following that analysis, we implemented a German version of PropS, named PropsDE. It uses Mate Tools for POS tagging, lemmatization and parsing (Bohnet et al., 2013). Dependencies are collapsed and propagated with JoBimText (Ruppert et al., 2015). The rule set covers 89% of the English rules, lacking only the handling of raising-to-subject verbs and more advanced strategies for coordination constructions and tense detection. Like PropS, the implementation provides both PropS graphs and OIE extractions as output. For the latter, as in other OIE systems, PropsDE assigns confidence scores to the extracted tuples.

The confidence score is predicted by a logistic regression model trained on 410 extractions annotated for correctness. Model features are the length of the input sentence, the length of the extraction, its number of arguments, whether it contains punctuation, and which dependency types and PropS edge labels are present in the corresponding part of the graph.
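To make the confidence scoring concrete, the following is a minimal sketch of such a model with scikit-learn, assuming the listed features are computed per extraction; the feature encoding and the helper extraction_features are illustrative assumptions, not the actual PropsDE implementation.

```python
# Minimal sketch of an extraction-confidence model as described above:
# logistic regression over simple surface and graph features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def extraction_features(sentence_tokens, predicate, arguments, graph_labels):
    feats = {
        "sentence_length": len(sentence_tokens),
        "extraction_length": len(predicate.split()) + sum(len(a.split()) for a in arguments),
        "num_arguments": len(arguments),
        "contains_punctuation": any(ch in ".,;:!?" for part in [predicate, *arguments] for ch in part),
    }
    # one indicator feature per dependency type / PropS edge label in the subgraph
    for label in graph_labels:
        feats["label=" + label] = 1.0
    return feats

# sparse one-hot encoding of the feature dicts, followed by logistic regression
confidence_model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))

# training would use feature dicts and 0/1 correctness labels for the 410 annotated extractions:
# confidence_model.fit(train_feature_dicts, correctness_labels)
# confidence = confidence_model.predict_proba([extraction_features(...)])[0, 1]
```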

Based on correspondence with the authors of the English system and their estimate of the effort they put into building it, we conclude that implementing the German version, including both the conceptual analysis and the technical implementation, took roughly 10% of the effort they reported. This shows that manually porting a rule-based system to a new language is a viable way to overcome the lack of tools for a specific language with reasonable effort and in a short amount of time. However, the analysis also revealed a range of challenges, in particular the rules requiring substantial changes, that make fully automatic porting difficult. For target languages that differ more from English than German does, these cases are presumably even more prevalent.

5.3.3 Experiments

To evaluate PropsDE, we make use of the fact that PropS (and PropsDE) can be used as OIE systems and follow the standard evaluation protocol for such systems.

Experimental Setup We manually label extractions made by PropsDE to assess its performance. For this purpose, we created a new dataset consisting of 300 German sentences, randomly sampled from three sources of different genres: news articles from the TIGER treebank (Brants et al., 2004), German web pages from CommonCrawl (Habernal et al., 2016b) and featured Wikipedia articles. For the treebank part, we use the text with both gold and parsed dependencies to analyze the impact that parsing errors have on PropsDE.
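As a rough sketch of how such an evaluation corpus can be assembled (100 randomly sampled sentences per genre), under the assumption that the genre-specific sentence lists have already been extracted; all variable names here are illustrative:

```python
# Sketch of the corpus construction: 100 randomly sampled sentences per genre.
# The three input lists are hypothetical placeholders for sentences taken from
# TIGER, CommonCrawl web pages and featured Wikipedia articles.
import random

def build_eval_corpus(tiger_sents, web_sents, wiki_sents, per_genre=100, seed=0):
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return {
        "news": rng.sample(tiger_sents, per_genre),
        "web": rng.sample(web_sents, per_genre),
        "wiki": rng.sample(wiki_sents, per_genre),
    }
```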

Every tuple extracted from the set of 300 sentences was labeled independently by two annotators as correct or incorrect. In line with previous work, they were instructed to label an extraction as incorrect if it has a wrong predicate or argument, including overspecified and incomplete arguments, or if it is well-formed but not entailed by the sentence.

Unresolved co-references were not marked as incorrect. We observed an inter-annotator agreement of 85% (𝜅 = 0.63). For the evaluation, we merged the labels, considering an extraction as correct only if both annotators labeled it as such. Results are measured in terms of precision, the fraction of correct extractions, and yield, the total number of extractions.44 The latter is commonly used as an indicator of recall, which cannot be determined directly in this kind of evaluation. We also plot precision-yield curves obtained by gradually decreasing a threshold for the extraction confidence. The confidence prediction model was trained on a separate development set.
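The following sketch shows how these evaluation quantities can be computed: Cohen's kappa for inter-annotator agreement, the merged correctness labels, and a precision-yield curve obtained by sweeping the confidence threshold; the toy labels at the bottom are purely illustrative.

```python
# Sketch of the evaluation: agreement, merged labels, precision-yield curve.
from sklearn.metrics import cohen_kappa_score

def merge_labels(annotator1, annotator2):
    # an extraction counts as correct only if both annotators marked it correct
    return [a and b for a, b in zip(annotator1, annotator2)]

def precision_yield_curve(confidences, correct):
    # sort extractions by decreasing confidence and lower the threshold step by step
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    points, n_correct = [], 0
    for i, (_, is_correct) in enumerate(ranked, start=1):
        n_correct += is_correct
        points.append((i, n_correct / i))  # (yield, precision) after accepting i extractions
    return points

# illustrative toy labels for four extractions
a1 = [True, True, False, True]
a2 = [True, False, False, True]
merged = merge_labels(a1, a2)
print("kappa:", cohen_kappa_score(a1, a2))
print("curve:", precision_yield_curve([0.9, 0.7, 0.6, 0.4], merged))
```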

Results From the whole corpus of 300 sentences, PropsDE extracted 487 tuples, on average 1.6 per sentence with 2.9 arguments. 60% of them were labeled as correct. Table 5.5 shows that most extractions are made from Wikipedia articles, whereas the highest precision is observed for newswire text. In line with our expectations, web pages are the most challenging genre, presumably due to noisier language. These differences between the genres can also be seen in the precision-yield curve (Figure 5.3).

For English, state-of-the-art systems show a similar performance. In a direct comparison of several systems carried out by Del Corro and Gemulla (2013), they observed precision scores of 58% (Fader et al., 2011, ReVerb), 57% (Del Corro and Gemulla, 2013, ClausIE), 43% (Wu and Weld, 2010, WOE) and 43% (Mausam et al., 2012, OLLIE) on datasets of similar genres. The reported yield per sentence is higher for ClausIE (4.2), OLLIE (2.6) and WOE (2.1), but smaller for ReVerb (1.4). However, we note that in this evaluation, all systems were configured to output two-argument tuples. For example, from a sentence such as

44Note that the yield metric here, commonly used for OIE systems, measures the number of extractions per sentence, while the concept and relation yield metrics used in Section 5.2.2 measure extractions per reference concept or relation.


Genre   Sentences   Length   Yield   Precision
News*         100     19.3    1.42       78.87
News          100     19.3    1.44       70.83
Wiki          100     21.4    1.78       61.80
Web           100     19.2    1.65       49.09
Total         300     20.0    1.62       60.16

Table 5.5: Tuple extraction performance of PropsDE by text genre. News* indicates extraction from gold parses, not included in Total. Precision in percentages, length in tokens per sentence.

The principal opposition parties boycotted the polls after accusations of vote-rigging.

OLLIE can either make two binary extractions

(1) (the principal opposition parties - boycotted - the polls)

(2) (the principal opposition parties - boycotted the polls after - accusations of vote-rigging)

or just a single extraction with three arguments. PropS always extracts the combined tuple

(1) (the principal opposition parties - boycotted - the polls - after accusations of vote-rigging)

which is in line with the default configuration of more recent OIE systems.

For the sake of comparability, we conjecture that the yield of our system would increase if we broke down higher-arity tuples in a similar fashion: assuming that every extraction with 𝑛 arguments, 𝑛 > 2, can be split into 𝑛 − 1 separate extractions, our system’s yield would increase from 1.6 to 3.0. That is in line with the numbers reported above for the binary configuration for English. Overall, this indicates a reasonable performance of our straightforward porting of PropS to German.
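A small sketch of the assumption behind this estimate: an extraction with n arguments (n > 2) is broken down into n − 1 binary tuples sharing the first argument, so the adjusted yield can be recomputed from the number of arguments per extraction. The splitting heuristic shown here is just one plausible realization, not the exact procedure used by any particular OIE system.

```python
# Sketch: split an n-ary extraction (predicate + n arguments, n > 2) into n-1
# binary extractions that share the first argument, and recompute the yield.
def split_into_binary(predicate, arguments):
    if len(arguments) <= 2:
        return [(predicate, arguments)]
    head, first, rest = arguments[0], arguments[1], arguments[2:]
    binary = [(predicate, [head, first])]
    for extra in rest:
        # fold the core argument into the relation phrase, keep the extra argument separate
        binary.append((predicate + " " + first, [head, extra]))
    return binary

example = ("boycotted", ["the principal opposition parties", "the polls",
                         "after accusations of vote-rigging"])
for pred, args in split_into_binary(*example):
    print(pred, "|", args)

def adjusted_yield(args_per_extraction, num_sentences):
    # each extraction with n > 2 arguments contributes n - 1 binary tuples
    return sum(max(n - 1, 1) for n in args_per_extraction) / num_sentences

# with 1.6 extractions per sentence and 2.9 arguments on average:
# 1.6 * (2.9 - 1) ≈ 3.0 binary tuples per sentence
```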

Extractions were most frequently labeled as incorrect due to false relation labels (32%), overspecified arguments (21%) and wrong word order within arguments (19%). Analyzing our system’s performance on the treebank, we can see that using gold dependencies increases precision by 8 percentage points, which makes parsing errors responsible for about 28% of the incorrect extractions. Since the Mate Tools parser is trained on the full TIGER treebank, including the news part of our experimental data, its error contribution on unseen data might be even higher.
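As a quick back-of-the-envelope check of that share, a sketch using the news precision figures from Table 5.5 (gold vs. predicted parses):

```python
# Rough check of the parsing-error contribution, using the news-genre precision
# from Table 5.5: News* (gold parses) vs. News (predicted parses).
precision_gold = 0.7887
precision_parsed = 0.7083

incorrect_with_parser = 1 - precision_parsed            # about 29% incorrect with predicted parses
recovered_by_gold = precision_gold - precision_parsed   # about 8 percentage points

share = recovered_by_gold / incorrect_with_parser
print(f"parsing errors account for ~{share:.0%} of incorrect extractions")  # ~28%
```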