
5.3 Predicate-Argument Analysis for German

5.3.2 Porting Rules to German

In our analysis, we assess for each rule of the converter that transforms a dependency graph to the PropS graph whether and how it can be used on German text. A rule transforms a part of the graph if it fulfills conditions referring to dependency types, part-of-speech (POS) tags and lemmas. The following are two simple rules used in the English version of PropS:

if graph has edge det(X, Y) then delete Y

if graph has edge nn(X, Y) then merge X and Y
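The two rules above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the PropS implementation: the `Graph` class, its representation of edges as (label, governor, dependent) triples and both function names are hypothetical, and the merged node name is built naively by placing the modifier before its head.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy dependency graph (hypothetical): nodes plus (label, governor, dependent) edges."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_edge(self, label, gov, dep):
        self.nodes.update({gov, dep})
        self.edges.add((label, gov, dep))


def delete_determiners(graph):
    """if graph has edge det(X, Y) then delete Y"""
    for label, _gov, dep in list(graph.edges):
        if label == "det":
            graph.nodes.discard(dep)
            # Drop every edge that touches the deleted node.
            graph.edges = {e for e in graph.edges if dep not in (e[1], e[2])}


def merge_noun_compounds(graph):
    """if graph has edge nn(X, Y) then merge X and Y"""
    while True:
        # Pick nn edges one at a time, because merging renames nodes.
        nn_edge = next((e for e in sorted(graph.edges) if e[0] == "nn"), None)
        if nn_edge is None:
            return
        _, gov, dep = nn_edge
        merged = f"{dep} {gov}"  # assumption: the modifier precedes its head

        def rename(node):
            return merged if node in (gov, dep) else node

        graph.edges.discard(nn_edge)
        graph.edges = {(l, rename(g), rename(d)) for l, g, d in graph.edges}
        graph.nodes = {rename(n) for n in graph.nodes}


# "the oil prices rose"
g = Graph()
g.add_edge("det", "prices", "the")
g.add_edge("nn", "prices", "oil")
g.add_edge("nsubj", "rose", "prices")

delete_determiners(g)
merge_noun_compounds(g)
print(sorted(g.nodes))  # ['oil prices', 'rose']
print(sorted(g.edges))  # [('nsubj', 'rose', 'oil prices')]
```

Each rule pairs a condition on the graph (here, the presence of an edge with a given dependency type) with a transformation. Porting a rule therefore mostly amounts to finding an equivalent condition in the German annotation scheme.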

In line with the English system, which works on collapsed and propagated Stanford dependencies (de Marneffe and Manning, 2008), we assume a similar input representation for German that can be obtained with a set of collapsing and propagation rules provided by Ruppert et al. (2015) for TIGER dependencies (Seeker and Kuhn, 2012).

Overall, we find that most rules can be used for German, in particular because some syntactic differences, such as freer word order (Kübler, 2008), are already masked by the dependency representation (Seeker and Kuhn, 2012). We identified three groups of rules that can be transferred to German with different amounts of effort and also identified the need for some additional rules. In the next sections, we discuss these groups.

Directly Applicable Rules. About 38% of the rule set can be directly ported to German by solely replacing dependency types, POS tags and lemmas with their German equivalents.

As an example, the PropS rule removing negation tokens looks for neg dependencies in the graph, for which a corresponding type ng exists in the German tag set. We found similar correspondences to remove punctuation and merge proper nouns and number compounds.

In addition, we can also handle appositions and existentials with direct mappings.

Figure 5.2: PropS representation for a German sentence from TIGER. We show the sentence in German and English, its syntactic structure in collapsed dependencies and the transformed PropS graph. Grey boxes in the graph visualization indicate predicate nodes. As examples of PropS transformations, note how proper nouns become single nodes, determiners and punctuation get dropped and the adjective at the beginning becomes a predicate instead of the copular verb. Two OIE tuples, one unary and one binary, are extracted from this sentence.

DE: Sehenswert sind die Orte San Jose und San Andres, die an der nördlichen Küste des Petén-Itzá-Sees liegen.
EN: Worth seeing are the towns San Jose and San Andres, which are located on the northern shore of lake Petén-Itzá.
(1) (die Orte San Jose und San Andres - liegen - an der nördlichen Küste des Petén-Itzá-Sees)
(2) (die Orte San Jose und San Andres - sehenswert)

Chapter 5. Concept and Relation Extraction

Rules Requiring Small Changes. For 35% of the English rules, small changes are necessary, mainly because no direct mapping to the German tag set is possible or the annotation style differs. For instance, while English has a specific type det to link determiners to their governor, a more generic type nk (noun kernel modifier) is used in German that also occurs in other cases. Instead, determiners can be easily detected by part-of-speech, tagged as ART, as the following example illustrates:

Ich   bin    in    die  Schule  gegangen  .
PPER  VAFIN  APPR  ART  NN      VVPP      $.
I     am     to    the  school  gone      .

Dependency labels on the arcs: root, sb, oc, mo, nk, nk, $
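The difference between the two conditions can be made concrete with the example sentence. The sketch below is illustrative only: the `Token` type and both helper functions are hypothetical, and the assignment of dependency labels to individual tokens is our reading of the example rather than parser output.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    pos: str     # STTS part-of-speech tag
    deprel: str  # label of the edge to the token's governor (assumed assignment)


SENTENCE = [
    Token("Ich", "PPER", "sb"),
    Token("bin", "VAFIN", "root"),
    Token("in", "APPR", "mo"),
    Token("die", "ART", "nk"),
    Token("Schule", "NN", "nk"),
    Token("gegangen", "VVPP", "oc"),
    Token(".", "$.", "$"),
]


def by_deprel(tokens, label):
    """Select tokens by dependency type, as the English det rule does."""
    return [t.form for t in tokens if t.deprel == label]


def determiners(tokens):
    """Select determiners by POS tag, the alternative condition for German."""
    return [t.form for t in tokens if t.pos == "ART"]


print(by_deprel(SENTENCE, "nk"))  # ['die', 'Schule']
print(determiners(SENTENCE))      # ['die']
```

Filtering on nk alone would also remove the noun, whereas the POS-based condition isolates the determiner.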

Another type of difference exists with regard to the representation of auxiliary verb constructions. In Stanford dependencies, main verbs govern all auxiliaries, whereas in TIGER dependencies, an auxiliary governs the main verb. The above example shows this for gone and am. Therefore, all rules identifying and removing auxiliaries and modals have to be adapted to account for this difference.

With similar changes as discussed for determiners, we can also handle possessive and copular constructions. The graph for Michael's bicycle is red, for example, features an additional predicate have to explicate the implicit possessive relation. The copular verb is is omitted and red becomes an adjectival predicate in the graph representing this sentence:

Nodes: haben (have), Michael, Fahrrad (bicycle), rot (red)
Edge labels: prop_of, subj, obj, poss

Moreover, conditional constructions can be processed with slight changes as well. Missing a counterpart for the type mark, we instead look for subordinating conjunctions by part-of-speech. In fact, we found conditionals to be represented more consistently across different conjunctions, making their handling in German easier than in English.

Rules Requiring Substantial Changes. More substantial changes are necessary for the remaining 27% of the rules. To represent active and passive in a uniform way, PropS turns the subject into an object and a potential by-clause into the subject in passive clauses. For English, these cases are indicated by the presence of passive dependencies such as nsubjpass. For German, however, no direct counterparts exist and instead passive constructions use the same types as active ones. The following example illustrates this for a short sentence:


The house was built
Dependency labels on the arcs: root, det, nsubjpass, auxpass

Das Haus wurde gebaut
Dependency labels on the arcs: root, nk, sb, oc

As an alternative strategy, we instead look for past participle verbs (by part-of-speech) that are governed by a form of the auxiliary werden (Schäfer, 2015). Instances of the German static passive (Zustandspassiv) are, in contrast, handled like copulas.
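The alternative condition can be sketched as follows. This is an illustration under assumptions, not the actual PropsDE rule: the `Token` type, the head indices and the function name are hypothetical, and only the STTS tag VVPP (past participle of a full verb) is checked.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    idx: int
    form: str
    lemma: str
    pos: str             # STTS part-of-speech tag
    head: Optional[int]  # index of the governor, None for the root (assumed attachments)
    deprel: str


def passive_participles(tokens):
    """Return past participles that are governed by a form of 'werden'."""
    by_idx = {t.idx: t for t in tokens}
    result = []
    for t in tokens:
        gov = by_idx.get(t.head)
        if t.pos == "VVPP" and gov is not None and gov.lemma == "werden":
            result.append(t)
    return result


# "Das Haus wurde gebaut" (passive)
PASSIVE = [
    Token(1, "Das", "der", "ART", 2, "nk"),
    Token(2, "Haus", "Haus", "NN", 3, "sb"),
    Token(3, "wurde", "werden", "VAFIN", None, "root"),
    Token(4, "gebaut", "bauen", "VVPP", 3, "oc"),
]

# "Ich bin gegangen" (perfect tense with 'sein', not a passive)
PERFECT = [
    Token(1, "Ich", "ich", "PPER", 2, "sb"),
    Token(2, "bin", "sein", "VAFIN", None, "root"),
    Token(3, "gegangen", "gehen", "VVPP", 2, "oc"),
]

print([t.form for t in passive_participles(PASSIVE)])  # ['gebaut']
print([t.form for t in passive_participles(PERFECT)])  # []
```

Because the check requires a participle, future-tense constructions with werden and an infinitive do not match. Once a passive participle is found, the sb dependent of the auxiliary can be relabeled as the object of the participle.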

Another deviation from the English system is necessary for relative clauses. PropS heavily relies on the Stanford dependency converter, which propagates dependencies of the relative pronoun to its referent. The German collapser does not have this feature, and we therefore implement it as an additional transformation. As an example, consider Figure 5.2, where the sb dependency from liegen to die is propagated to the referent Orte in the PropS graph (and is labeled as subj in PropS instead of sb, which is used in TIGER).
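A minimal sketch of such a propagation step is shown below. It is not the PropsDE code: the function name and the edge representation are hypothetical, nodes are identified by word form for brevity, and we assume that the relative clause is attached to its referent with the TIGER label rc and that relative pronouns carry the STTS tag PRELS.

```python
def propagate_relative_pronouns(edges, pos_tags):
    """Redirect edges that point to a relative pronoun to the pronoun's referent.

    edges:    set of (label, governor, dependent) triples
    pos_tags: dict mapping each node to its STTS tag
    """
    result = set(edges)
    for label, referent, verb in edges:
        if label != "rc":
            continue
        # Only pronouns directly governed by the relative clause verb are handled.
        for dep_label, gov, dep in edges:
            if gov == verb and pos_tags.get(dep) == "PRELS":
                result.discard((dep_label, gov, dep))
                result.add((dep_label, gov, referent))
    return result


# Fragment of the sentence in Figure 5.2
edges = {
    ("rc", "Orte", "liegen"),
    ("sb", "liegen", "die"),
    ("mo_an", "liegen", "Küste"),
}
pos_tags = {"Orte": "NN", "liegen": "VVFIN", "die": "PRELS", "Küste": "NN"}

for edge in sorted(propagate_relative_pronouns(edges, pos_tags)):
    print(edge)
```

After this step, liegen has Orte as its sb dependent, which later PropS rules can relabel as subj.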

To abstract away from different tenses, PropS represents predicates with their lemma, indicating the original tense as a feature, as detected with a set of rules operating on POS tags. For German, no tense information is contained in POS tags, but instead, a morphological analysis can provide it. Determining the overall tense of a sentence based on that requires a new set of rules, as the grammatical construction of tenses differs between German and English. PropS also tries to heuristically identify raising constructions, in which syntactic and semantic roles of arguments differ. In German, this phenomenon occurs in similar situations, such as in Michael scheint zu lächeln (Michael seems to smile), in which Michael is not the semantic subject of scheinen, though syntactically it is. To determine these cases heuristically, an empirically derived list of common raising verbs, as done by Chrupala and van Genabith (2007) for English, needs to be created for German.

Additional Rules. An additional step that is necessary during the lemmatization of verbs for German is to recover separated particles. For example, a verb like ankommen (arrive) can be split in a sentence such as Er kam an (He arrived), moving the particle to the end of the sentence, with a potentially large number of other tokens in between. We can reliably reattach these particles based on the dependency parse. Another addition to the rules that we consider important is to detect subjunctive forms of verbs and indicate the mood with a specific feature for the predicate. A morphological analysis provides the necessary input. Compared to English, the usage of the subjunctive is much more common, usually to indicate either unreality or indirect speech (Thieroff, 2004).
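The particle recovery can be sketched as follows. This is an illustration under assumptions rather than the actual implementation: the `Token` type and the function name are hypothetical, and we assume that separated particles carry the STTS tag PTKVZ and are attached directly to their verb in the dependency parse.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    idx: int
    form: str
    lemma: str
    pos: str             # STTS part-of-speech tag
    head: Optional[int]  # index of the governor, None for the root


def recover_particle_verbs(tokens):
    """Map token index to lemma, prefixing separated particles to their verb."""
    lemmas = {t.idx: t.lemma for t in tokens}
    for t in tokens:
        if t.pos == "PTKVZ" and t.head in lemmas:
            lemmas[t.head] = t.form.lower() + lemmas[t.head]
    return lemmas


# "Er kam gestern an" (He arrived yesterday)
SENTENCE = [
    Token(1, "Er", "er", "PPER", 2),
    Token(2, "kam", "kommen", "VVFIN", None),
    Token(3, "gestern", "gestern", "ADV", 2),
    Token(4, "an", "an", "PTKVZ", 2),
]

print(recover_particle_verbs(SENTENCE)[2])  # ankommen
```

Since the lookup follows the dependency edge instead of token adjacency, the number of tokens between verb and particle does not matter.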

Table 5.4 summarizes our analysis of PropS’ portability to German. With 38% of the rules being directly transferable and 35% requiring only small changes, the necessary effort to create a German version of PropS seems to be substantially smaller than creating a complete rule set for German from scratch.


Category             Size  Description
Directly applicable  38%   Only replacement of POS tags and dependency types with German equivalents; a one-to-one mapping exists.
Small changes        35%   No one-to-one mapping exists, but alternative conditions, e.g. using POS instead of dependencies or vice versa, can be easily found.
Substantial changes  27%   Rules cannot be used; complex alternative rules or resources have to be created instead.

Table 5.4: Analysis of the portability of PropS rules from English to German.

Following that analysis, we implemented a German version of PropS, named PropsDE. It uses Mate Tools for POS tagging, lemmatizing and parsing (Bohnet et al., 2013). Dependencies are collapsed and propagated with JoBimText (Ruppert et al., 2015). The rule set covers 89% of the English rules, lacking only the handling of raising-to-subject verbs and more advanced strategies for coordination constructions and tense detection. Similar to PropS, the implementation provides both PropS graphs and OIE extractions as its output. For the latter, similar to other OIE systems, PropsDE assigns confidence scores to extracted tuples. It uses a logistic regression model that has been trained on 410 extractions annotated for correctness. Model features are the length of the input sentence, length of the extraction, its number of arguments, whether it contains punctuation and which dependency types and PropS edge labels are present in the corresponding part of the graph.
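To make the feature set more tangible, the following sketch computes these features for one extraction and turns them into a score with the logistic function. It is a hypothetical illustration, not the PropsDE model: the function names, the feature names and the encoding of labels as binary indicators are assumptions, and no trained weights are shown.

```python
import math
import string


def extract_features(sentence_tokens, predicate, arguments, graph_labels):
    """Build a feature dictionary for one extraction (hypothetical encoding)."""
    ext_tokens = predicate.split() + [w for arg in arguments for w in arg.split()]
    is_punct = lambda tok: all(ch in string.punctuation for ch in tok)
    features = {
        "sentence_length": float(len(sentence_tokens)),
        "extraction_length": float(len(ext_tokens)),
        "num_arguments": float(len(arguments)),
        "has_punctuation": float(any(is_punct(tok) for tok in ext_tokens)),
    }
    # One binary indicator per dependency type or PropS edge label in the subgraph.
    for label in graph_labels:
        features[f"label={label}"] = 1.0
    return features


def confidence(features, weights, bias=0.0):
    """Logistic regression score: sigmoid of the weighted feature sum."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Unary tuple (2) from Figure 5.2; the sentence is pre-tokenized by whitespace.
sentence = ("Sehenswert sind die Orte San Jose und San Andres , die an der "
            "nördlichen Küste des Petén-Itzá-Sees liegen .").split()
features = extract_features(
    sentence,
    predicate="sehenswert",
    arguments=["die Orte San Jose und San Andres"],
    graph_labels=["prop_of"],
)
print(features["extraction_length"], features["num_arguments"])  # 8.0 1.0
print(confidence(features, weights={}))  # 0.5
```

With an empty weight vector, every extraction receives a score of 0.5. In a trained model, the weights would be estimated from the annotated extractions.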

Based on correspondence with the authors of the English system and their estimate of the effort they put into building it, we conclude that we implemented the German version with roughly 10% of the effort they reported, including both the conceptual analysis and the technical implementation. This shows that manually porting a rule-based system to a new language is a valid way to overcome the lack of tools in a specific language with reasonable effort and in a short amount of time. However, the analysis also pointed out that there is a range of challenges, specifically the rules requiring substantial changes, that make fully automatic porting difficult. For target languages that differ more from English than German does, these cases are presumably even more prevalent.