II. Compound Splitting

7. Morphological Compound Splitting

7.2. Disambiguation

In the previous sections, we already mentioned that structurally identical analyses differing only in one feature (e.g. case, cf. Figure 7.1 above) are discarded and that we filter out splittings into bound word parts (cf. Figure 7.2). However, the remaining analyses still have to be disambiguated in order to find the one best splitting option.
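The first of these filters can be illustrated with a minimal sketch. It assumes, based on the analyses shown in Figure 7.8, that inflectional features such as `<Fem><Nom><Sg>` are the tags following the `<+POS>` tag of a Smor-style analysis string. The function names `structure` and `dedupe` are hypothetical and not part of Smor.

```python
import re

def structure(analysis: str) -> str:
    """Strip the inflectional features of a Smor-style analysis string.

    Assumption: features are all tags after the "<+POS>" tag, as in
    "Lebensmittel<NN>Bereitstellung<+NN><Fem><Nom><Sg>".
    """
    return re.sub(r"(<\+[^>]+>).*$", r"\1", analysis)

def dedupe(analyses: list[str]) -> list[str]:
    """Keep the first analysis of each structure, preserving input order."""
    seen: dict[str, str] = {}
    for analysis in analyses:
        seen.setdefault(structure(analysis), analysis)
    return list(seen.values())

analyses = [
    "Lebensmittel<NN>Bereitstellung<+NN><Fem><Nom><Sg>",
    "Lebensmittel<NN>Bereitstellung<+NN><Fem><Acc><Sg>",
    "Leben<NN>Mittel<NN>Bereitstellung<+NN><Fem><Nom><Sg>",
]
print(dedupe(analyses))  # the two case variants collapse into one entry
```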

Previous Work Demberg (2006), who used Smor for letter-to-phoneme conversion, reports that she found on average 2.4 different segmentations per word. In the past, the disambiguation of morphological analysers was often performed using context-sensitive POS-tags from a parser (Nießen and Ney, 2000), POS-based heuristic disambiguation rules (Hardmeier et al., 2010) or by training classifiers, see e.g. Habash and Rambow (2005) for Arabic or Yuret and Türe (2006) for Turkish.

Structure In the following, we describe our disambiguation approach, which consists of two steps: first, we restrict the analysis depth (Section 7.2.1); second, we use corpus-derived frequencies to disambiguate the remaining splitting options (Section 7.2.2).

7.2.1. Analysis Depth

Smor returns a deep morphological analysis for each word. However, for the present application of compound splitting in SMT, the aim is to find the one best splitting option, so that each component word of the German compound ideally corresponds to one English word. A high-level linguistic analysis is thus mostly sufficient. In contrast, for applications in fields such as lexical semantics, where Smor could be used to approximate the meaning of a word or compound, the deep analysis level might be more desirable.

Filter flag Note that Smor features an internal filter (“-d” for disambiguation) that keeps only high-level analyses with the least number of morphemes. This leaves fully lexicalised compounds unsplit. As a consequence, opaque compounds are left unsplit if they are covered by Smor’s lexicon.²⁷ An early approach by Rackow et al. (1992) pursues a similar strategy in that all words that have their own entry in a hand-crafted lexicon are left unsplit. This procedure is also in line with Schiller (2005), who found that human readers, when faced with the output of an unweighted morphological analyser (similar to Smor), often prefer splittings into the smallest number of parts. Finally, Demberg (2006) used different settings of Smor to find optimal segmentations for the task of grapheme-to-phoneme conversion, with and without the restricted analysis depth option. Consider the example “Lebensmittelbereitstellung” (= “food supply”) in Figure 7.8, where we summarise many structurally different analyses of different depths. Using the “-d” flag for restricted analysis depth when analysing this word, only the

²⁷ See also Section 5.1.1 on compositionality (page 52 above).

> Lebensmittelbereitstellung

leben<V><NN><SUFF>Mittel<NN>Bereitstellung<+NN><Fem><Nom><Sg>

leben<V><NN><SUFF>Mittel<NN>be<VPREF>reiten<V>Stellung<+NN><Fem><Nom><Sg>

leben<V><NN><SUFF>Mittel<NN>be<VPREF>reiten<V>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

leben<V><NN><SUFF>Mittel<NN>bereit<ADJ>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

leben<V><NN><SUFF>Mittel<NN>bereit<ADJ>Stellung<+NN><Fem><Nom><Sg>

leben<V><NN><SUFF>Mittel<NN>bereit<VPART>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

leben<V><NN><SUFF>mittel<ADJ>be<VPREF>reiten<V>Stellung<+NN><Fem><Nom><Sg>

leben<V><NN><SUFF>mittel<ADJ>be<VPREF>reiten<V>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

leben<V><NN><SUFF>mittel<ADJ>bereit<ADJ>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

leben<V><NN><SUFF>mittel<ADJ>bereit<ADJ>Stellung<+NN><Fem><Nom><Sg>

leben<V><NN><SUFF>mittel<ADJ>bereit<VPART>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

leben<V><NN><SUFF>mittel<ADJ>Bereitstellung<+NN><Fem><Nom><Sg>

Leben<NN>mittel<ADJ>Bereitstellung<+NN><Fem><Nom><Sg>

Leben<NN>mittel<ADJ>be<VPREF>reiten<V>Stellung<+NN><Fem><Nom><Sg>

Leben<NN>mittel<ADJ>be<VPREF>reiten<V>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

Leben<NN>mittel<ADJ>bereit<ADJ>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

Leben<NN>mittel<ADJ>bereit<ADJ>Stellung<+NN><Fem><Nom><Sg>

Leben<NN>mittel<ADJ>bereit<VPART>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

Leben<NN>Mittel<NN>Bereitstellung<+NN><Fem><Nom><Sg>

Leben<NN>Mittel<NN>be<VPREF>reiten<V>Stellung<+NN><Fem><Nom><Sg>

Leben<NN>Mittel<NN>be<VPREF>reiten<V>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

Leben<NN>Mittel<NN>bereit<ADJ>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

Leben<NN>Mittel<NN>bereit<ADJ>Stellung<+NN><Fem><Nom><Sg>

Leben<NN>Mittel<NN>bereit<VPART>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

Lebensmittel<NN>Bereitstellung<+NN><Fem><Nom><Sg>

Lebensmittel<NN>be<VPREF>reiten<V>Stellung<+NN><Fem><Nom><Sg>

Lebensmittel<NN>be<VPREF>reiten<V>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

Lebensmittel<NN>bereit<ADJ>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

Lebensmittel<NN>bereit<ADJ>Stellung<+NN><Fem><Nom><Sg>

Lebensmittel<NN>bereit<VPART>stellen<V>ung<SUFF><+NN><Fem><Nom><Sg>

(a) Summary of structurally different Smor analyses. Note that Smor analyses are not ranked according to analysis depth.

[Diagram: lattice of possible word parts. Alternatives per position: leben<V> / Leben<NN>, mittel<ADJ> / Mittel<NN>, bereiten<V> / bereit<ADJ>, Stellung<NN>. Spanning parts: Lebensmittel<NN>, Bereitstellung<NN>.]

(b) Illustration of possible splittings based on the above analyses. Splittings into bound word parts like prefixes (“be-”) or suffixes (“-ung”) are blocked.

Figure 7.8.: Deep morphological analysis of “Lebensmittelbereitstellung” (= “food supply”), with “leben” = “to live”, “Leben” = “life”, “Lebensmittel” = “food”, “Mittel” = “average/means”, “mittel” = “mid”, “bereiten” = “to prepare”, “reiten” = “to ride”, “bereit” = “ready”, “stellen” = “to put”, “Stellung” = “position”, “Bereitstellung” = “supply”.

analysis Lebensmittel<NN>Bereitstellung<+NN><Fem><Nom><Sg> remains, and no further disambiguation is required. This internal filtering helps to prevent unwanted splittings into too many word parts, but it can also exclude the desired analysis. For example, “Lebensmittelpunkt” (= “centre of life”) should be split into “Leben|Mittelpunkt” (= “life|centre”), but the only analysis returned when using the internal disambiguation filter is “Lebensmittel|Punkt” (= “food|point”), because “Lebensmittel” is lexicalised in Smor. While this splitting is morphologically sound, it is semantically implausible.
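The effect of restricting the analysis depth can be mimicked outside Smor with a short sketch. This is not Smor's implementation of the “-d” flag. It only assumes that every maximal run of characters outside `<...>` tags counts as one morpheme, and the function names are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]*>")

def morphemes(analysis: str) -> list[str]:
    """Return the morphemes of a Smor-style analysis string, tags removed."""
    return [m for m in TAG.split(analysis) if m]

def shallowest(analyses: list[str]) -> list[str]:
    """Keep only the analyses with the smallest number of morphemes."""
    if not analyses:
        return []
    fewest = min(len(morphemes(a)) for a in analyses)
    return [a for a in analyses if len(morphemes(a)) == fewest]

analyses = [
    "Leben<NN>Mittel<NN>bereit<ADJ>Stellung<+NN><Fem><Nom><Sg>",
    "Lebensmittel<NN>be<VPREF>reiten<V>Stellung<+NN><Fem><Nom><Sg>",
    "Lebensmittel<NN>Bereitstellung<+NN><Fem><Nom><Sg>",
]
print(shallowest(analyses))
```

Applied to three of the analyses from Figure 7.8, only the two-morpheme analysis survives, which matches the behaviour described in the text.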

Hierarchy Smor’s implementation (as a finite-state transducer, see Section 3.2 for details) does not allow for a hierarchically structured segmentation of compound words that consist of more than two component words. For example, the semantics of the German compound “Turbinenpassagierflugzeug” varies depending on its context and on whether a right-branching (Turbine(Passagier|Flugzeug)) (= (turbine(passenger|aircraft))) or a left-branching ((Turbine|Passagier)Flugzeug) (= ((turbine|passenger)aircraft)) word structure is assumed. Smor’s analysis of the word only reveals that it consists of the three parts “Turbine” (= “turbine”), “Passagier” (= “passenger”) and “Flugzeug” (= “aircraft”).
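To make the ambiguity concrete, the following sketch enumerates the binary bracketings that a flat list of parts leaves open. It is purely illustrative: Smor does not produce such structures, and the function name `bracketings` is hypothetical.

```python
def bracketings(parts):
    """Yield every binary bracketing of a sequence of word parts as nested tuples."""
    if len(parts) == 1:
        yield parts[0]
        return
    # Choose the position of the top-level split, then recurse on both sides.
    for i in range(1, len(parts)):
        for left in bracketings(parts[:i]):
            for right in bracketings(parts[i:]):
                yield (left, right)

for tree in bracketings(("Turbine", "Passagier", "Flugzeug")):
    print(tree)
```

For three parts this yields exactly the right-branching and the left-branching structure discussed above. The number of bracketings grows quickly with the number of parts.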

7.2.2. Word Part Frequencies

After having deleted structurally identical analyses (differing only in features such as case or number) and restricted the analysis depth, we finally use corpus frequencies to disambiguate the remaining set of analyses in order to select one splitting option. The disambiguation procedure we use is essentially the same as described in Section 6.1.2 above. We briefly repeat it here for readability. We follow Koehn and Knight (2003), who used the geometric mean of substring frequencies to find optimal split points.

We calculate the geometric mean score of each splitting option based on the natural log frequencies of the word parts given by the Smor analyses.²⁸ The splitting that maximises the geometric mean score is picked. The following formula is adapted from Koehn and Knight (2003, p. 189):

$$
\operatorname*{argmax}_{S} \left( \frac{\sum_{p_i \in S} \log\bigl(\mathrm{count}(p_i)\bigr)}{n} \right)
$$

²⁸ We use the monolingual training data of the WMT shared task 2009 to derive word and word part frequencies. It consists of about 146 million words. http://www.statmt.org/wmt09

Possible splittings                                     Gloss                                  Score
alternativ (210) | stromern (3) | Zeuger (1)            alternative (ADJ) | to roam | creator  2.18
alternativ (210) | Strom (5,499) | Erzeuger (1,473)     alternative (ADJ) | power | producer   7.11
alternativst (1) | Rom (5,132) | Erzeuger (1,473)       most alternative | Rome | producer     0
Alternative (5,036) | stromern (3) | Zeuger (1)         alternative (NN) | to roam | creator   3.20
Alternative (5,036) | Strom (5,499) | Erzeuger (1,473)  alternative (NN) | power | producer    8.14
alternativ (210) | Stromerzeuger (136)                  alternative (ADJ) | power producer     5.17
alternativst (1) | Romerzeuger (1)                      most alternative | Rome producer       0
Alternative (5,036) | Stromerzeuger (136)               alternative (NN) | power producer      6.71
alternativstromern (1) | Zeuger (1)                     to roam alternatively | creator        0
Alternativstrom (1) | Erzeuger (1,473)                  alternative power | producer           0

Figure 7.9.: All possible splittings and recombinations for “Alternativstromerzeuger” (= “alternative power producer”) with restricted analysis depth, including word frequencies (in “()”) and geometric mean scores (cf. column score).

In the formula above, S denotes a split, p_i a word part and n the number of parts. Whenever a word part has not occurred in the data (thus having a frequency of 0), the geometric mean score for this splitting option is set to 0. In a second step, we generate all possible recombinations of these word parts and calculate the geometric mean scores for those as well.
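The scoring and recombination steps can be sketched as follows. This is a minimal illustration, not the original implementation. Two assumptions are made: recombination is done by naive concatenation of adjacent parts with non-initial parts lowercased (linking elements and stem changes are ignored), and only recombinations with at least two parts are generated. The counts are those given in Figure 7.9; the function names are hypothetical.

```python
import math
from itertools import product

def score(parts, counts):
    """Mean natural-log frequency of the parts; 0.0 if any part is unseen."""
    freqs = [counts.get(p, 0) for p in parts]
    if not freqs or min(freqs) == 0:
        return 0.0
    return sum(math.log(f) for f in freqs) / len(freqs)

def recombinations(parts):
    """Yield the parts with every choice of adjacent merges, keeping >= 2 parts.

    Assumption: merging is plain concatenation, later parts lowercased.
    """
    if len(parts) < 2:
        return
    for keep in product([True, False], repeat=len(parts) - 1):
        merged = [parts[0]]
        for part, boundary in zip(parts[1:], keep):
            if boundary:
                merged.append(part)           # keep the split point
            else:
                merged[-1] += part.lower()    # merge with the previous part
        if len(merged) > 1:
            yield merged

counts = {"Alternative": 5036, "Strom": 5499, "Erzeuger": 1473, "Stromerzeuger": 136}
options = list(recombinations(["Alternative", "Strom", "Erzeuger"]))
best = max(options, key=lambda parts: score(parts, counts))
print(best, round(score(best, counts), 2))  # ['Alternative', 'Strom', 'Erzeuger'] 8.14
```

Options containing a part that is missing from `counts` (here the naively concatenated “Alternativestrom”) receive a score of 0 and therefore never win.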

A detailed example is given in Figure 7.9, where Smor, when used with the disambiguation flag, still returns five structurally different analyses for “Alternativstromerzeuger” (= “alternative power producer”). We give corpus frequencies (in brackets) and geometric mean scores (rightmost column) for all five splitting options and for the recombinations of word parts within them. It can be seen that the splitting into “Alternative|Strom|Erzeuger” (= “alternative|power|producer”) scores highest and is thus picked. However, in cases where the natural log frequency of the word as a whole exceeds the geometric mean score of the splitting options, the word is left unsplit. This concerns e.g. lexicalised combinations that have lost their compositionality over time. Consider the case of “Armaturenbrett” (= “dashboard”), which occurred 211 times and thus scored 5.35, while its splitting into “Armatur” (15) and “Brett” (423) (= “armature|board”) yields a score of only 4.37.
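The decision to leave a word unsplit can be checked with a few lines of code. The sketch below uses the counts quoted for “Armaturenbrett”. The function names are hypothetical, and the comparison rule is a direct reading of the description above rather than the original implementation.

```python
import math

def mean_log_count(counts):
    """Mean natural-log frequency of the parts; 0.0 if any part is unseen."""
    if not counts or min(counts) <= 0:
        return 0.0
    return sum(math.log(c) for c in counts) / len(counts)

def leave_unsplit(word_count, splittings):
    """True if the whole word's log frequency exceeds every splitting's score."""
    if word_count <= 0:
        return False
    best = max((mean_log_count(s) for s in splittings), default=0.0)
    return math.log(word_count) > best

# "Armaturenbrett" occurred 211 times; its parts "Armatur" 15 and "Brett" 423 times.
print(leave_unsplit(211, [[15, 423]]))  # True
```

Here log(211) is roughly 5.35, while the mean of log(15) and log(423) stays below 4.4, so the compound is kept as a whole.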

In document: Morphological processing of compounds for statistical machine translation (pages 107-112)