II. Compound Splitting

8. Gold Standard Evaluation Results

8.4. External Domain-Specific Gold Standard

splitting                  Correct                   Wrong                       Metrics
↓ approach        split  not split   split     not  faulty  precision   recall  accuracy
no split              0      5,087       0   1,100       0         0%   82.22%    82.22%
basic freq.         575      3,976   1,112     107     417     27.33%   52.32%    73.56%
extended freq.      623      4,584     504     255     221     46.22%   56.69%    84.16%
POS                 661      4,933     155     369      69     74.69%   60.15%    90.42%
Smor                990      4,895     193      32      77     78.57%   90.08%    95.12%
Smor -d             917      5,037      51     116      66     88.68%   83.44%    96.23%
Smor -d NN          677      5,056      32     373      49     89.31%   61.60%    92.66%

Table 8.5.: Accuracies of the different splitting approaches with respect to the linguistic gold standard. The best numeric scores per column are boldfaced.

dard (reported in Section 8.2.2 above). While the frequency-based approaches suffer from poor precision, which is due to heavy over-splitting (e.g. 1,112 wrong split cases for the basic frequency approach), both precision and overall performance rise as more linguistic knowledge (POS, Smor) is added to the approaches. Overall, the Smor -d splittings fit the gold standard best, with an accuracy of 96.23%, even though Smor reaches higher recall (90.08% vs. 83.44%) and the most correct splits (990), and Smor -d NN reaches higher precision (89.31% vs. 88.68%, mainly due to its high correct not split score of 5,056). Smor -d is the only splitting approach that scores reasonably high in both precision (88.68%) and recall (83.44%).

    word              gold annotation               gloss            details
1   Risikopotenzial   Risiko{N}+Potenzial{N}        risk potential   split points are marked “+”
2   Spinnenfinger     Spinne|n{N}+Finger{N,V}       spider fingers   inserted letters are marked “|”
3   Farbbild          Farb,e{N}+Bild{N,V}           color picture    deleted letters are marked “,”
4   Mediengestalter   Medi,um|en{N}+Gestalter{N}    media designer   substitutions: deletion and filler letter
5   Kultfiguren       Kult{N}+Figur(en){N}          cult figures     inflectional endings are marked “()”
6   Obstgärten        Obst{N}+GArten{N}             fruit orchard    umlautung phenomena are marked with capital letters (here: “A”)

Table 8.6.: Featured annotations of the external gold standard.

8.4.1. Annotation Details

Data/Creation The external domain-specific gold standard is based on data from c’t⁴¹, a German computer magazine for semi-professional computer users, which appears bi-weekly. All texts from the issues 01/2000 to 13/2004 were used, in total 117 magazines, adding up to 20,000 pages of A4 text (= 15 million tokens). After filtering out lower-cased⁴² and function words (using STTS’ list of closed word class members⁴³) from this text collection, 378,846 words remained. Among them were compound and simple words, nouns and words of other word classes. A word list derived from a German lexicon,⁴⁴ including additional hand-crafted entries by Marek (2006), was then used to filter out words that were neither compositional nor nouns. This procedure resulted in a list of 158,653 nominal compound words. The gold standard annotation was performed semi-automatically, using a simple compound splitter (similar to the one described in Koehn and Knight, 2003) that asked for human advice in the (around 12,000) cases of doubt that occurred. All errors that Marek (2006) detected while further developing the weighted FST were fixed, so that in the end, the proportion of erroneous annotations was estimated to be about 3%.

40 http://diotavelli.net/files/ccorpus.txt

41 http://www.heise.de/ct

42 The gold standard was designed for nominal compounds and German nouns are always upper-cased.

43 The Stuttgart Tübingen Tag Set (STTS), created 1995/99, cf. http://www.ims.uni-stuttgart.de/projekte/corplex/TagSets/stts-table.html.

44 CELEX-2, release 2.5

    gold standard format          our format         format adaptation
1   Risiko{N}+potenzial{N}        Risiko Potenzial   substitute “+” with whitespace
2   Spinne|n{N}+Finger{N,V}       Spinne Finger      remove inserted letters
3   Farb,e{N}+Bild{N,V}           Farbe Bild         re-introduce deleted letters
4   Medi,um|en{N}+Gestalter{N}    Medium Gestalter   use original word stem
5   Kult{N}+Figur(en){N}          Kult Figuren       keep inflectional endings
6   Obst{N}+GArten{N}             Obst Gärten        keep umlautung

Table 8.7.: Required formatting adaptations for the external gold standard.
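The adaptations listed in Table 8.7 can be illustrated with a minimal Python sketch. It is not the conversion script used for the evaluation; the names `adapt`, `adapt_part` and `UMLAUTS` are hypothetical, and the treatment of capital “O” and “U” as umlaut markers is an assumption (the examples above only show “A”). The sketch preserves the letter case of its input and does not capitalise nouns.

```python
import re

# Assumption: capital A/O/U inside a word part mark umlautung
# (only "A" is attested in the examples above).
UMLAUTS = {"A": "ä", "O": "ö", "U": "ü"}


def adapt_part(part: str) -> str:
    """Convert one annotated word part into a plain word form."""
    part = re.sub(r"\{[^}]*\}", "", part)    # drop word class tags such as {N} or {N,V}
    part = re.sub(r"\|[^\W\d_]+", "", part)  # remove inserted letters ("|n", "|en")
    part = part.replace(",", "")             # re-introduce deleted letters ("Farb,e" -> "Farbe")
    part = part.replace("(", "").replace(")", "")  # keep inflectional endings
    # Keep umlautung: a marker capital after the first character becomes an umlaut.
    return part[:1] + "".join(UMLAUTS.get(ch, ch) for ch in part[1:])


def adapt(annotation: str) -> str:
    """Replace "+" split points with whitespace and clean every part."""
    return " ".join(adapt_part(part) for part in annotation.split("+"))


if __name__ == "__main__":
    for gold in ("Spinne|n{N}+Finger{N,V}", "Medi,um|en{N}+Gestalter{N}",
                 "Obst{N}+GArten{N}"):
        print(gold, "->", adapt(gold))
```

Removing the tags first matters: otherwise the comma inside “{N,V}” would be mistaken for a deletion marker.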

Featured Annotations Besides structural annotations (i.e. split points), the external gold standard features some additional annotations: each word and word part is annotated with its word class(es): for example, “{N}” indicates nouns (cf. potenzial{N}, Table 8.6, row 1), while “{N,V}” indicates that the word (part) is either a noun or a verb (cf. bild{N,V}, Table 8.6, row 3). Moreover, it indicates the required transformations from simple nouns into compound modifiers (and vice versa); cf. Table 8.6, rows 2-4 (insertions: Spinne → Spinnen, deletions: Farbe → Farb, substitutions: Medium → Medien). Finally, the external gold standard annotation includes base forms and inflectional endings (cf. Table 8.6, rows 5 and 6).

Required Adaptations The external gold standard features some annotations that either do not match the output of our compound splitter or are not relevant to the evaluation of splitting accuracy; the reduction to the base word form (= removal of inflectional endings), for example, belongs to the latter category. Table 8.7 illustrates the (minor) format adaptations that were performed. Furthermore, we re-merged split particles in the external gold standard, as we allow separated particles only for verbs:

word           gloss       gold standard format          after modification
Hauptaufgabe   main task   Haupt{N}+auf{PREP}+Gabe{N}    Haupt{N}+Aufgabe{N}
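The particle re-merging can be sketched as follows. This is an illustration under assumptions, not the procedure actually used: the function name `remerge_particles` is hypothetical, and treating exactly the tag “PREP” as the particle marker is inferred from the single example above.

```python
import re

# One annotated part: a word form followed by its word class tags in braces.
PART = re.compile(r"^(?P<form>[^{]+)\{(?P<tags>[^}]*)\}$")


def remerge_particles(annotation: str, particle_tags=("PREP",)) -> str:
    """Attach split-off particles to the word part that follows them."""
    merged = []
    pending = []  # (form, raw part) of particles waiting for their host
    for raw in annotation.split("+"):
        match = PART.match(raw)
        if match is None:
            raise ValueError(f"unexpected part: {raw!r}")
        form, tags = match["form"], match["tags"].split(",")
        if all(tag in particle_tags for tag in tags):
            pending.append((form, raw))
            continue
        if pending:
            # Prefix the particle(s) and lower-case the former initial letter.
            form = "".join(f for f, _ in pending) + form[:1].lower() + form[1:]
            if "N" in tags:  # German nouns are capitalised
                form = form[:1].upper() + form[1:]
            pending = []
        merged.append(f"{form}{{{','.join(tags)}}}")
    # A particle without a following host is left untouched.
    merged.extend(raw for _, raw in pending)
    return "+".join(merged)


if __name__ == "__main__":
    print(remerge_particles("Haupt{N}+auf{PREP}+Gabe{N}"))
```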

Finally, verbs occurring in modifier positions are kept in a shortened stem representation (without any inflectional ending) in the external gold standard. In contrast, our format represents such verbs as lemmas (with the infinitive ending). We solved this by adding the regular German infinitive ending “en” to all modifying verbs, ignoring the very few phonotactically driven exceptions:

word           gloss                  gold standard format     after modification
Sprengstoffe   explosive substances   spreng{V}+Stoff(e){N}    sprengen{V}+Stoffe{N}

splitting                  Correct                   Wrong                       Metrics
↓ approach        split  not split   split     not  faulty  precision   recall  accuracy
no split              0      4,408       0 154,245       0          0    2.78%     2.78%
basic freq.      84,924      1,006   3,402   1,439  67,882     54.37%   55.06%    54.16%
extended freq.  102,824      2,823   1,585  12,370  39,051     71.67%   66.66%    66.59%
POS             122,553      3,971     437  15,369  16,323     87.97%   79.45%    79.75%
Smor            135,490      4,112     296   5,681  13,074     91.02%   87.84%    87.99%
Smor -d         131,177      4,171     237   7,811  15,257     89.44%   85.04%    85.31%
Smor -d NN      130,838      4,190     218   8,174  15,233     89.44%   84.82%    85.11%

Table 8.8.: Accuracies of the different splitting approaches with respect to the external domain-specific gold standard. The gold standard consists of 158,653 nouns, of which 154,245 are to be split. The best numeric scores per column are boldfaced.

This applies only to modifiers that were assigned only {V} (as in the “Sprengstoffe” example). Whenever the modifying tag was ambiguous, we chose not to treat the modifier as a verb and thus no infinitive ending was added: e.g. bau{V,N}+arbeiter{N} → bau{N}+arbeiter{N} (construction worker).
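The treatment of verbal modifiers described above can be sketched in a few lines of Python. The function name `normalise_verb_modifiers` is hypothetical, and the sketch ignores the phonotactically driven exceptions just as the text does. It only rewrites the modifier; inflection marks such as “(e)” on the head are left to the separate format adaptation.

```python
import re

# One annotated part: a word form followed by its word class tags in braces.
PART = re.compile(r"^(?P<form>[^{]+)\{(?P<tags>[^}]*)\}$")


def normalise_verb_modifiers(annotation: str) -> str:
    """Add "en" to modifiers tagged only {V}; drop V from ambiguous tags."""
    parts = annotation.split("+")
    result = []
    for index, raw in enumerate(parts):
        match = PART.match(raw)
        if match is None:
            raise ValueError(f"unexpected part: {raw!r}")
        form, tags = match["form"], match["tags"].split(",")
        is_modifier = index < len(parts) - 1  # every part except the head
        if is_modifier and "V" in tags:
            if tags == ["V"]:
                form += "en"  # regular German infinitive ending
            else:
                # Ambiguous tag: do not treat the modifier as a verb.
                tags = [tag for tag in tags if tag != "V"]
        result.append(f"{form}{{{','.join(tags)}}}")
    return "+".join(result)


if __name__ == "__main__":
    print(normalise_verb_modifiers("spreng{V}+Stoff(e){N}"))
    print(normalise_verb_modifiers("bau{V,N}+arbeiter{N}"))
```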

8.4.2. Results

For the external gold standard as well, we calculated the accuracies of the different splitting approaches using the evaluation metrics presented in Section 8.1.2 above. The accuracies on the external gold standard are given in Table 8.8. For this gold standard evaluation, we did not compare different parameter settings of the extended frequency-based approach. We give the results for a minimal part size of 4 characters and a minimal part frequency of 3, as these yielded reasonably balanced precision and recall scores in the two previous gold standard evaluations. Despite its different characteristics (in terms of size, compound density and domain), the performance of the different splitting approaches on this gold standard deviates only slightly from that on the other gold standards.
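As a worked illustration of how the metric columns relate to the count columns, the following Python sketch recomputes precision, recall and accuracy from the five counts. The formulas are not quoted from Section 8.1.2; they are inferred from the counts in Table 8.8, which they reproduce for the rows checked here. The names `SplitCounts`, `precision`, `recall` and `accuracy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SplitCounts:
    correct_split: int
    correct_not_split: int
    wrong_split: int       # split although it should not have been
    wrong_not_split: int   # not split although it should have been
    wrong_faulty: int      # split, but at the wrong position(s)


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Undefined (None) when nothing falls into the denominator.
    return numerator / denominator if denominator else None


def precision(c: SplitCounts) -> Optional[float]:
    return _ratio(c.correct_split,
                  c.correct_split + c.wrong_split + c.wrong_faulty)


def recall(c: SplitCounts) -> Optional[float]:
    return _ratio(c.correct_split,
                  c.correct_split + c.wrong_not_split + c.wrong_faulty)


def accuracy(c: SplitCounts) -> Optional[float]:
    total = (c.correct_split + c.correct_not_split + c.wrong_split
             + c.wrong_not_split + c.wrong_faulty)
    return _ratio(c.correct_split + c.correct_not_split, total)


if __name__ == "__main__":
    smor = SplitCounts(135_490, 4_112, 296, 5_681, 13_074)  # Smor row, Table 8.8
    for name, metric in (("precision", precision), ("recall", recall),
                         ("accuracy", accuracy)):
        print(f"{name}: {metric(smor):.2%}")
```

Under these formulas, the no split baseline has no defined precision (nothing is split), and its accuracy equals the share of nouns that must not be split, 4,408 of 158,653.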

It can be seen from Table 8.8 that Smor splits 135,490 of the 154,245 compounds correctly (87.84%), which is roughly 33 percentage points more than the basic frequency baseline yields (84,924 of 154,245, corresponding to 55.06%). Again, the splittings of Smor -d NN are the most conservative: it has the highest number of correct not split words (4,190) and the lowest number of wrong split words (218). Also noteworthy is the high number of wrong not split words of the POS-based splitting approach: 15,369, which we attribute to the high number of domain-specific (here: technical) single word terms that have not occurred in the lexicon which was used for filtering the data during pre-processing (see Section 6.3.1 for details). In terms of overall accuracy, Smor performs best on the external domain-specific gold standard, whereas Smor -d NN performed best on the translational correspondence gold standard and Smor -d performed best on the linguistic gold standard.

Translational Correspondence Gold Standard Results

splitting                  Correct                   Wrong                       Metrics
↓ approach        split  not split   split     not  faulty  precision   recall  accuracy
no split              0      4,851       0     149       0          –        –    97.02%
basic freq.          75      4,380     466      11      68     12.32%   48.70%    89.10%
extended freq.       85      4,624     222      16      53     23.61%   55.19%    94.18%
POS                  92      4,730     116      32      30     38.66%   59.74%    96.44%
Smor                122      4,664     182      11      21     37.54%   79.22%    95.72%
Smor -d             128      4,730     116      15      11     50.20%   83.12%    97.16%
Smor -d NN          121      4,773      73      22      11     59.02%   78.57%    97.88%

Linguistic Gold Standard Results

splitting                  Correct                   Wrong                       Metrics
↓ approach        split  not split   split     not  faulty  precision   recall  accuracy
no split              0      5,088       0   1,099       0          -    0.00%    82.23%
basic freq.         577      3,984   1,105      99     422     27.42%   52.55%    73.72%
extended freq.      634      4,598     491     241     223     47.03%   57.74%    84.56%
POS                 656      4,935     154     367      75     74.12%   59.74%    90.37%
Smor                967      4,877     212      50      81     76.75%   88.07%    94.46%
Smor -d             894      5,018      71     135      69     86.46%   81.42%    95.56%
Smor -d NN          671      5,054      35     375      52     88.52%   61.11%    92.53%

External Domain-Specific Gold Standard Results

splitting                  Correct                   Wrong                       Metrics
↓ approach        split  not split   split     not  faulty  precision   recall  accuracy
no split              0      4,408       0 154,245       0          0    2.78%     2.78%
basic freq.      84,924      1,006   3,402   1,439  67,882     54.37%   55.06%    54.16%
extended freq.  102,824      2,823   1,585  12,370  39,051     71.67%   66.66%    66.59%
POS             122,553      3,971     437  15,369  16,323     87.97%   79.45%    79.75%
Smor            135,490      4,112     296   5,681  13,074     91.02%   87.84%    87.99%
Smor -d         131,177      4,171     237   7,811  15,257     89.44%   85.04%    85.31%
Smor -d NN      130,838      4,190     218   8,174  15,233     89.44%   84.82%    85.11%

Table 8.9.: Accuracies of the different splitting approaches with respect to the different gold standards. The best numeric scores per column (separately for each gold standard) are boldfaced.
