
5.2. Integration of Compound Processing in SMT

[Figure 5.3 shows the SMT pipeline (training, tuning, testing) with compound splitting applied to the German source data before word alignment, n-gram extraction, and translation.]

Figure 5.3.: German to English translation: compound splitting is performed as a pre-processing step, prior to SMT training. Besides the compounds in the source-language sections of the parallel training data, the compounds in the tuning and testing data must also be split prior to translation.

[Figure 5.4 shows the SMT pipeline with compound splitting applied to the German target data before training, and compound merging applied as a post-processing step after translation, including within the MERT scoring loop.]

Figure 5.4.: English to German translation: in a pre-processing step, the compounds of the German target-language data are split, prior to SMT training. After translation into this split representation, a post-processing step is required to merge simple words into compounds. Note that tuning is performed against a reference translation in which compounds are not split. In each iteration of MERT, we thus merge simple words of the output into compounds and thereby integrate compound merging into the scoring process.

trained on original English and a split representation of the German data. If German is the source language of the system, the compounds of the tuning and testing sets must also be split before translation. An illustration of this preprocessing for the German to English translation direction is given in Figure 5.3. Note that the general SMT architecture remains unchanged (cf. Figure 4.2 on page 40).
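The splitting step of this preprocessing can be illustrated with a minimal corpus-frequency-based splitter in the spirit of Koehn and Knight (2003); the toy frequency table, the minimum part length, and the geometric-mean scoring below are illustrative assumptions, not the exact splitter used in this work.

```python
def split_compound(word, freq, min_len=4):
    """Return the highest-scoring split of `word` into two known parts,
    or [word] if no split beats the frequency of the unsplit word."""
    best, best_score = [word], freq.get(word, 0)
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in freq and right in freq:
            # geometric mean of the part frequencies
            score = (freq[left] * freq[right]) ** 0.5
            if score > best_score:
                best, best_score = [left, right], score
    return best

# toy frequency counts (illustrative)
freq = {"auto": 500, "bahn": 300, "autobahn": 50}
print(split_compound("autobahn", freq))   # -> ['auto', 'bahn']
```

In the pipeline of Figure 5.3, such a splitter would be applied uniformly to the German side of the training, tuning, and testing data.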

In the opposite translation direction, from English to German, a combination of pre- and postprocessing is required for appropriate compound processing. Again, compounds are split prior to training and the SMT models are trained on original English and a split representation of the German data. However, after translation, compounds must be re-merged before the output can be scored against a human reference translation. An illustration of this combined pre- and postprocessing is given in Figure 5.4. Note that tuning is performed against a human reference translation in which compounds are not split. In each iteration of MERT, we thus merge simple words of the output into compounds. This way, the quality of the compound merging (in terms of the number of words) is implicitly scored. At testing time, the compounds of the German output must be merged before the output is scored against a human reference translation.
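The merging step that feeds the MERT scoring loop can be sketched as follows, assuming the splitter leaves a marker (here a trailing "+") on every non-final compound part; this marker convention is an illustrative assumption, not the exact annotation scheme of this work.

```python
def merge_compounds(tokens, marker="+"):
    """Rejoin tokens whose marker signals a split compound part."""
    merged, buffer = [], ""
    for tok in tokens:
        if tok.endswith(marker):
            buffer += tok[: -len(marker)]     # accumulate modifier parts
        else:
            merged.append(buffer + tok)       # head word closes the compound
            buffer = ""
    if buffer:                                # dangling marked token
        merged.append(buffer)
    return merged

# split SMT output -> merged output for scoring against the reference
print(merge_compounds(["das", "auto+", "bahn", "netz"]))
# -> ['das', 'autobahn', 'netz']
```

In each MERT iteration, the split decoder output would be passed through such a merging step before being scored against the unsplit reference.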

Inflection Prediction Not only compound processing but also inflection prediction is performed within a pre- and a postprocessing step. For the sake of simplicity, the illustration in Figure 5.4 does not include the lemmatisation and re-inflection component. In the real pipeline, compounds are split and lemmatised prior to training, and in tuning the output is scored against a lemmatised version of the human reference translation in which compounds are not split. After translation of the test set, compounds are first merged and then inflection is predicted.

5.2.2. Other Approaches

In the following, we present alternative approaches for the integration of compound processing and inflection prediction in SMT. The first one restricts compound processing to word alignment and can be viewed as a variant of the pre-/postprocessing approach. The other two approaches, namely using lattices and synthesizing phrase tables, go beyond the modification of the training data and interfere with the SMT model to some extent.

Restriction to Word Alignment Popović et al. (2006) showed that compound splitting is beneficial for end-to-end SMT performance, even if it is only applied for word alignment. First, compounds are split prior to word alignment. Then, word alignment is performed as usual, on the original English data and the split German version of the data. The approach thus benefits from the advantages of compound splitting for word alignment, i.e. more 1:1 alignments at the word level and higher frequency counts for simple words. Before phrase extraction takes place, the positions of the English words pointing to component words of a compound are adjusted to the position of the compound. The phrase table is then built based on this adjusted word alignment and the original English and German data. The data preprocessing for this approach is restricted to the training data. If German is the source language of the translation pair, the tuning and testing input data can remain in their original format. For the opposite translation direction, from English into German, no post-processing (e.g. in the form of compound merging) is required. This is promising for situations where a compound splitter is available, but no tool for merging the compounds afterwards. In our work, we performed some initial experiments using the approach of Popović et al. (2006). However, in line with Popović et al. (2006), we soon found that the results could not improve over the more sophisticated compound splitting and merging pipelines we implemented.
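The alignment adjustment described above can be sketched as follows; the set-of-pairs alignment encoding and the `split_map` bookkeeping are illustrative assumptions, not the exact data structures of Popović et al. (2006).

```python
def adjust_alignment(alignment, split_map):
    """Map alignment points from split-token positions back to the
    positions of the original (unsplit) German tokens.

    alignment: set of (english_pos, split_german_pos) pairs
    split_map: for each original German token, the number of parts
               it was split into (1 = not split)
    """
    back = []                                  # split position -> original position
    for orig_idx, n_parts in enumerate(split_map):
        back.extend([orig_idx] * n_parts)
    return {(en, back[de]) for en, de in alignment}

# "guest room" aligned to the two parts of the split compound "Gast|raum"
alignment = {(0, 0), (1, 1)}
split_map = [2]                                # one compound, split into two parts
print(sorted(adjust_alignment(alignment, split_map)))  # -> [(0, 0), (1, 0)]
```

Phrase extraction then operates on the original (unsplit) data together with this adjusted alignment.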

Lattice-based Approach Many German compounds are ambiguous: depending on the application and the context in which the compound occurs, multiple different splitting options might be suitable. For example, this concerns n-ary compounds with n > 2, or parasite words like “Gastraum”, which, depending on their context, can be split into either “Gast|raum” (= “guest|room”) or “Gas|traum” (= “gas|dream”). In some cases, the compound should be left unsplit, e.g. because it is lexicalised, has non-compositional semantics, or coincides with a proper noun.17

Dyer et al. (2008) present a translation approach based on lattices, in which different splitting options are stored compactly. Their approach allows multiple different splitting options of compounds to be kept during training. At testing time, the final SMT model can select the most suitable splitting for the current context. While the initial experiments of Dyer et al. (2008) focused on morpheme segmentations of Russian and Chinese, Dyer (2009) extends the approach to German compounds.
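The idea of storing splitting options compactly can be illustrated with a toy lattice builder; the arc encoding below is a deliberately simplified stand-in for the actual lattice input format of Dyer et al. (2008).

```python
def build_lattice(options):
    """Encode alternative token sequences for one word as lattice arcs.

    Node 0 is the start node, node 1 the end node; fresh interior
    nodes are allocated for multi-token options. Each arc is a
    (from_node, to_node, token) triple.
    """
    arcs, next_node = [], 2
    for tokens in options:
        prev = 0
        for i, tok in enumerate(tokens):
            if i == len(tokens) - 1:
                arcs.append((prev, 1, tok))   # last part reaches the end node
            else:
                arcs.append((prev, next_node, tok))
                prev, next_node = next_node, next_node + 1
    return arcs

# the ambiguous compound "Gastraum" with three splitting options
arcs = build_lattice([["Gastraum"], ["Gast", "raum"], ["Gas", "traum"]])
print(arcs)
# -> [(0, 1, 'Gastraum'), (0, 2, 'Gast'), (2, 1, 'raum'), (0, 3, 'Gas'), (3, 1, 'traum')]
```

During decoding, each path from the start node to the end node corresponds to one splitting hypothesis, and the SMT model selects among them in context.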

Phrase Table Synthesis More recently, Chahuneau et al. (2013) presented an approach to integrate inflection and derivation into SMT through phrase table synthesis. The approach is conceptually language-independent; Chahuneau et al. (2013) report on experiments for the translation pairs English to Russian, Hebrew, and Swahili.

The approach works as follows: first, two translation models are trained, one on the original text and one on a stemmed version of the text. For the phrases of the latter model, inflections are predicted based on the context of the phrase in the source sentence and generated either with a rule-based analyser or an unsupervised approach. The resulting phrases are called synthetic phrases. This procedure makes it possible to generate phrases that have not occurred in the parallel training data. The final SMT model then combines the phrase table extracted from the original text with the synthetic phrases.
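Very schematically, synthesizing an inflected phrase from a stemmed phrase might look as follows; the toy suffix-based inflector is an illustrative stand-in for the context-sensitive inflection models (rule-based or unsupervised) of Chahuneau et al. (2013).

```python
def synthesize_phrase(source_phrase, target_stems, predict_inflection):
    """Re-inflect a stemmed target phrase to obtain a synthetic phrase."""
    inflected = [predict_inflection(stem, source_phrase) for stem in target_stems]
    return (source_phrase, " ".join(inflected))

def toy_inflector(stem, source_context):
    # grossly simplified assumption: a "to the ..." source context
    # triggers the German plural suffix "-en"
    return stem + "en" if source_context.startswith("to the") else stem

print(synthesize_phrase("to the women", ["Frau"], toy_inflector))
# -> ('to the women', 'Frauen')
```

The resulting synthetic phrase pair can then be added to the phrase table even if the inflected form never co-occurred with the source phrase in the parallel training data.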

17 We introduced all of these and other characteristics of German compounds in Section 2.1.