
UNIVERSITAS SARAVIENSIS

Saarland University

Faculty of Natural Sciences and Technology I
Department of Computer Science

Diplomarbeit

Very large language models for machine translation

vorgelegt von

Christian Federmann am 31.07.2007

angefertigt unter der Leitung von

Prof. Dr. Hans Uszkoreit, Saarland University/DFKI GmbH

betreut von

Dr. Andreas Eisele, Saarland University/DFKI GmbH

begutachtet von

Prof. Dr. Hans Uszkoreit, Saarland University/DFKI GmbH
Prof. Dr. Reinhard Wilhelm, Saarland University


Erklärung

Hiermit erkläre ich, dass ich die vorliegende Arbeit selbstständig verfasst und alle verwendeten Quellen angegeben habe.

Saarbrücken, den 31. Juli 2007

Christian Federmann

Einverständniserklärung

Hiermit erkläre ich mich damit einverstanden, dass meine Arbeit in den Bestand der Bibliothek der Fachrichtung Informatik aufgenommen wird.

Saarbrücken, den 31. Juli 2007

Christian Federmann


Abstract

Current state-of-the-art statistical machine translation relies on statistical language models which are based on n-grams and model language data using a Markov approach. The quality of the n-gram models depends on the n-gram order which is chosen when the model is trained.

As machine translation is of increasing importance, we have investigated extensions to improve language model quality.

This thesis will present a new type of language model which allows the integration of very large language models into the Moses MT framework. This approach creates an index from the complete n-gram data of a given language model and loads only this index data into memory.

Actual n-gram data is retrieved dynamically from hard disk. The amount of memory that is required to store such an indexed language model can be controlled by the indexing parameters that are chosen to create the index data.

Further work done for this thesis included the creation of a standalone language model server.

The current implementation of the Moses decoder is not able to keep language model data available in memory; instead, it is forced to re-load this data each time the decoder application is started. Our new language model server moves language model handling into a dedicated process. This approach allows us to load n-gram data from a network or internet server and can also be used to export language model data to other applications using a simple communication protocol.

We conclude the thesis work by creating a very large language model out of the n-gram data contained within the Google 5-gram corpus released in 2006. Current limitations within the Moses MT framework hindered our evaluation efforts; hence no conclusive results can be reported. Instead, further work and improvements to the Moses decoder have been identified to be required before the full potential of very large language models can be efficiently exploited.


Zusammenfassung

Der momentane Stand der Technik im Bereich der statistischen Maschinenübersetzung stützt sich unter anderem auf die Verwendung von statistischen Sprachmodellen. Diese basieren auf N-grammen und modellieren Sprachdaten mittels eines Markow-Modells. Die Qualität eines solchen N-gram-Modells hängt von der verwendeten Ordnung des Modells ab, die beim Training des Sprachmodells gewählt wurde. Da Maschinenübersetzung von wachsender Bedeutung ist, haben wir verschiedene Möglichkeiten zur Verbesserung der Qualität von statistischen Sprachmodellen untersucht.

Diese Diplomarbeit stellt eine neue Art von Sprachmodell vor, die es uns erlaubt, sehr große Sprachmodelle in das Moses MT Framework zu integrieren. Unser Ansatz erstellt zuerst einen Index der N-gram Daten eines gegebenen Sprachmodelles und lädt später nur diese Indexdaten in den Speicher. Die tatsächlichen N-gram Daten werden zur Laufzeit des Decoders dynamisch von der Festplatte in den Speicher geladen. Hierbei kann die Größe des Speichers, der für die Verwendung eines solchen indizierten Sprachmodells benötigt wird, gezielt durch entsprechend gewählte Indizierungsparameter kontrolliert werden.

Neben dem indizierten Sprachmodell beinhaltete diese Diplomarbeit ebenso die Erstellung eines unabhängigen Sprachmodellservers. Die gegenwärtige Implementierung des Moses Decoders ist nicht dazu in der Lage, die N-gram Daten eines Sprachmodells im Speicher zu halten. Stattdessen ist Moses gezwungen, diese Daten bei jedem Neustart erneut von der Festplatte zu laden. Unser neuer Sprachmodellserver verschiebt die Verarbeitung der Sprachmodelldaten in einen eigenen, dedizierten Prozess. Dieser Ansatz erlaubt es dann, jene Daten aus dem Netzwerk oder sogar von einem Server aus dem Internet zu laden und kann zudem verwendet werden, um die Sprachmodelldaten anderen Anwendungen zugänglich zu machen.

Hierf¨ur steht ein einfaches, text-basiertes Kommunikationsprotokoll bereit.

Abschließend haben wir versucht, ein gigantisches Sprachmodell aus den im letzten Jahr von Google veröffentlichten N-gram Daten des Google 5-gram Corpus zu generieren. Momentane Beschränkungen des Moses MT Frameworks behinderten unsere Evaluationsbemühungen, daher können keine abschließenden Ergebnisse angegeben werden. Stattdessen haben wir weitergehende Verbesserungsmöglichkeiten am Moses Decoder identifiziert, die notwendig sind, bevor das gesamte Potenzial sehr großer Sprachmodelle ausgeschöpft werden kann.


Acknowledgements

I would like to express my gratitude to my supervisors Andreas Eisele and Hans Uszkoreit who provided me with a challenging and interesting topic for my diploma thesis. In the same way I want to thank Reinhard Wilhelm for his willingness to examine this thesis.

Andreas' constant support and mentorship, his encouragement and continuous guidance have greatly contributed to the success of this work and are highly appreciated. Thanks a lot!

I also want to take the time to thank all those who have aided me in the creation of this thesis work. In alphabetical order, these are Stephan Busemann, Bertold Crysmann, Bernd Kiefer, Marc Schröder, and Hendrik Zender. I am also indebted to everyone at the Language Technology lab at DFKI.

Last but not least, I want to thank all my family and friends who have supported me over the years and thus had an important share in the successful completion of this thesis. In random order, these would be my grandmother, my grandfather, my parents, my sister Maike, my brother Alexander, my girlfriend Kira and the whole crazy bunch of friends out there, you know who you are. Thank you very much, your help is greatly appreciated.


Contents


Abstract iii

Zusammenfassung iv

Acknowledgements v

1 Introduction 1

1.1 Motivation . . . 1

1.1.1 Statistical Language Models . . . 1

1.2 State-of-the-art Language Models . . . 2

1.3 Decoder Startup Times . . . 4

1.4 Thesis Goals . . . 4

1.5 Thesis Overview . . . 5

2 Building A Baseline System 7
2.1 Motivation . . . 7

2.2 Requirements . . . 8

2.3 SRILM Toolkit . . . 8

2.3.1 Description . . . 8

2.3.2 Software . . . 8

2.3.3 Installation . . . 9

2.3.4 Usage . . . 10

2.4 GIZA++ & mkcls . . . 11

2.4.1 Description . . . 11

2.4.2 Software . . . 11

2.4.3 Installation . . . 11

2.4.4 Usage . . . 12

2.5 Moses Decoder . . . 12

2.5.1 Description . . . 12

2.5.2 Software . . . 12


2.5.3 Installation . . . 13

2.5.4 Usage . . . 13

2.5.5 Additional Requirements . . . 14

2.6 Basic Training . . . 14

2.6.1 Preparational Steps . . . 14

2.6.2 Training Step . . . 14

2.7 Minimum Error Rate Training . . . 15

2.8 Evaluation . . . 15

2.9 Summary . . . 16

3 N-gram Indexing 17
3.1 Motivation . . . 17

3.1.1 Possible Solutions . . . 18

3.1.2 Character-level N-gram Indexing . . . 18

3.1.3 Definition: N-gram Prefix . . . 19

3.1.4 An Indexing Example . . . 19

3.2 Basic Algorithm . . . 21

3.3 Indexing Methods . . . 22

3.3.1 Increasing Indexing . . . 22

3.3.2 Decreasing Indexing . . . 22

3.3.3 Uniform Indexing . . . 23

3.3.4 Custom Indexing . . . 23

3.4 Evaluation . . . 24

3.4.1 Definition: Compression Rate . . . 24

3.4.2 Definition: Large Subset . . . 24

3.4.3 Definition: Large Subset Rate . . . 24

3.4.4 Definition: Compression Gain . . . 25

3.4.5 Evaluation Concept . . . 25

3.4.6 Evaluation Results . . . 25

3.5 Indexer Tool . . . 29

3.6 File Formats . . . 30

3.6.1 Index Data Format . . . 30

3.6.2 Unigram Vocabulary Format . . . 31

3.7 Summary . . . 32

4 An Indexed Language Model 33
4.1 Motivation . . . 33

4.2 General Design . . . 33


4.3.1 Vocabulary Data . . . 34

4.3.2 Index Data . . . 35

4.3.3 Comparison of the Different Index Data Structures . . . 38

4.3.4 Final Implementation . . . 38

4.3.5 N-gram Cache . . . 39

4.3.6 N-gram Retrieval . . . 40

4.3.7 Retrieval Algorithm . . . 40

4.4 LanguageModelIndexed . . . 42

4.4.1 Interaction with IndexedLM . . . 42

4.4.2 Interaction with Moses . . . 43

4.4.3 Moses Integration . . . 44

4.5 Comparison to the SRILM Model . . . 44

4.5.1 Performance . . . 44

4.6 IndexedLM vs. SRILM . . . 45

4.7 Summary . . . 47

5 A Standalone Language Model Server 49
5.1 Motivation . . . 49

5.1.1 Further Applications . . . 49

5.2 General Design . . . 50

5.3 Server Modes . . . 51

5.4 TCP Server . . . 51

5.4.1 Advantages . . . 51

5.4.2 Disadvantages . . . 51

5.4.3 Overview . . . 52

5.5 IPC Server . . . 53

5.5.1 Advantages . . . 53

5.5.2 Disadvantages . . . 53

5.6 Server Mode Comparison . . . 55

5.7 Protocol . . . 55

5.7.1 Request Format . . . 55

5.7.2 Result Format . . . 55

5.7.3 Protocol Commands . . . 56

5.7.4 Description . . . 56

5.8 LanguageModelRemote . . . 58

5.8.1 Interaction with Moses . . . 58

5.8.2 Moses Integration . . . 59

5.9 Comparison to the SRILM Model . . . 59

5.9.1 Performance . . . 59


5.9.2 Limitations . . . 59

5.10 Summary . . . 60

6 A Google 5-gram Language Model 61
6.1 Motivation . . . 61

6.2 Google 5-gram Corpus . . . 61

6.3 Corpus Preparation . . . 62

6.4 Language Model Generation . . . 62

6.5 Indexing . . . 63

6.6 Index Merging . . . 64

6.7 Evaluation . . . 65

6.8 Summary . . . 66

7 Conclusion 67
7.1 Work Done . . . 67

7.1.1 Indexed Language Model . . . 67

7.1.2 Language Model Server . . . 68

7.2 Lessons Learnt . . . 69

7.2.1 Indexed Language Model . . . 69

7.2.2 Language Model Server . . . 69

7.2.3 Google Language Model . . . 69

7.3 Future Work . . . 70

7.3.1 Improved Performance . . . 70

7.3.2 Separation of Language Model Data . . . 70

7.3.3 Batched N-gram Requests . . . 70

7.3.4 More Flexible Phrase-tables . . . 71

7.3.5 Hybrid Language Models . . . 71

Appendix Introduction 73
Source Code License . . . 73

A N-gram Indexing Code 75
A.1 Class: Indexer . . . 75

A.1.1 Constants . . . 75

A.1.2 Typedefs . . . 76

A.1.3 Public Interface . . . 76

A.1.4 Private Interface . . . 77

A.1.5 Data Members . . . 78

A.2 Program: Main Loop . . . 79


A.3 Struct: IndexData . . . 80

A.3.1 Struct Definition . . . 80

A.4 Features . . . 80

A.4.1 Autoflush . . . 80

A.4.2 Sorted Model Files . . . 81

B Indexed Language Model Code 83
B.1 Class: IndexedLM . . . 83

B.1.1 Typedefs . . . 83

B.1.2 Public Interface . . . 84

B.1.3 Private Interface . . . 86

B.1.4 Data Members . . . 86

B.1.5 Struct: NGramData . . . 87

B.2 Class: IndexTree . . . 88

B.2.1 Typedefs . . . 88

B.2.2 Public Interface . . . 88

B.2.3 Data Members . . . 90

B.2.4 Struct: ExtendedIndexData . . . 90

B.3 Class: NgramTree . . . 90

B.3.1 Typedefs . . . 91

B.3.2 Public Interface . . . 91

B.3.3 Private Interface . . . 93

B.3.4 Data Members . . . 93

C Language Model Server Code 95
C.1 Class: LanguageModelServer . . . 95

C.1.1 Constants . . . 95

C.1.2 Typedefs . . . 96

C.1.3 Public Interface . . . 96

C.1.4 Private Interface . . . 98

C.1.5 General Data Members . . . 98

C.1.6 TCP Data Members . . . 99

C.1.7 IPC Data Members . . . 99

C.2 Program: Main Loop . . . 100

C.2.1 Code . . . 100

C.3 TCP Implementation . . . 101

C.4 IPC Implementation . . . 101

D Tables 103


Bibliography 114


List of Figures


3.1 Relation between n-gram prefixes and n-grams in a language model file . . . . 20

3.2 Increasing indexing: compression gain (y) for increasing subset threshold (x) 26
3.3 Decreasing indexing: compression gain (y) for increasing subset threshold (x) 27
3.4 Uniform indexing: compression gain (y) for increasing subset threshold (x) . 28
4.1 Design of an indexed language model . . . 34

4.2 IndexedLM cache overview . . . 39

4.3 Interactions between LanguageModelIndexed and IndexedLM . . . 43

4.4 Interactions between Moses and LanguageModelIndexed . . . 43

5.1 Design of a language model server . . . 50

5.2 Flow chart of the TCP server mode . . . 52

5.3 Flow chart of the IPC server mode . . . 54

5.4 Interactions between Moses and LanguageModelRemote . . . 58


List of Tables


1.1 Influence of increasing n-gram order . . . 2

1.2 Performance loss introduced by language model loading . . . 4

3.1 Character-level n-gram prefixes example . . . 19

3.2 Increasing Indexing example . . . 22

3.3 Decreasing Indexing example . . . 22

3.4 Uniform Indexing example . . . 23

3.5 Custom Indexing example . . . 23

4.1 Subset data for language model files . . . 35

4.2 Binary format for language model files . . . 36

4.3 Binary tree format for language model files . . . 37

4.4 C++ std::map index data structure . . . 38

4.5 Custom index tree index data structure . . . 38

4.6 Index tree with binary model index data structure . . . 38

4.7 Index tree with binary tree model index data structure . . . 38

4.8 N-gram retrieval example, all scores in log10 format . . . 41

4.9 N-gram probability construction, all scores in log10 format . . . 42

4.10 Changes to the Moses framework . . . 44

4.11 Additions to the Moses framework . . . 44

4.12 Overview of all evaluation language models . . . 45

4.13 SRI language model performance within the Moses MT framework . . . 46

4.14 Indexed language model performance within the Moses MT framework . . . . 46

5.1 Comparison of language model server modes . . . 55

5.2 Protocol commands for the language model server . . . 56

5.3 Changes to the Moses framework . . . 59

5.4 Additions to the Moses framework . . . 59

6.1 Google 5-gram corpus counts . . . 62


D.1 Increasing Indexing with Γ = [1,0,0,0,0] . . . 104

D.2 Increasing Indexing with Γ = [1,2,0,0,0] . . . 104

D.3 Increasing Indexing with Γ = [1,2,3,0,0] . . . 104

D.4 Increasing Indexing with Γ = [1,2,3,4,0] . . . 105

D.5 Increasing Indexing with Γ = [1,2,3,4,5] . . . 105

D.6 Decreasing Indexing with Γ = [1,0,0,0,0] . . . 105

D.7 Decreasing Indexing with Γ = [2,1,0,0,0] . . . 106

D.8 Decreasing Indexing with Γ = [3,2,1,0,0] . . . 106

D.9 Decreasing Indexing with Γ = [4,3,2,1,0] . . . 106

D.10 Decreasing Indexing with Γ = [5,4,3,2,1] . . . 107

D.11 Uniform Indexing with Γ = [1,1,1,1,1] . . . 107

D.12 Uniform Indexing with Γ = [2,2,2,2,2] . . . 107

D.13 Uniform Indexing with Γ = [3,3,3,3,3] . . . 108

D.14 Uniform Indexing with Γ = [4,4,4,4,4] . . . 108

D.15 Uniform Indexing with Γ = [5,5,5,5,5] . . . 108

D.16 Custom Indexing with Γ = [1,1,1,1,0] . . . 109

D.17 Custom Indexing with Γ = [1,1,1,0,0] . . . 109

D.18 Custom Indexing with Γ = [1,1,0,0,0] . . . 109

D.19 Custom Indexing with Γ = [2,2,2,2,0] . . . 110

D.20 Custom Indexing with Γ = [2,2,2,0,0] . . . 110

D.21 Custom Indexing with Γ = [2,2,0,0,0] . . . 110

D.22 Custom Indexing with Γ = [3,3,3,0,0] . . . 111

D.23 Custom Indexing with Γ = [3,3,0,0,0] . . . 111

D.24 Custom Indexing with Γ = [4,4,0,0,0] . . . 111

D.25 Custom Indexing with Γ = [2,1,1,0,0] . . . 112

D.26 Custom Indexing with Γ = [3,2,2,0,0] . . . 112

D.27 Custom Indexing with Γ = [3,1,1,0,0] . . . 112

D.28 Custom Indexing with Γ = [3,1,0,0,0] . . . 113

D.29 Custom Indexing with Γ = [1,2,2,0,0] . . . 113

D.30 Custom Indexing with Γ = [2,3,0,0,0] . . . 113


Chapter 1 Introduction

1.1 Motivation

Statistical machine translation (SMT) has proven to be able to create usable translations given a sufficient amount of training data. As more and more training data has become available over the last years, and as there exists an increasing demand for shallow translation systems, statistical machine translation represents one of the most interesting and active topics in current research on natural language processing [Callison-Burch and Koehn, 2005], [Koehn et al., 2007].

A typical SMT system is divided into two core modules: the translation model and the language model. The translation model takes a given source sentence and creates possible translation hypotheses together with corresponding scores which describe the probability of each of the hypotheses with regard to the source sentence. Thereafter, the hypotheses are sent to a statistical language model which rates each of the possibilities and assigns an additional score describing the likelihood of the given hypothesis in natural language text. The weighted combination of these scores is then used to determine the most likely translation.

1.1.1 Statistical Language Models

The term statistical language model describes a family of language models which model natural languages using the statistical properties of n-grams. N-grams are sequences of single words.

For these models it is assumed that each word depends only on a context of (n-1) words instead of the full corpus. This Markov model assumption greatly simplifies the problem of language model training, thus enabling us to use language modeling for statistical machine translation systems.
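For illustration, the standard n-gram factorization behind this assumption (a textbook formulation, not a formula taken from this thesis) approximates the probability of a word sequence w_1 ... w_m as

\[ P(w_1, \ldots, w_m) \;\approx\; \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}), \]

i.e. each word is conditioned only on the (n-1) preceding words.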


First applications of n-gram language models were developed in the field of automatic speech recognition (ASR) [Jelinek, 1999]. The general approach of statistical language modeling has the advantage that it is not bound to a certain application area; in fact, it is possible to use the same language model for SMT, ASR, or OCR.

There exist several known limitations of n-gram models; most notably, they are not able to model long-range dependencies as they can only explicitly support dependency ranges of up to (n-1) tokens. Furthermore, Markov models have been criticized for not capturing the performance/competence distinction introduced by Noam Chomsky.

However, empirical experiments have shown that statistical language modeling can be applied to create usable translations. While purely linguistic theory approaches usually model only language competence, statistical n-gram models also implicitly include language performance features. Hence, whenever real-world applications are developed, the pragmatic n-gram approach is preferable.

The quality of n-gram language modeling can directly be improved by training the models on larger text corpora. This is a property of the statistical nature of the approach. As we stated above large amounts of training data have become available over the last years, data which can now be used to build better statistical language models. This thesis work will concentrate on improving statistical language models for SMT.

1.2 State-of-the-art Language Models

As we have argued before, it has become a lot easier to collect large amounts of monolingual training data for language model creation. This enables us to use fourgrams or even fivegrams instead of just trigrams. In order to show an improved translation quality using such higher order n-grams, we have trained a baseline MT system and evaluated the overall performance using three different English language models: a trigram model, a fourgram model, and a fivegram language model.

n-gram order    BLEU score    runtime [s]
3               37.14         111
4               39.14         136
5               39.7          147

Table 1.1: Influence of increasing n-gram order

Table 1.1 lists the BLEU scores [Papineni et al., 2002] and the overall runtime in seconds.


All three language models have been trained from the same fraction of the Europarl corpus [Koehn, 2005]; only the maximum n-gram order differed. The tests were performed using a set of 100 sentences.

We can observe an improved translation quality with increasing language model n-gram order.

At the same time, performance decreases as decoding becomes more complex when n-grams of higher order are used. Considering the fact that the performance loss is not too severe and an improvement to the translation quality is very welcome, it seems to be a sound assumption that fivegrams should be considered state-of-the-art for statistical machine translation.

As current language models are generated from enormous amounts of training data, they will also need large amounts of memory to be usable within current machine translation systems. The drawback of the SRI [Stolcke, 2002] language model implementation lies in the fact that all n-gram data has to be loaded into memory at the same time, even if only a fraction of all this data is needed to translate the given source text.

Computer memory is becoming cheaper; however, it remains an expensive resource, especially when compared to the low cost of fast hard disks. Therefore, it seems worthwhile to investigate new methods for language model handling which try to reduce memory usage.

Instead of using a single machine with large amounts of memory to translate a given source text, it is perfectly possible to distribute this task to several machines which only translate parts of the source text.

With the standard SRI language model, all these machines would require the same large amount of memory to handle a large language model. If instead we could use another language model with reduced memory requirements even for large n-gram sets, all computers within our cluster could be equipped with less memory and would still be able to translate their share of the source text. As this cluster solution would redistribute the whole workload to several machines, it would even be possible for the new language model to take more time to complete compared to the original SRILM implementation.

Clusters of several dozens or hundreds of machines are often already available in large companies or institutions. These could easily be used to translate a given source text with such a new language model; the only requirement would be an adaptation to the actual amount of memory available on each of the cluster nodes. Our new language model would create an index on the full n-gram data set and only load the index data into memory; the size of this index could be controlled to fit the memory limitations within our cluster.


As we have shown above, fivegram language models help to create better translations. However, they require more memory, which could lead to problems once really large language models are built. As memory is an expensive resource, we propose a new, indexed language model instead which can be adapted to require less memory than the corresponding language model in SRI format.

1.3 Decoder Startup Times

The current implementation of the Moses decoder [Koehn et al., 2007] also suffers from slow startup times which are caused by language model loading. Even if the decoder uses the same language model for a number of translation requests, it still has to load the full n-gram data from hard disk each time it is started.

The following table shows how much of the actual decoder runtime is required to perform startup tasks, which include the loading of language model data and phrase-table handling. It compares these values to the full decoder runtime and shows the percentage of runtime lost to loading the language model. The tests were performed using the same set of 100 sentences as in Section 1.2.

n-gram order    model size    startup [s]    runtime [s]    loss
3               90 MB         172            274            63%
4               145 MB        157            273            58%
5               185 MB        170            293            58%

Table 1.2: Performance loss introduced by language model loading

These results clearly show that the Moses MT framework could benefit from a new language model class which would only interact with a preloaded language model instead of loading all n-gram data again and again. Such a preloaded language model could be hosted by a dedicated server; access would be possible from the same machine or even over a network.

1.4 Thesis Goals

In this diploma thesis we will describe a new language model class within the Moses MT framework. This language model class differs from the well-known SRI model as it does not load all n-gram data into memory but only an index of all n-grams contained within the model data. Actual n-gram data is loaded dynamically from hard disk. This aims to reduce memory requirements.


We will implement the indexed language model as a new class inside the existing Moses MT framework. The Moses decoder was chosen as it is the current state of the art in machine translation. Even better, it is developed and maintained as an open source community project and can be easily extended.

In addition to the new language model class, we will develop a second extension, a language model server, which can be used to host language model data using a dedicated server. As we have seen above, decoder startup times are slowed down by the loading of n-gram data. They could be reduced by the introduction of a language model server.

Another advantage lies in the fact that decoder and language model would not have to be located on the same machine anymore; the language model server could also be hosted on a local network or the internet. Finally, this could enable language model usage in other applications which are not built using the Moses MT framework.

1.5 Thesis Overview

The complete thesis is divided into seven chapters which are described below:

• Chapter 1: Introduction motivates the development of a new indexed language model and a language model server application to be used with the Moses MT framework. These enable the usage of very large language models with the Moses decoder.

• Chapter 2: Building A Baseline System describes how a baseline translation system can be set up on top of the Moses MT framework. This includes information on how to compile the SRILM toolkit and the Moses code. Both language model training and phrase-table generation are discussed, and examples of how to translate texts and evaluate these translations are given.

• Chapter 3: N-gram Indexing defines the notion of character-level n-gram prefixes. These can be used to efficiently index large sets of n-gram data and can hence be used for the indexed language model. Several indexing methods are discussed and compared with respect to compression rate and actual compression gain. Additionally, the implementation of the Indexer tool is described.

• Chapter 4: An Indexed Language Model shows the design and implementation of an indexed language model class within the Moses MT framework. Several possible data structures for the index data are presented and compared, integration into and interaction with the Moses decoder are discussed, and compatibility with the original SRILM implementation is evaluated.

• Chapter 5: A Standalone Language Model Server describes how we created a standalone language model server application which can be queried from the Moses decoder or other applications. Access to the server is possible using either TCP/IP connections or IPC shared memory methods. A simple server protocol is designed which can be used to look up n-gram data.

• Chapter 6: A Google 5-gram Language Model shows how a very large language model can be created using the Google fivegram corpus which was released in late 2006. As training with the SRILM toolkit failed, only a basic language model could be created. Translation quality is evaluated and limitations are discussed.

• Chapter 7: Conclusion summarizes what we have learned while working on this diploma thesis. It describes what has been achieved and what has not been possible to do. Finally, it discusses possible future extensions of the indexed language model and the language model server application.


Chapter 2

Building A Baseline System

2.1 Motivation

Not too long ago, statistical machine translation software was expensive, closed-source, and inflexible. The long-time standard, the Pharaoh decoder [Koehn, 2004a] by Philipp Koehn, was only available in binary form, and so even small modifications to the decoding system were not possible. Luckily, things have changed and improved in recent years.

With the introduction of the Moses decoder, which was developed by a large group of volunteers led by Philipp Koehn and Hieu Hoang [Koehn et al., 2007], a fully compatible MT system became available that was open for community modifications. This enables us to integrate new ideas into the decoding system and to try to create better translations.

However, before actually integrating support for very large language models into the Moses code, it is necessary to create a baseline system and evaluate performance and translation quality of this system. These values can then later be used to compare the newly created language model code to the current state of the Moses decoder.

We will briefly discuss the steps which are needed to create such a baseline system on the following pages.


2.2 Requirements

The baseline system uses the Moses decoder; language models are created using the SRILM toolkit, and word alignment during training is done with GIZA++. All these can be freely downloaded from the internet and are robust, well-tested tools.

Our baseline system has been installed on a Linux machine with 32GB of RAM and four Dual Core AMD Opteron 885 CPUs. A second system has been installed on a 1.83 GHz MacBook with 2GB of RAM. It should also be possible to set up and train such a system on a Windows machine as all software is available for Windows as well; however, this will not be explained in this document.

2.3 SRILM Toolkit

2.3.1 Description

The SRILM toolkit [Stolcke, 2002] makes it possible to create and apply statistical language models for use in statistical machine translation, speech recognition, and statistical tagging and segmentation. For machine translation, SRILM language models are currently the gold standard, and support for them is already available inside the Moses MT system. The SRILM toolkit was designed and implemented by Andreas Stolcke.

2.3.2 Software

The SRILM toolkit can be obtained from the internet at:

http://www.speech.sri.com/projects/srilm/

The toolkit may be downloaded free of charge under an open source community license, meaning that it can be used freely for non-profit purposes. A mailing list for support and exchange of information between SRILM users is available through:

srilm-user@speech.sri.com


2.3.3 Installation

Assuming the downloaded SRILM archive file is named srilm.tar.gz, it can be installed as follows:

$ export BASELINE=/home/cfedermann/diploma

$ mkdir $BASELINE/srilm

$ cp srilm.tar.gz $BASELINE/srilm

$ cd $BASELINE/srilm

$ tar xzf srilm.tar.gz

This will create a folder named srilm containing the SRILM code inside the $BASELINE folder. In order to compile this code, the SRILM variable inside $BASELINE/srilm/Makefile has to be configured.

SRILM = /home/cfedermann/diploma/srilm

It is also necessary to disable TCL usage and any special optimizations (like -mtune=pentium3) inside the Makefile for the specific target machine. For example, if we try to build the SRILM toolkit on Mac OS X, we change $BASELINE/srilm/common/Makefile.machine.macosx. We add the following line to the Makefile and remove the existing TCL_INCLUDE and TCL_LIBRARY definitions if available.

NO_TCL = X

# TCL_INCLUDE =

# TCL_LIBRARY =

Now, the SRILM code can be compiled and installed:

$ cd $SRILM

$ make World

It is crucial to verify that the compilation process worked correctly; otherwise the baseline system will not function properly. We can check this by calling the tests inside the $BASELINE/srilm/test folder:

$ cd $SRILM/test

$ make all

In case of errors or any other problems during the installation process, there exists more detailed documentation in the file INSTALL inside the SRILM folder. It is recommended to add the SRILM binary folder, e.g. $SRILM/bin/macosx, to the global $PATH variable.

(28)

Chapter 2 Building A Baseline System

2.3.4 Usage

The most important command of the SRILM toolkit is the ngram-count tool, which counts n-grams and estimates language models. There exist several command line switches to fine-tune the resulting language model; we will explain only some of them here. For more information refer to the respective man page:

$ man ngram-count

A sorted 5-gram language model from a given English corpus in en.corpus can be created using the following command:

$ ngram-count -sort -order 5 -interpolate -kndiscount\

-text en.corpus -lm en.srilm

The command line switches are explained below:

• -sort outputs n-gram counts in lexicographic order. This can be required for other tools within the SRILM toolkit and is mandatory for the indexed language model that will be presented later in this thesis.

• -order n sets the maximal order (or length) of n-grams to count. This also determines the order of the language model.

• -interpolate causes the discounted n-gram probability estimates at the specified order n to be interpolated with estimates of lower order.

• -kndiscount activates Chen and Goodman's modified Kneser-Ney discounting for n-grams.

• -text specifies the source file from which the language model data is estimated. This file should contain one sentence per line; empty lines are ignored.

• -lm specifies the target file to which the language model data is written.


2.4 GIZA++ & mkcls

2.4.1 Description

GIZA++ [Och and Ney, 2003] is a tool to compute word alignments between two sentence-aligned corpora; mkcls [Och, 1999] is a tool to train word classes using a maximum-likelihood criterion. Both tools were designed and implemented by Franz Josef Och.

2.4.2 Software

The original version of GIZA++ and mkcls can be downloaded at:

http://www.fjoch.com/GIZA++.html http://www.fjoch.com/mkcls.html

These versions of GIZA++ and mkcls will not compile with newer g++ 4.x compilers which are standard on modern computer systems. There exist patched versions by Chris Dyer which resolve these issues. They are available from:

http://ling.umd.edu/~redpony/software/

2.4.3 Installation

Assuming the GIZA++ and mkcls archive files are named GIZA++.tar.gz and mkcls.tar.gz, they can be installed like this:

$ export BASELINE=/home/cfedermann/diploma

$ cd $BASELINE

$ tar xzf GIZA++.tar.gz

$ tar xzf mkcls.tar.gz

This will create two folders GIZA++-v2 and mkcls-v2 inside the $BASELINE folder. In order to compile the code for mkcls, it is just necessary to type:

$ cd $BASELINE/mkcls-v2

$ make

For GIZA++, a small change is required. The compiler flag -DBINARY_SEARCH_FOR_TTABLE has to be added to the CFLAGS definition inside the original Makefile:

CFLAGS = $(CFLAGS_GLOBAL) -Wall -W -Wno-deprecated\

-DBINARY_SEARCH_FOR_TTABLE


GIZA++ and its accompanying tools can then be compiled using:

$ cd $BASELINE/GIZA++-v2

$ make opt snt2cooc.out

The resulting binaries are called mkcls, GIZA++, snt2plain.out, plain2snt.out, and snt2cooc.out.

In order to make life a little easier, these will be copied to a central tools folder which is placed inside the $BASELINE folder:

$ export TOOLS=$BASELINE/tools

$ cd $BASELINE/GIZA++-v2

$ cp GIZA++ *.out $TOOLS

$ cp $BASELINE/mkcls-v2/mkcls $TOOLS

2.4.4 Usage

Both GIZA++ and mkcls will be called by the Moses training scripts; you should not have to invoke them yourself. Documentation is available in the corresponding README files.

2.5 Moses Decoder

2.5.1 Description

Moses is a statistical machine translation system which makes it possible to train translation models for any given language pair for which a parallel corpus, i.e. a collection of translated texts, exists.

The Moses decoder works using a beam search [Koehn, 2004a] algorithm to determine the best translation for a given input. Translation is phrase-based [Koehn et al., 2003] and allows words to have a factored representation. Moses has been designed by a team headed by Philipp Koehn; implementation was mainly done by Hieu Hoang.

2.5.2 Software

The Moses source code can be obtained from the project website or the SourceForge Subver- sion repository. The latest development version of the source code can be checked out using the following command:

$ svn co https://mosesdecoder.svn.sourceforge.net/svnroot/\

mosesdecoder/trunk mosesdecoder

Stable releases of the software can be downloaded at:


2.5.3 Installation

Assuming that the Moses source code is available in the$BASELINE folder, inside a subfolder mosesdecoder, it can be configured and compiled using the following commands. Please note that we have to specify the --with-srilm switch to enable SRILM language model usage:

$ cd $BASELINE/mosesdecoder

$ ./regenerate-makefiles.sh

$ ./configure --with-srilm=$BASELINE/srilm

$ make -j 4

After compilation has finished, the moses binary should be copied to the tools folder:

$ cp $BASELINE/mosesdecoder/moses-cmd/src/moses $TOOLS

Moses includes a set of support tools which should be put inside a scripts folder inside the tools folder. To correctly configure the path settings for the scripts, we have to edit the file mosesdecoder/scripts/Makefile:

TARGETDIR=$(TOOLS)/scripts
BINDIR=$(TOOLS)

The support tools can be generated using:

$ cd $BASELINE/mosesdecoder/scripts

$ make release

This will generate (yet) another subfolder scripts-YYYYMMDD-HHMM inside $TOOLS/scripts containing the current versions of the Moses scripts. To enable Moses to use them, the SCRIPTS_ROOTDIR variable has to be exported and set:

$ export SCRIPTS=$TOOLS/scripts

$ export SCRIPTS_ROOTDIR=$SCRIPTS/scripts-YYYYMMDD-HHMM

2.5.4 Usage

The Moses decoder relies on a fully trained translation model for a given language pair. Before we explain the steps necessary to create such a model, we will show the basic usage of the decoder. Assuming we have our Moses configuration file available in moses.ini and want to translate source.text, the corresponding call to Moses looks like this:

$ moses -config moses.ini -input-file source.text


2.5.5 Additional Requirements

Moses requires some additional tools for the training and evaluation processes. For training, a tokenizer and a lowercaser are necessary. These tools can be obtained from the website of the 2007 ACL Workshop on Statistical Machine Translation at:

http://www.statmt.org/wmt07/scripts.tgz

Evaluation is done using the notion of BLEU [Papineni et al., 2002] scores. An appropriate tool for this is the NIST BLEU scoring tool which is available here:

ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl

2.6 Basic Training

2.6.1 Preparational Steps

Assume we want to train a translation model for German to English. The first step is to prepare the parallel corpus data. It has to be tokenized and lowercased, and sentences which would be too long to handle (together with their correspondences in the other language) have to be removed from the corpus.

Corpus preparation can be done as follows:

$ tokenizer.perl -l de < corpus.de > tokens.de

$ tokenizer.perl -l en < corpus.en > tokens.en

$ clean-corpus-n.perl tokens de en clean-corpus 1 40

$ lowercase.perl < clean-corpus.de > lowercased.de

$ lowercase.perl < clean-corpus.en > lowercased.en

2.6.2 Training Step

Once the corpus data is prepared, the actual training process can be started. Moses offers a very helpful training script which is explained in more detail on the Moses website:

http://www.statmt.org/moses/?n=FactoredTraining.HomePage

In a nutshell, translation model training is done using:

$ train-factored-phrase-model.perl PARAMETERS

The training process takes a lot of time and memory to complete. Actual settings for the


2.7 Minimum Error Rate Training

It is possible to perform an optimization on the scoring weights which are used to produce translations. This is called minimum error rate training (MERT) [Och, 2003] and works by iteratively optimizing scoring weights for a test document which is translated and then compared to a reference translation. Assuming we want to optimize weights using the files tuning.de and tuning.en and a trained translation model from German to English is available in the baseline folder, a MERT can be performed using:

$ tokenizer.perl -l de < tuning.de > input.tokens

$ tokenizer.perl -l en < tuning.en > reference.tokens

$ lowercase.perl < input.tokens > input

$ lowercase.perl < reference.tokens > reference

$ mert-moses.pl input reference moses moses.ini\

--working-dir $BASELINE/tuning --rootdir $SCRIPTS

This will produce optimized weights for the given tuning document and store results inside a tuning subfolder in the baseline folder. As with the basic training script, the mert-moses.pl script will take a long time to finish. The final step is to insert the new weights into the basic moses.ini:

$ reuse-weights.perl tuning/moses.ini < moses.ini\

> tuning/optimized-moses.ini

This results in an optimized version of the original moses.ini which is stored in a new configuration file optimized-moses.ini inside the tuning folder.

2.8 Evaluation

Evaluation is done using the NIST BLEU scoring tool. Assuming a reference translation evaluation.en exists, we can evaluate the translation quality of output.en, which was translated from evaluation.de, as follows:

$ mteval-v11b.pl -r evaluation.en -t output.en -s evaluation.de

The different command line switches have the following meanings:

• -r denotes the reference translation

• -t denotes the translation output

• -s denotes the translation input


It is also possible to use the multi-bleu.perl script to evaluate the translation quality. This is easier to use as it does not require wrapping the translation files in SGML tags. Our example would look like this if multi-bleu.perl was used:

$ multi-bleu.perl evaluation.en < output.en

2.9 Summary

In this chapter we have shown how to set up a baseline system for statistical machine translation based upon open source software which is available at no charge. The system is built using the SRILM toolkit for language modeling and GIZA++ for the word alignment used during phrase-table generation. Sentence decoding is done with the Moses decoder; evaluation can be performed using several BLEU scoring tools.


Chapter 3

N-gram Indexing

3.1 Motivation

Whenever statistical machine translation is done this involves statistical language models of the target language [Brown et al., 1993]. These models are used to rate the quality of intermediate translation results and help to determine the best possible translation for a given sentence. Current language models are often generated using the SRILM toolkit [Stolcke, 2002] which was designed and implemented by Andreas Stolcke.

The standard file format for these models is called ARPA or Doug Paul format for N-gram backoff models. Informally speaking, a file in ARPA format contains lots of n-grams, one per line, each annotated with the corresponding conditional probability and (if available) a backoff weight. A single line within such a file might look like this:

-0.4543365 twilight zone .

or like this, if a backoff weight is present:

-2.93419 the big -0.5056427
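To make the line format concrete, the following minimal C++ sketch parses a single n-gram line into its probability, its words, and an optional backoff weight. It is an illustration only, not the parser used by SRILM or in this thesis; the struct and function names are our own, and we assume the n-gram order of the current file section is known.

#include <iostream>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Result of parsing one ARPA n-gram line (hypothetical helper, not thesis code).
struct ArpaEntry {
    double probability;              // leading log10 probability
    std::vector<std::string> words;  // the n-gram itself
    std::optional<double> backoff;   // trailing backoff weight, if present
};

// Parses a line such as "-2.93419 the big -0.5056427". Knowing the n-gram
// order lets us tell a trailing backoff weight apart from an n-gram word.
ArpaEntry parseArpaLine(const std::string& line, std::size_t order) {
    std::istringstream in(line);
    ArpaEntry entry;
    in >> entry.probability;
    std::string token;
    std::vector<std::string> tokens;
    while (in >> token) tokens.push_back(token);
    if (tokens.size() > order) {          // extra field must be the backoff weight
        entry.backoff = std::stod(tokens.back());
        tokens.pop_back();
    }
    entry.words = tokens;
    return entry;
}

int main() {
    ArpaEntry e = parseArpaLine("-2.93419 the big -0.5056427", 2);
    std::cout << e.probability << " | " << e.words[0] << ' ' << e.words[1]
              << " | " << (e.backoff ? *e.backoff : 0.0) << '\n';
    return 0;
}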

In recent years, the creation of high quality language models resulted in files up to 150 MB in size. Often these models only contained n-grams up to a maximum order of 3, i.e. trigrams.

Such models could easily be stored inside the memory of a single computer, access was realized by simple lookup of n-gram sequences from memory.

Nowadays things have changed. Due to an enormous amount of available source data, it is now possible to create larger and larger language models which could help to improve translation quality. These huge language models [Brants et al., 2007] will require a more clever way to handle n-gram access as several GB of memory could be necessary to store the full language model.


If not all of these n-grams are actually used for a specific translation task, parts of the allocated memory are just blocked without purpose and thus wasted. However, even if all superfluous n-grams inside a given large language model were filtered out a priori, it is still very likely that we would experience problems with the amount of memory which is required to store the full language model.

3.1.1 Possible Solutions

There exist several possible solutions to integrate support for very large language models into the existing frameworks for statistical machine translation. For instance, it might be interesting to use a hash function based approach to reduce the n-gram data size requirements.

It is also possible to think of a method which loads only very frequent n-grams into memory and handles rare n-grams by hard disk access.

For this thesis, we decided to investigate a different approach which we will call character-level n-gram indexing. This technique has been chosen because it is easy to understand and because it allows very fine-grained control over the amount of n-gram data which has to be loaded into memory. We will describe and define this approach on the following pages.

3.1.2 Character-level N-gram Indexing

Instead of loading the full n-gram data into the computer's memory, this data is indexed using some (hopefully) clever indexing method and only the resulting index data is stored in memory. That way only a small fraction of the original data has to be handled online; the vast majority of the n-gram data will remain offline and can be handled on demand. As the name implies, we will create the index data based on character-level n-gram prefixes.

It is quite clear that all n-gram computations then require lookup of the respective n-grams from hard disk. This will take more time than equivalent lookup operations from memory but otherwise it would not be possible to utilize a large language model at all once the computer cannot provide enough dedicated memory. Hard disk access can also be reduced by caching of already processed n-grams and by using a good indexing method which minimizes the average number of n-grams to read from hard disk for any given index key.


3.1.3 Definition: N-gram Prefix

We define n-gram indexing using the notion of so-called n-gram prefixes. Given an arbitrary n-gram w_1 w_2 ... w_n and a set of parameters {Γ_1, Γ_2, ..., Γ_n}, Γ_i ∈ ℕ_0^+, the character-level n-gram prefix set is computed as follows:

\[ \mathit{Key}_i = w_i[0 : \Gamma_i] \tag{3.1} \]

\[ \mathit{Prefix}_n = \{\mathit{Key}_1, \mathit{Key}_2, \ldots, \mathit{Key}_n\} \tag{3.2} \]

where w[0 : m] denotes the m-prefix of a given string w. Thus the indexing method takes the first Γ_i characters of each of the words w_i which make up the full n-gram and creates the index key set out of them. A parameter value of 0 will give an empty index key ε; any parameter value Γ_i > length(w_i) will return the full word w_i as index key.

This is what we formally define as the character-level n-gram prefix. Any indexing method of this family can be uniquely defined by the corresponding set of parameters {Γ_1, Γ_2, ..., Γ_n}. It is convenient to represent the complete n-gram prefix set as a string with spaces inserted as delimiters.
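As a small illustration of this definition, the following C++ sketch computes the character-level n-gram prefix for a tokenized n-gram and a parameter set Γ. The function and variable names are our own and not taken from the thesis implementation; an empty key is simply rendered as an empty string between the space delimiters.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Builds the prefix string "Key_1 Key_2 ... Key_n" for words w_i and
// parameters gamma_i; gamma_i = 0 yields an empty key, and values larger
// than the word length yield the full word.
std::string ngramPrefix(const std::vector<std::string>& words,
                        const std::vector<std::size_t>& gamma) {
    std::string prefix;
    for (std::size_t i = 0; i < words.size(); ++i) {
        std::size_t g = (i < gamma.size()) ? gamma[i] : 0;
        if (i > 0) prefix += ' ';
        prefix += words[i].substr(0, std::min(g, words[i].size()));
    }
    return prefix;
}

int main() {
    std::vector<std::string> ngram = {"the", "big", "ogre", "wants", "food"};
    // Uniform weighting with Gamma_i = 3, cf. the example in Section 3.1.4.
    std::cout << ngramPrefix(ngram, {3, 3, 3, 3, 3}) << '\n';  // "the big ogr wan foo"
    return 0;
}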

3.1.4 An Indexing Example

An example should make the general idea clear. Assume we want to index the 5-gram "the big ogre wants food" using n-gram prefixes. Given a uniform weighting of Γ_1 = ... = Γ_5, the character-level n-gram prefixes for this n-gram, ordered by increasing parameters Γ_i, would look like this:

Γ_i    character-level n-gram prefix
1      "t b o w f"
2      "th bi og wa fo"
3      "the big ogr wan foo"
4      "the big ogre want food"
5      "the big ogre wants food"

Table 3.1: Character-level n-gram prefixes example


When applied onto an actual language model file, each of the n-gram prefixes represents exactly the subset of n-grams which match the respective n-gram prefix. Each subset can be uniquely determined by the position of the first matching n-gram line within the original language model file and the total number of matching n-grams. Figure 3.1 illustrates the basic relationship between n-gram prefixes and n-gram subsets.

Figure 3.1: Relation between n-gram prefixes and n-grams in a language model file

To index a language model using n-gram prefixes, the file position and the n-gram line count are stored as the index value for each of the n-gram prefixes. The n-gram prefixes themselves serve as index keys. In practice that might look similar to this:

"th is fe" "393815:2"

The above representation encodes that the n-gram prefix ”th is fe” occurs at position 393,815 in the corresponding language model file and is valid for the following 2 lines of content.

Whenever we encounter an n-gram with this n-gram prefix and want to look up the corresponding n-gram data, we would have to jump to position 393,815 inside the respective language model file, read in the following two lines, and check whether the given n-gram is contained in any of them. If that is the case, we have found our n-gram and can return its conditional probability and backoff weight. If not, the n-gram is unknown as it is not contained within the language model data.
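The retrieval step just described could be sketched as follows. We assume, purely for illustration, that the index is an in-memory map from prefix strings to a byte offset and line count (the "393815:2" style value shown above); none of the names below are taken from the thesis or Moses code.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

// Index value: byte offset of the first matching line and the number of lines.
struct IndexEntry {
    std::streamoff offset;
    std::size_t lineCount;
};

// Looks up an n-gram via the in-memory prefix index; on success, 'scoreLine'
// receives the full matching model line (probability, n-gram, optional backoff).
bool lookupNgram(const std::unordered_map<std::string, IndexEntry>& index,
                 const std::string& prefix, const std::string& ngram,
                 const std::string& modelFile, std::string& scoreLine) {
    auto it = index.find(prefix);
    if (it == index.end()) return false;           // prefix not indexed -> unknown
    std::ifstream in(modelFile);
    in.seekg(it->second.offset);                   // jump to the stored file position
    std::string line;
    for (std::size_t i = 0; i < it->second.lineCount && std::getline(in, line); ++i) {
        std::istringstream fields(line);
        double prob;
        fields >> prob;                            // skip the probability field
        std::string rest;
        std::getline(fields, rest);                // the n-gram (and maybe a backoff)
        if (rest.find(ngram) != std::string::npos) {  // crude containment check
            scoreLine = line;
            return true;
        }
    }
    return false;                                  // not in this subset -> unknown
}

int main() {
    // Tiny self-contained demonstration with a two-line bigram "model file".
    const std::string file = "mini-model.txt";
    std::ofstream(file) << "-2.93419 the big -0.5056427\n-3.1 the bird -0.4\n";
    std::unordered_map<std::string, IndexEntry> index;
    index["th bi"] = {0, 2};                       // both lines share the prefix "th bi"
    std::string line;
    if (lookupNgram(index, "th bi", "the big", file, line)) std::cout << line << '\n';
    return 0;
}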


3.2 Basic Algorithm

The basic algorithm for language model indexing based on character-level n-gram prefixes is explained below. In fact, it is quite simple: it requires a given language model in ARPA format and a set of parameters which will be used for computing the index keys of the n-gram prefix for each n-gram inside the language model.

Algorithm 1 character-level n-gram indexer

Require: language model file LM in ARPA format, set of parameter values Γ = {Γ_1, Γ_2, ..., Γ_n}

1: index = ∅
2: while some n-gram in LM do
3:   current-ngram ← LM
4:   index = index ∪ NGRAM-PREFIX(current-ngram, Γ)
5: end while
6: return index

The pseudo-code does the following:

• line 1 initializes the index set.

• lines 2-5 represent the main loop which iterates over all n-grams.

• line 3 reads the current n-gram from the language model file.

• line 4 computes the n-gram prefix for this n-gram and adds the result to the index set.

• line 6 finally returns the index set as the result of the algorithm.

The most important part of the algorithm is the choice of the actual indexing method used inside NGRAM-PREFIX, i.e. the choice of the set of parameters Γ. We will describe and define several possible indexing methods in the next section.
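A compact C++ rendering of Algorithm 1 might look as follows; it directly stores the (file position, line count) pairs described in Section 3.1.4. This sketch assumes a plain file with one n-gram entry per line, all of the same known order, and leaves ARPA header handling out; all names are illustrative and not taken from the thesis implementation.

#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Index value as in Section 3.1.4: byte offset of the first matching line
// and the number of matching lines.
struct IndexEntry {
    std::streamoff offset;
    std::size_t lineCount;
};

// Character-level prefix of one n-gram line: the probability field is skipped
// and the first 'order' tokens are clipped to gamma_i characters each.
std::string linePrefix(const std::string& line, std::size_t order,
                       const std::vector<std::size_t>& gamma) {
    std::istringstream in(line);
    double prob;
    in >> prob;                                    // skip the log10 probability
    std::string word, prefix;
    for (std::size_t i = 0; i < order && (in >> word); ++i) {
        std::size_t g = (i < gamma.size()) ? gamma[i] : 0;
        if (i > 0) prefix += ' ';
        prefix += word.substr(0, std::min(g, word.size()));
    }
    return prefix;
}

// Builds the prefix index over all lines of the model file.
std::map<std::string, IndexEntry> buildIndex(const std::string& modelFile,
                                             std::size_t order,
                                             const std::vector<std::size_t>& gamma) {
    std::map<std::string, IndexEntry> index;
    std::ifstream in(modelFile);
    std::string line;
    std::streamoff offset = in.tellg();
    while (std::getline(in, line)) {
        const std::string key = linePrefix(line, order, gamma);
        auto it = index.find(key);
        if (it == index.end())
            index[key] = {offset, 1};              // first line of this subset
        else
            ++it->second.lineCount;                // subset grows by one line
        offset = in.tellg();                       // position of the next line
    }
    return index;
}

int main() {
    // Minimal demonstration on a two-line bigram file.
    const std::string file = "bigrams.txt";
    std::ofstream(file) << "-2.93419 the big -0.5056427\n-3.1 the bird -0.4\n";
    for (const auto& entry : buildIndex(file, 2, {2, 2}))
        std::cout << '"' << entry.first << "\" \"" << entry.second.offset
                  << ':' << entry.second.lineCount << "\"\n";   // e.g. "th bi" "0:2"
    return 0;
}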


3.3 Indexing Methods

3.3.1 Increasing Indexing

This indexing method creates increasingly larger index keys Key_i for the words w_i of a given n-gram w_1 w_2 ... w_n:

\[ \Gamma_1 = 1 \tag{3.3} \]

\[ \Gamma = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_n \mid \forall i > 1 : \Gamma_i = \Gamma_{i-1} + 1\} \tag{3.4} \]

For our "the big ogre wants food" example this would yield the n-gram prefix:

Γ                  character-level n-gram prefix
[1, 2, 3, 4, 5]    "t bi ogr want food"

Table 3.2: Increasing Indexing example

This method could be optimized by setting some maximum index position i_max after which all Γ_i, i > i_max, are clipped and set to 0 so that they can be neglected when constructing the index.

3.3.2 Decreasing Indexing

Decreasing Indexing effectively means reverse increasing indexing. Here the first index key is the largest and all subsequent index keys are of decreasing size.

\[ \Gamma_1 = \Gamma_{max}, \quad \Gamma_{max} \in \mathbb{N}^+ \tag{3.5} \]

\[ \Gamma = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_n \mid \forall i > 1 : \Gamma_i = \Gamma_{i-1} - 1,\ \Gamma_i \in \mathbb{N}_0^+\} \tag{3.6} \]

Assuming Γ_max = 4, the n-gram prefix for the example sentence changes to:

Γ                  character-level n-gram prefix
[4, 3, 2, 1, 0]    "the big og w ε"

Table 3.3: Decreasing Indexing example

Depending on the actual implementation of the indexing method, the empty index key ε can either be represented as a blank space or be left out of the n-gram prefix (see Section 3.3.4).


3.3.3 Uniform Indexing

This is the easiest possible indexing method. It assumes uniform parameters Γ_1 = Γ_2 = ... = Γ_n, which means that all index keys Key_i are of equal size:

\[ \Gamma_1 = \Gamma_{max}, \quad \Gamma_{max} \in \mathbb{N}^+ \tag{3.7} \]

\[ \Gamma = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_n \mid \forall i > 1 : \Gamma_i = \Gamma_{i-1}\} \tag{3.8} \]

Given Γ_max = 3, the character-level n-gram prefix for our example looks like this if we apply uniform indexing:

Γ                  character-level n-gram prefix
[3, 3, 3, 3, 3]    "the big ogr wan foo"

Table 3.4: Uniform Indexing example

Again, it may be clever to define a maximum index position i_max to decrease index size and to reduce index construction time.

3.3.4 Custom Indexing

It is also possible to design a custom indexing method which simply relies on the given set of Γ_i parameter values and does not assume any relation between these. Formally:

\[ \Gamma = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_n \mid \forall i \geq 1 : \Gamma_i \in \mathbb{N}_0^+\} \tag{3.9} \]

Example:

Γ                  character-level n-gram prefix
[3, 2, 0, 2, 1]    "the bi ε wa f"

Table 3.5: Custom Indexing example

Depending on the actual application of the indexing method, such a custom indexing approach might be a better choice than the three methods defined above, as the custom Γ_i parameters give more fine-grained control over the creation of the index data. However, it is important to take care of the empty index key ε. The decision to represent it as a blank space " " or to simply leave it out of the n-gram prefix has to be taken based on application needs.


3.4 Evaluation

For evaluation of these indexing methods, we will define the notions of compression rate, large subset rate, and compression gain. These can be used to effectively rate any possible indexing method.

3.4.1 Definition: Compression Rate

\[ \mathit{CompressionRate} = 1 - \frac{\mathrm{SIZE}(\text{index set})}{\mathrm{SIZE}(\text{ngram set})} \tag{3.10} \]

The compression rate compares the size of the generated index data with the size of the original n-gram collection and can be used to determine the compression factor of the indexing operation. A value of 0 represents no compression, any larger value represents some actual compression of the n-gram data.

3.4.2 Definition: Large Subset

Each index entry refers to a certain part of the original language model, a certain subset.

In order to access this data, the corresponding lines within the language model have to be looked up from hard disk. As hard disk access is relatively slow, it makes sense to keep the average subsets small and to prevent the creation of very large n-gram subsets.

For evaluation, we define a certain threshold, the so-called subset threshold, which divides the n-gram data subsets into small and large subsets.

3.4.3 Definition: Large Subset Rate

\[ \mathit{LargeSubsetRate} = \frac{\mathrm{COUNT}(\text{large subsets})}{\mathrm{SIZE}(\text{ngram set})} \tag{3.11} \]

The large subset rate describes the ratio between the line count of all large subsets and the size of the original n-gram collection. Effectively, this is the part of the language model which is considered too costly to process. Hence a good indexing method should always try to minimize this rate.


3.4.4 Definition: Compression Gain

\[ \mathit{CompressionGain} = \mathit{CompressionRate} - \mathit{LargeSubsetRate} \tag{3.12} \]

Compression gain represents a measure to rate the quality of an indexing method. It compares the compression rate and the large subset rate. The first is optimal if the index is very small; the second is optimal if the number of large subsets is very small. As these two values are diametrically opposed, their difference represents the amount of language model data which does not have to be handled online and can be efficiently processed. Effectively, this is the amount of saving the indexing method has achieved.
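As a purely hypothetical numeric example (the numbers are ours and not evaluation results from this thesis): for a model of 10,000,000 n-grams, an index of 500,000 entries, and large subsets covering a total of 1,500,000 n-gram lines, we would obtain

\[ \mathit{CompressionRate} = 1 - \tfrac{500{,}000}{10{,}000{,}000} = 0.95, \qquad \mathit{LargeSubsetRate} = \tfrac{1{,}500{,}000}{10{,}000{,}000} = 0.15, \]
\[ \mathit{CompressionGain} = 0.95 - 0.15 = 0.80, \]

i.e. 80% of the n-gram data would neither have to be kept in memory nor fall into a subset that is too costly to read from disk.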

3.4.5 Evaluation Concept

An optimal indexing method should minimize both the size of the generated index (which maximizes the compression rate) and the line count of all large subsets (which maximizes the compression gain).

3.4.6 Evaluation Results

Indexing methods have been evaluated using an English language model containing 10,037,283 n-grams which was about 324MB in size. The maximum n-gram order was 5. The results showed that:

• increasing indexing yields good and robust compression rates and relatively good large subset values. Indexing up to 3 words seemed to suffice even for the 5-gram model.

• decreasing indexing gives the best compression rates at the cost of more large subsets. Again, indexing only the first 3 words of each n-gram created the best results, which were even better than the corresponding results from increasing indexing.

• uniform indexing constructs large index sets which results in a bad compression rate. Small Γ values seem to work best, but compared to the other indexing methods, uniform indexing does not perform too well.

• custom indexing heavily relies on the chosen set of Γ_i parameters. While some of the custom results were remarkably good, others performed outstandingly bad.

More detailed information on the evaluation results is available in the following figures. The x-axis represents the subset threshold, the y-axis shows the compression gain. The full evaluation results are available in Appendix D.

Figure 3.2: Increasing indexing: compression gain (y) for increasing subset threshold (x)

Figure 3.3: Decreasing indexing: compression gain (y) for increasing subset threshold (x)

Figure 3.4: Uniform indexing: compression gain (y) for increasing subset threshold (x)
