
In this thesis we have shown that indexed language models are feasible within the Moses MT framework and make it possible to use very large language models. As we have seen, the memory requirements can be adapted using a suitable set of Γi parameters. The implementation should now be used under real-world conditions to improve the overall stability and performance of the system. The following sections propose several ideas for improvements and future work in the field of indexed language models.

7.3.1 Improved Performance

First and foremost, it seems important to optimize the processing speed of the indexed language model class and its underlying foundations. Reducing the performance loss would make the complete system more usable for experimentation and would also allow for a broader dissemination of this work.

To improve the efficiency of the indexed language model and its integration into the Moses MT framework, it seems reasonable to develop an improved n-gram cache inside the top-level Moses class LanguageModelIndexed, replacing the n-gram cache currently located in the low-level IndexedLM class.

7.3.2 Separation of Language Model Data

It would also be interesting to separate language model data into a small fraction which is always available in memory and a large fraction that is accessed using the indexed language model paradigm. For instance, it might reduce the number of hard-disk accesses if we kept all unigram data in memory while larger n-grams were still loaded from disk. This hybrid approach would not require much additional memory, yet the possible performance gain is tempting.

7.3.3 Batched N-gram Requests

When trying to work with our language model server, we experienced problems with the internal design of Moses' language model handling. At the moment there exists no way to collect multiple n-gram requests and send them as a single batched request, even though the availability of such batched requests could greatly improve the overall system performance for the remote language model server.


It might be possible to collect all n-gram requests at phrase or sentence level, or to use batches of a pre-defined size. However, this approach will most likely require several complex changes to the Moses decoder.

7.3.4 More Flexible Phrase-tables

The large Google language model did not yield any measurable improvement in translation quality because the phrase-table prevented the Moses decoder from accessing any of the additional information contained within the language model. The current implementation only works for tokens that are contained within the phrase-table; all other words are treated as unknown words and do not contribute to the overall translation quality.

As we want to utilize the vast amounts of n-gram data provided by n-gram corpora such as the Google 5-gram corpus, we have to find new ways to handle words that are unknown to the phrase-table. It is perfectly possible that an unknown token is contained within the language model data and could thus be used to create a better translation. This would require changes to Moses' internal phrase scoring.

7.3.5 Hybrid Language Models

Last but not least, it also seems a worthwhile effort to explore the advantages and problems of combined hybrid language models. Instead of using only a single language model, we could use a small in-domain language model in SRI format and combine it with a large out-of-domain indexed language model. Together, both models could improve translation quality and reduce the amount of error caused by the domain of the source text.


Appendices


Appendix Introduction

The following appendices give a more detailed insight into the program code and class design developed as part of this diploma thesis. As the full source code is far too large to be included in this document in its entirety, only a chosen subset of important code is printed and documented. For more details, refer to the comments within the source code.

Source Code License

All source code developed as part of this thesis is © 2007 by Christian Federmann.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Appendix A

N-gram Indexing Code

The indexing tool has been designed and implemented using C++. It is based on the Indexer class and built using the CmdLine class and the Debug macros. This appendix briefly introduces the nuts and bolts of the class design and provides further information on the program implementation.

The source code is freely available at http://www.cfedermann.de/diploma under the license terms printed on page 73.

A.1 Class: Indexer

The Indexer class takes care of parsing one or more language model files in ARPA format. It uses a given set of Γi parameters to create index data from the language model data, conforming to one of the indexing methods defined in chapter 3. It also handles unigram vocabulary creation and writes out sorted model files for each of the given language models.