tailieunhanh - Scientific report: "Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases"

Chris Callison-Burch, Colin Bannard
University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW
chris colin @

Josh Schroeder
Linear B Ltd., 39 B Cumberland Street, Edinburgh EH3 6RA
josh@

Abstract

In this paper we describe a novel data structure for phrase-based statistical machine translation which allows for the retrieval of arbitrarily long phrases while simultaneously using less memory than is required by current decoder implementations. We detail the computational complexity and average retrieval times for looking up phrase translations in our suffix array-based data structure. We show how sampling can be used to reduce the retrieval time by orders of magnitude with no loss in translation quality.

1 Introduction

Statistical machine translation (SMT) has an advantage over many other statistical natural language processing applications in that training data is regularly produced by other human activity. For some language pairs very large sets of training data are now available. The publications of the European Union and United Nations provide gigabytes of data between various language pairs which can be easily mined using a web crawler. The Linguistic Data Consortium provides an excellent set of off-the-shelf Arabic-English and Chinese-English parallel corpora for the annual NIST machine translation evaluation exercises.
The size of the NIST training data presents a problem for phrase-based statistical machine translation. Decoders such as Pharaoh (Koehn, 2004) primarily use lookup tables for the storage of phrases and their translations. Since retrieving longer segments of human-translated text generally leads to better translation quality, participants in the evaluation exercise try to maximize the length of phrases that are stored in lookup tables. The combination of large corpora and long phrases means that the table size can quickly become unwieldy. A
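
To make the contrast with a precomputed lookup table concrete, the following is a minimal Python sketch of how a suffix array over a tokenized corpus can locate every occurrence of a phrase of any length, and how the matches might then be subsampled. It is an illustration under stated assumptions, not the authors' implementation: the function names (`build_suffix_array`, `find_occurrences`, `sample_occurrences`), the toy corpus, and the uniform random sampling scheme are all hypothetical, and the sketch covers only the source-side lookup, not the word alignments or the extraction of translations.

```python
import random


def build_suffix_array(tokens):
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction for illustration only: sorting slices is far too
    slow for a real corpus, where a dedicated construction algorithm
    would be used instead.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, suffix_array, phrase):
    """Return the sorted corpus positions at which `phrase` occurs.

    All suffixes that begin with the phrase are contiguous in the suffix
    array, so two binary searches find the block's boundaries.
    """
    n = len(phrase)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + n]

    # Lower bound: first suffix whose n-token prefix is >= phrase.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose n-token prefix is > phrase.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


def sample_occurrences(positions, max_samples, seed=0):
    """Keep at most `max_samples` positions (assumed uniform sampling)."""
    if len(positions) <= max_samples:
        return list(positions)
    return sorted(random.Random(seed).sample(positions, max_samples))


# Hypothetical toy corpus; a real system would index millions of tokens.
corpus = "the cat sat on the mat and the cat slept".split()
sa = build_suffix_array(corpus)
print(find_occurrences(corpus, sa, ["the", "cat"]))
print(sample_occurrences(find_occurrences(corpus, sa, ["the"]), 2))
```

Because the index stores one integer per corpus position rather than one table entry per extracted phrase, the phrase length does not need to be capped in advance. Each lookup takes a number of comparisons logarithmic in the corpus size, with each comparison touching at most as many tokens as the phrase contains.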
