Scientific report: "Boosting Statistical Machine Translation by Lemmatization and Linear Interpolation"
Ruiqiang Zhang(1,2) and Eiichiro Sumita(1,2)
(1) National Institute of Information and Communications Technology
(2) ATR Spoken Language Communication Research Laboratories
2-2-2 Hikaridai, Seiika-cho, Soraku-gun, Kyoto, 619-0288, Japan
{ruiqiang.zhang, eiichiro.sumita}@atr.jp

Abstract

Data sparseness is one of the factors that degrade statistical machine translation (SMT). Existing work has shown that using morpho-syntactic information is an effective solution to data sparseness. However, fewer efforts have been made to apply English morpho-syntactic analysis to Chinese-to-English SMT. We found that, although English is a language with relatively little inflection, using English lemmas in training can significantly improve the quality of word alignment, which in turn leads to better translation performance. We carried out comprehensive experiments on multiple training data sets of varied sizes to prove this. We also proposed a new, effective linear interpolation method to integrate multiple homologous features of translation models.

1 Introduction

In modern phrase-based SMT, raw parallel data need to be preprocessed before they are aligned by alignment algorithms, one of which is the well-known tool GIZA++ (Och and Ney, 2003) for training IBM models 1-4. Morphological analysis (MA) is used in data preprocessing to convert the surface words of the raw data into a new format.
This new format can be lemmas, stems, parts-of-speech, morphemes, or mixes of these. One benefit of using MA is to ease data sparseness, which can reduce translation quality significantly, especially for tasks with small amounts of training data. Some published work has shown that applying morphological analysis improved the quality of SMT (Lee, 2004; Goldwater and McClosky, 2005). We found that all this earlier work involved experiments conducted on translations from highly inflected languages, such as Czech, Arabic, and Spanish, to English.
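The preprocessing idea described in this introduction, replacing surface words with lemmas before alignment to reduce sparseness, can be illustrated with a minimal sketch. The lemma table, the sample sentences, and the function names below are hypothetical and chosen only for illustration; they are not the morphological analyzer or the data used in the paper.

```python
from collections import Counter

# Hypothetical toy lemma table (an assumption for illustration only);
# a real system would use a full morphological analyzer.
TOY_LEMMAS = {
    "books": "book",
    "booked": "book",
    "booking": "book",
    "flights": "flight",
    "was": "be",
    "is": "be",
}


def lemmatize(tokens):
    """Lowercase each token and map it to its lemma; unknown tokens pass through."""
    return [TOY_LEMMAS.get(t.lower(), t.lower()) for t in tokens]


def vocabulary(corpus):
    """Count how often each word type occurs in a tokenized corpus."""
    return Counter(tok for sent in corpus for tok in sent)


# Made-up English side of a tiny parallel corpus.
corpus = [
    "He booked two flights".split(),
    "She books a flight".split(),
    "The booking is confirmed".split(),
]

surface_corpus = [[t.lower() for t in sent] for sent in corpus]
lemma_corpus = [lemmatize(sent) for sent in corpus]

surface_vocab = vocabulary(surface_corpus)
lemma_vocab = vocabulary(lemma_corpus)

print("surface types:", len(surface_vocab))
print("lemma types:", len(lemma_vocab))
print("occurrences of 'book' after lemmatization:", lemma_vocab["book"])
```

In this toy corpus, the three inflected forms of "book" collapse into a single type, so an aligner trained on the lemmatized text would see more evidence per word type than one trained on the surface forms.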