tailieunhanh - Scientific report: "Segmentation for English-to-Arabic Statistical Machine Translation"

Segmentation for English-to-Arabic Statistical Machine Translation

Ibrahim Badr, Rabih Zbib, James Glass
Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
iab02 rabih glass @

Abstract

In this paper, we report on a set of initial results for English-to-Arabic Statistical Machine Translation (SMT). We show that morphological decomposition of the Arabic source is beneficial, especially for smaller-size corpora, and investigate different recombination techniques. We also report on the use of Factored Translation Models for English-to-Arabic translation.

1 Introduction

Arabic has a complex morphology compared to English. Words are inflected for gender, number, and sometimes grammatical case, and various clitics can attach to word stems. An Arabic corpus will therefore have more surface forms than an English corpus of the same size, and will also be more sparsely populated. These factors adversely affect the performance of Arabic-English Statistical Machine Translation (SMT). Prior work (Lee, 2004; Habash and Sadat, 2006) has shown that morphological segmentation of the Arabic source benefits the performance of Arabic-to-English SMT. Using similar techniques for English-to-Arabic SMT requires recombining the target side into valid surface forms, which is not a trivial task.

In this paper, we present an initial set of experiments on English-to-Arabic SMT. We report results from two domains: text news, trained on a large corpus, and spoken travel conversation, trained on a significantly smaller corpus.
We show that segmenting the Arabic target in training and decoding improves performance. We propose various schemes for recombining the segmented Arabic and compare their effect on translation. We also report on applying Factored Translation Models (Koehn and Hoang, 2007) for English-to-Arabic translation.

2 Previous Work

The only previous work on English-to-Arabic SMT that we are aware of is by Sarikaya .
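
The segment-then-recombine idea described in the introduction can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the clitic lists, the "+" marker convention, the minimum stem length, and the Buckwalter-style transliterated example word are hypothetical and are not the segmentation or recombination schemes evaluated in the paper. Recombination here is plain concatenation, which ignores any spelling changes at clitic boundaries; handling such changes is one reason the real task is harder than this toy suggests.

```python
# Toy illustration only: the clitic inventories and the "+" marker convention
# are assumptions for this sketch, not the schemes used in the paper.
PREFIXES = ("w", "b", "l")      # hypothetical single-letter proclitics
SUFFIXES = ("hm", "hA", "h")    # hypothetical pronominal enclitics


def segment(word, min_stem=3):
    """Greedily split one token into 'prefix+', stem, '+suffix' pieces."""
    pieces = []
    # Peel off prefixes while at least min_stem letters would remain.
    while len(word) - 1 >= min_stem and word[0] in PREFIXES:
        pieces.append(word[0] + "+")
        word = word[1:]
    # Strip at most one suffix, again keeping a minimum stem length.
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            suffix = s
            word = word[:-len(s)]
            break
    pieces.append(word)
    if suffix:
        pieces.append("+" + suffix)
    return pieces


def recombine(tokens):
    """Glue marked pieces back into surface words by plain concatenation."""
    words = []
    pending = ""  # prefixes waiting for their stem
    for tok in tokens:
        if tok.endswith("+"):
            # Prefix piece: hold it until the next stem arrives.
            pending += tok[:-1]
        elif tok.startswith("+") and words and not pending:
            # Suffix piece: attach to the previous word.
            words[-1] += tok[1:]
        else:
            # Stem (or a stray piece): start a new word.
            words.append(pending + tok.lstrip("+"))
            pending = ""
    if pending:
        words.append(pending)
    return words


sentence = "wbktAbhm ktAb".split()
segmented = [piece for word in sentence for piece in segment(word)]
restored = recombine(segmented)
```

In this sketch the segmented form is what a translation system would be trained on and would produce, and `recombine` is applied to the decoder output as a post-processing step. Because the markers record which side each clitic attaches to, the round trip is lossless for this toy inventory.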
