tailieunhanh - Scientific report: "Improving Arabic-to-English Statistical Machine Translation by Reordering Post-verbal Subjects for Alignment"

Marine Carpuat, Yuval Marton, Nizar Habash
Columbia University, Center for Computational Learning Systems
475 Riverside Drive, New York, NY 10115
marine ymarton habash @

Abstract

We study the challenges raised by Arabic verb and subject detection and reordering in Statistical Machine Translation (SMT). We show that post-verbal subject (VS) constructions are hard to translate because they have highly ambiguous reordering patterns when translated to English. In addition, implementing reordering is difficult because the boundaries of VS constructions are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. We therefore propose to reorder VS constructions into SV order for SMT word alignment only. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline and despite noisy parses.

1 Introduction

Modern Standard Arabic (MSA) is a morpho-syntactically complex language whose phenomena differ from those of English, a fact that raises many interesting issues for natural language processing and Arabic-to-English statistical machine translation (SMT). While comprehensive Arabic preprocessing schemes have been widely adopted for handling Arabic morphology in SMT (Sadat and Habash, 2006; Zollmann et al., 2006; Lee, 2004), syntactic issues have not received as much attention by comparison (Green et al., 2009; Crego and Habash, 2008; Habash, 2007).
Arabic verbal constructions are particularly challenging, since subjects can occur in pre-verbal (SV), post-verbal (VS), or pro-dropped ("null subject") constructions. As a result, training data for learning verbal construction translations is split between the different constructions and their patterns, and complex reordering schemas are needed in order to translate them into primarily pre-verbal subject (SVO) languages such as English. These issues are particularly problematic in .
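
The VS-to-SV reordering proposed in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `reorder_vs_to_sv`, the `(verb_index, subj_start, subj_end)` span format, and the returned position map are all assumptions made for illustration, and in practice the spans would come from a (noisy) dependency parser. The position map is included on the assumption that links computed on the reordered sentence need to be traced back to the original word order, since the reordering is meant for word alignment only.

```python
from typing import List, Tuple

# Hypothetical span format: (verb_index, subj_start, subj_end), where the
# subject occupies tokens[subj_start:subj_end] and follows the verb.
VSSpan = Tuple[int, int, int]


def reorder_vs_to_sv(tokens: List[str], vs_spans: List[VSSpan]) -> Tuple[List[str], List[int]]:
    """Move each post-verbal subject in front of its verb.

    Returns the reordered tokens and, for each reordered position, the index
    of that token in the original sentence.
    """
    order = list(range(len(tokens)))
    for verb, s_start, s_end in vs_spans:
        # Skip spans that are malformed or not actually post-verbal.
        if not (0 <= verb < s_start < s_end <= len(tokens)):
            continue
        # Length-preserving rotation: subject first, then the verb and
        # anything that stood between the verb and the subject.
        order[verb:s_end] = order[s_start:s_end] + order[verb:s_start]
    reordered = [tokens[i] for i in order]
    return reordered, order


# Placeholder tokens: a verb, a two-token subject, and an object.
reordered, positions = reorder_vs_to_sv(["V", "S1", "S2", "O"], [(0, 1, 3)])
print(reordered)   # ['S1', 'S2', 'V', 'O']
print(positions)   # [1, 2, 0, 3]
```

Here `positions[i]` gives the original index of the token at reordered position `i`, so an alignment link attached to reordered position `i` can be reassigned to original position `positions[i]`. Overlapping spans are not handled in this sketch.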
