Dependency Treelet Translation: Syntactically Informed Phrasal SMT

Chris Quirk, Arul Menezes
Microsoft Research, One Microsoft Way, Redmond, WA 98052
chrisq, arulm @

Colin Cherry
University of Alberta, Edmonton, Alberta, Canada T6G 2E1
colinc@

Abstract

We describe a novel approach to statistical machine translation that combines syntactic information in the source language with recent advances in phrasal translation. This method requires a source-language dependency parser, target-language word segmentation, and an unsupervised word alignment component. We align a parallel corpus, project the source dependency parse onto the target sentence, extract dependency treelet translation pairs, and train a tree-based ordering model. We describe an efficient decoder and show that using these tree-based models in combination with conventional SMT models provides a promising approach that incorporates the power of phrasal SMT with the linguistic generality available in a parser.

1. Introduction

Over the past decade, we have witnessed a revolution in the field of machine translation (MT) toward statistical or corpus-based methods. Yet despite this success, statistical machine translation (SMT) has many hurdles to overcome. While it excels at translating domain-specific terminology and fixed phrases, grammatical generalizations are poorly captured and often mangled during translation (Thurmair, 04).

Limitations of string-based phrasal SMT

State-of-the-art phrasal SMT systems such as (Koehn et al., 03) and (Vogel et al., 03) model translations of phrases (here, strings of adjacent words, not syntactic constituents) rather than individual words. Arbitrary reordering of words is allowed within memorized phrases, but typically only a small amount of phrase reordering is allowed, modeled in terms of offset positions at the string level. This reordering model is very limited in terms of linguistic generalizations. For instance, when translating English to Japanese, an ideal system would ...