Scientific report: "Machine Translation without Words through Substring Alignment"

Graham Neubig (1, 2), Taro Watanabe (2), Shinsuke Mori (1), Tatsuya Kawahara (1)

1 Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Sakyo-ku, Kyoto, Japan
2 National Institute of Information and Communication Technology, 3-5 Hikari-dai, Seika-cho, Soraku-gun, Kyoto, Japan

The first author is now affiliated with the Nara Institute of Science and Technology.

Abstract

In this paper, we demonstrate that accurate machine translation is possible without the concept of "words," treating MT as a problem of transformation between character strings. We achieve this result by applying phrasal inversion transduction grammar alignment techniques to character strings to train a character-based translation model, and using this in the phrase-based MT framework. We also propose a look-ahead parsing algorithm and substring-informed prior probabilities to achieve more effective and efficient alignment. In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs.

1 Introduction

Traditionally, the task of statistical machine translation (SMT) is defined as translating a source sentence $f_1^J = f_1, \ldots, f_J$ to a target sentence $e_1^I = e_1, \ldots, e_I$, where each element of $f_1^J$ and $e_1^I$ is assumed to be a word in the source and target languages. However, the definition of a word is often problematic. The most obvious example of this lies in languages that do not separate words with white space, such as Chinese, Japanese, or Thai, in which the choice of a segmentation standard has a large effect on translation accuracy (Chang et al., 2008).

Even for languages with explicit word boundaries, all machine translation systems perform at least some precursory form of tokenization, splitting punctuation and words to prevent the sparsity that would occur if punctuated and non-punctuated words were treated as different entities. Sparsity also
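The contrast between word sequences and character strings described above can be illustrated with a minimal Python sketch. It shows the same sentence split on white space only, split with punctuation separated, and treated as a plain character string. The function names, the regular expression, and the `_` symbol standing in for a space are illustrative assumptions, not the preprocessing used by the authors.

```python
import re


def whitespace_tokens(sentence):
    """Split on white space only, so punctuation stays attached to words."""
    return sentence.split()


def punctuation_split_tokens(sentence):
    """Separate punctuation marks from words (a simple, assumed tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def character_tokens(sentence, space_symbol="_"):
    """Treat the sentence as a character string; spaces become a visible symbol."""
    return [space_symbol if ch == " " else ch for ch in sentence]


if __name__ == "__main__":
    sentence = "the cat sat, the cat ran."
    # Without punctuation splitting, "sat," and "ran." are distinct vocabulary items.
    print(whitespace_tokens(sentence))
    # With punctuation splitting, "sat" and "ran" are shared with unpunctuated uses.
    print(punctuation_split_tokens(sentence))
    # At the character level no segmentation decision is needed at all.
    print(character_tokens(sentence))
```

In the white-space-only view, a word followed by a comma and the same word on its own count as two different entities, which is the sparsity that tokenization is meant to prevent. The character-level view sidesteps the segmentation choice entirely, at the cost of much longer sequences.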
