Scientific report: "Hindi-to-Urdu Machine Translation Through Transliteration"

Nadir Durrani, Hassan Sajjad, Alexander Fraser, Helmut Schmid
Institute for Natural Language Processing, University of Stuttgart
durrani, sajjad, fraser, schmid @

Abstract

We present a novel approach to integrating transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context, whereas in previous work transliteration is used only for translating OOV (out-of-vocabulary) words. We use transliteration as a tool for disambiguating Hindi homonyms, which can be either translated or transliterated, or transliterated differently, depending on the context. We report final BLEU scores for the conditional probability model and the joint probability model, compared with a baseline phrase-based system and with a system which transliterates OOV words in the baseline system. This indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.

1 Introduction

Hindi is an official language of India and is written in Devanagari script. Urdu is the national language of Pakistan and also one of the state languages in India, and is written in Perso-Arabic script.
Hindi inherits its vocabulary from Sanskrit, while Urdu descends from several languages including Arabic, Farsi (Persian), Turkish and Sanskrit. Hindi and Urdu share grammatical structure and a large proportion of vocabulary that they both inherited from Sanskrit. Most of the verbs and closed-class words (pronouns, auxiliaries, case markers, etc.) are the same. Because both languages have lived together for centuries, some Urdu words which originally came from Arabic and Farsi have also mixed into Hindi and are now part of the Hindi vocabulary. The spoken
