Scientific report: "Hindi-to-Urdu Machine Translation Through Transliteration"
Hindi-to-Urdu Machine Translation Through Transliteration

Nadir Durrani, Hassan Sajjad, Alexander Fraser, Helmut Schmid
Institute for Natural Language Processing, University of Stuttgart
{durrani, sajjad, fraser, schmid}@ims.uni-stuttgart.de

Abstract

We present a novel approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context, whereas in previous work transliteration is only used for translating OOV (out-of-vocabulary) words. We use transliteration as a tool for disambiguation of Hindi homonyms, which can be both translated or transliterated, or transliterated differently, based on different contexts. We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model), as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. This indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.

1 Introduction

Hindi is an official language of India and is written in Devanagari script. Urdu is the national language of Pakistan, and also one of the state languages in India, and is written in Perso-Arabic script.
Hindi inherits its vocabulary from Sanskrit, while Urdu descends from several languages including Arabic, Farsi (Persian), Turkish and Sanskrit. Hindi and Urdu share grammatical structure and a large proportion of vocabulary that they both inherited from Sanskrit. Most of the verbs and closed-class words (pronouns, auxiliaries, case markers, etc.) are the same. Because both languages have lived together for centuries, some Urdu words which originally came from Arabic and Farsi have also mixed into Hindi and are now part of the Hindi vocabulary. The spoken