Scientific report: "A Modified Joint Source-Channel Model for Transliteration"

Asif Ekbal, Comp. Sc. Engg. Deptt., Jadavpur University, India, ekbal_asif12@
Sudip Kumar Naskar, Comp. Sc. Engg. Deptt., Jadavpur University, India, sudip_naskar@
Sivaji Bandyopadhyay, Comp. Sc. Engg. Deptt., Jadavpur University, India, sivaji_cse_ju@

Abstract

Most machine transliteration systems transliterate out-of-vocabulary (OOV) words through an intermediate phonemic mapping. This paper presents a framework that allows direct orthographic mapping between two languages of different origins that use different alphabet sets. A modified joint source-channel model is proposed, along with a number of alternative models. Aligned transliteration units, together with their context, are derived automatically from a bilingual training corpus to generate the collocational statistics. The transliteration units in Bengali words take the pattern C M, where C represents a vowel, a consonant, or a conjunct and M represents the vowel modifier or matra. The English transliteration units are of the form C V, where C represents a consonant and V represents a vowel. A Bengali-English machine transliteration system has been developed based on the proposed models. The system has been trained to transliterate person names from Bengali to English. It uses linguistic knowledge of the possible conjuncts and diphthongs in Bengali and their equivalents in English.
The system has been evaluated, and the modified joint source-channel model was observed to perform best on both Word Agreement Ratio and Transliteration Unit Agreement Ratio.

1 Introduction

In Natural Language Processing (NLP) application areas such as information retrieval, question answering systems, and machine translation, there is an increasing need to translate OOV words from one language to another. They are translated through transliteration, the method of translating into another language by expressing the original .
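The abstract mentions collocational statistics gathered from aligned transliteration units (TUs) and their context. As a rough illustration only, the sketch below counts bigrams over aligned source-target TU pairs and turns them into maximum-likelihood conditional probabilities. This excerpt does not give the paper's model equations, so the sketch is a plain bigram over TU pairs, not the authors' modified model. The toy data, the romanized placeholder TUs, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: each name is a list of aligned (source TU, target TU)
# pairs. Source TUs are romanized placeholders, not real Bengali script.
TRAINING = [
    [("su", "su"), ("di", "di"), ("p", "p")],
    [("su", "su"), ("mi", "mi"), ("t", "t")],
    [("su", "soo"), ("di", "dee"), ("p", "p")],
]

START = ("<s>", "<s>")  # sentinel pair marking the start of a name


def collect_counts(names):
    """Count TU-pair bigrams and how often each pair occurs as left context."""
    bigrams, contexts = Counter(), Counter()
    for pairs in names:
        prev = START
        for pair in pairs:
            bigrams[(prev, pair)] += 1
            contexts[prev] += 1
            prev = pair
    return bigrams, contexts


def pair_probability(bigrams, contexts, prev, pair):
    """Maximum-likelihood estimate of P(pair | prev); 0.0 for unseen contexts."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, pair)] / contexts[prev]


def name_probability(bigrams, contexts, pairs):
    """Probability of a whole aligned name as a product of bigram estimates."""
    prob, prev = 1.0, START
    for pair in pairs:
        prob *= pair_probability(bigrams, contexts, prev, pair)
        prev = pair
    return prob
```

On this toy data, the first training name scores 2/3 × 1/2 × 1 = 1/3, since two of the three names start with the pair ("su", "su") and one of those two continues with ("di", "di"). The estimates are unsmoothed, so any unseen pair sequence receives probability zero; a practical system would need smoothing or back-off.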