Scientific report: "A Joint Source-Channel Model for Machine Transliteration"

Li Haizhou, Zhang Min, Su Jian
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
hli sujian mzhang @

Abstract

Most foreign names are transliterated into Chinese, Japanese or Korean with approximate phonetic equivalents. The transliteration is usually achieved through intermediate phonemic mapping. This paper presents a new framework that allows direct orthographical mapping (DOM) between two different languages, through a joint source-channel model, also called n-gram transliteration model (TM). With the n-gram TM model we automate the orthographic alignment process to derive the aligned transliteration units from a bilingual dictionary. The n-gram TM under the DOM framework greatly reduces system development effort and provides a quantum leap in improvement in transliteration accuracy over that of other state-of-the-art machine learning algorithms. The modeling framework is validated through several experiments for English-Chinese language pair.

1 Introduction

In applications such as cross-lingual information retrieval (CLIR) and machine translation, there is an increasing need to translate out-of-vocabulary words from one language to another, especially from alphabet language to Chinese, Japanese or Korean. Proper names of English, French, German, Russian, Spanish and Arabic origins constitute a good portion of out-of-vocabulary words. They are translated through transliteration, the method of translating into another language by preserving how words sound in their original languages.
For writing foreign names in Chinese, transliteration always follows the original romanization. Therefore, any foreign name will have only one Pinyin (romanization of Chinese), and thus only one rendering in Chinese characters. In this paper, we focus on automatic Chinese transliteration of foreign alphabet names. Because some alphabet writing systems use various diacritical marks, we find it more practical to write names ...
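
To make the idea of an n-gram model over aligned transliteration units more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes the names have already been segmented into aligned (English unit, Chinese unit) pairs, it fixes n = 2 (a bigram), and it uses add-alpha smoothing purely for simplicity. The toy data, the `START` token, and the function names `train_bigram_tm` and `joint_log_prob` are all hypothetical.

```python
from collections import Counter
import math

# Hypothetical boundary token marking the start of a name.
START = ("<s>", "<s>")

# Toy, hand-aligned data (an assumption for illustration only):
# each name is a list of (English unit, Chinese unit) pairs.
TOY_ALIGNED = [
    [("s", "史"), ("mi", "密"), ("th", "斯")],
    [("s", "史"), ("mi", "密"), ("t", "特")],
    [("s", "斯"), ("mi", "米"), ("th", "斯")],
]


def train_bigram_tm(aligned_names):
    """Count contexts and bigrams of aligned (source, target) units."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for units in aligned_names:
        prev = START
        for unit in units:
            context_counts[prev] += 1
            bigram_counts[(prev, unit)] += 1
            vocab.add(unit)
            prev = unit
    return context_counts, bigram_counts, vocab


def joint_log_prob(units, model, alpha=1.0):
    """Log of the product over k of P(unit_k | unit_{k-1}), add-alpha smoothed.

    Units never seen in training still receive a small nonzero score, so
    this sketch is not a strictly normalized distribution.
    """
    context_counts, bigram_counts, vocab = model
    vocab_size = len(vocab)
    log_p = 0.0
    prev = START
    for unit in units:
        numerator = bigram_counts[(prev, unit)] + alpha
        denominator = context_counts[prev] + alpha * vocab_size
        log_p += math.log(numerator / denominator)
        prev = unit
    return log_p


model = train_bigram_tm(TOY_ALIGNED)
```

Because each unit carries both the source and the target string, scoring a unit sequence scores the name pair jointly, with no intermediate phonemic representation. Calling `joint_log_prob(TOY_ALIGNED[0], model)` returns the smoothed log-probability of the first toy alignment; comparing such scores across candidate alignments is one way a decoder could rank transliterations.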
