Scientific report: "Learning Transliteration Lexicons from the Web"
Jin-Shea Kuo 1,2
1 Chung-Hwa Telecom. Laboratories, Taiwan
jskuo@

Haizhou Li
Institute for Infocomm Research, Singapore
hzli@

Ying-Kuei Yang 2
2 National Taiwan University of Science and Technology, Taiwan
ykyang@

Abstract

This paper presents an adaptive learning framework for Phonetic Similarity Modeling (PSM) that supports the automatic construction of transliteration lexicons. The learning algorithm starts with minimum prior knowledge about machine transliteration and acquires knowledge iteratively from the Web. We study the active learning and the unsupervised learning strategies that minimize human supervision in terms of data labeling. The learning process refines the PSM and constructs a transliteration lexicon at the same time. We evaluate the proposed PSM and its learning algorithm through a series of systematic experiments, which show that the proposed framework is reliably effective on two independent databases.

1 Introduction

In applications such as cross-lingual information retrieval (CLIR) and machine translation (MT), there is an increasing need to translate out-of-vocabulary (OOV) words, for example from an alphabetical language to Chinese. Foreign proper names constitute a good portion of OOV words, which are translated into Chinese through transliteration.
Transliteration is the process of translating a foreign word into a native language while preserving its pronunciation in the original language; it is also known as translation-by-sound.

MT and CLIR systems rely heavily on bilingual lexicons, which are typically compiled manually. However, in view of the current information explosion, it is labor-intensive, if not impossible, to compile a complete lexicon of proper nouns. The Web is growing at a fast pace and provides a live information source that is rich in transliterations. This paper presents a novel solution for automatically constructing an English-Chinese transliteration lexicon from