Scientific report: "Collapsed Consonant and Vowel Models: New Approaches for English-Persian Transliteration and Back-Transliteration"

Sarvnaz Karimi, Falk Scholer, Andrew Turpin
School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne 3001, Australia
sarvnaz fscholer aht @

Abstract

We propose a novel algorithm for English to Persian transliteration. Previous methods proposed for this language pair apply a word alignment tool for training. By contrast, we introduce an alignment algorithm particularly designed for transliteration. Our new model improves the English to Persian transliteration accuracy by 14% over an n-gram baseline. We also propose a novel back-transliteration method for this language pair, a previously unstudied problem. Experimental results demonstrate that our algorithm leads to an absolute improvement of 25% over standard transliteration approaches.

1 Introduction

Translation of a text from a source language to a target language requires dealing with technical terms and proper names. These occur in almost any text, but rarely appear in bilingual dictionaries. The solution is the transliteration of such out-of-dictionary terms: a word from the source language is transformed to a word in the target language, preserving its pronunciation. Recovering the original word from the transliterated target is called back-transliteration.
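To make the two definitions concrete, the following is a minimal sketch in Python. It assumes a hypothetical five-entry character table (`TOY_TABLE`) chosen only for illustration; the table, the function names, and the character-by-character strategy are assumptions of this example and are not the alignment algorithm or models proposed in the paper. It shows that transliteration can map several source spellings to one target word, so back-transliteration generally has to choose among multiple candidates.

```python
from itertools import product

# Hypothetical toy table (an assumption for illustration, not from the paper):
# each English letter maps to exactly one Persian letter.
TOY_TABLE = {
    "a": "ا",
    "c": "ک",
    "k": "ک",
    "r": "ر",
    "s": "س",
}

# Invert the table: one Persian letter may come from several English letters.
INVERSE_TABLE = {}
for source_char, target_char in TOY_TABLE.items():
    INVERSE_TABLE.setdefault(target_char, []).append(source_char)


def transliterate(word):
    """Map an English word to Persian script, one character at a time.

    Raises KeyError for characters outside the toy table.
    """
    return "".join(TOY_TABLE[ch] for ch in word.lower())


def back_transliterate(word):
    """Return every English spelling that the toy table maps to `word`."""
    options = [sorted(INVERSE_TABLE[ch]) for ch in word]
    return ["".join(candidate) for candidate in product(*options)]


if __name__ == "__main__":
    target = transliterate("kara")
    print(target)
    # Two source spellings collapse to the same target word,
    # so back-transliteration is ambiguous.
    print(back_transliterate(target))
```

Even in this tiny table, "c" and "k" share a target letter, so recovering the original spelling requires ranking candidates rather than simply inverting the mapping.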
Automatic transliteration is important for many different applications, including machine translation, cross-lingual information retrieval, and cross-lingual question answering. Transliteration methods can be categorized into grapheme-based (AbdulJaleel and Larkey, 2003; Li et al., 2004), phoneme-based (Knight and Graehl, 1998; Jung et al., 2000), and combined (Bilac and Tanaka, 2005) approaches. Grapheme-based methods perform a direct orthographical mapping between source and target words, while phoneme-based approaches use an intermediate phonetic representation. Both grapheme- and phoneme-based methods usually begin by .