Scientific report: "Finding Ideographic Representations of Japanese Names Written in Latin Script via Language Identification and Corpus Validation"

Yan Qu
Clairvoyance Corporation
5001 Baum Boulevard, Suite 700
Pittsburgh, PA 15213-1854, USA
yqu@clairvoyancecorp.com

Abstract

Multilingual applications frequently involve dealing with proper names, but names are often missing in bilingual lexicons. This problem is exacerbated for applications involving translation between Latin-scripted languages and Asian languages such as Chinese, Japanese and Korean (CJK), where simple string copying is not a solution. We present a novel approach for generating the ideographic representations of a CJK name written in a Latin script. The proposed approach involves first identifying the origin of the name and then back-transliterating the name to all possible Chinese characters using language-specific mappings. To reduce the massive number of possibilities for computation, we apply a three-tier filtering process, filtering first through a set of attested bigrams, then through a set of attested terms, and lastly through the WWW for a final validation. We illustrate the approach with English-to-Japanese back-transliteration. Against test sets of Japanese given names and surnames, we have achieved average precisions of 73% and 90%, respectively.

1 Introduction

Multilingual processing in the real world often involves dealing with proper names. Translations of names, however, are often missing in bilingual resources.
This absence adversely affects multilingual applications such as machine translation (MT) or cross-language information retrieval (CLIR), for which names are generally good discriminating terms for high IR performance (Lin et al., 2003). For language pairs with different writing systems, such as Japanese and English, and for which simple string-copying of a name from one language to another is not a solution, researchers have studied techniques for transliteration, i.e., phonetic translation across languages. For example