Extracting loanwords from Mongolian corpora and producing a Japanese-Mongolian bilingual dictionary

Badam-Osor Khaltar, Graduate School of Library, Information and Media Studies, University of Tsukuba, 1-2 Kasuga, Tsukuba 305-8550, Japan, khab23@
Atsushi Fujii, Graduate School of Library, Information and Media Studies, University of Tsukuba, 1-2 Kasuga, Tsukuba 305-8550, Japan, fujii@
Tetsuya Ishikawa, The Historiographical Institute, The University of Tokyo, 3-1 Hongo 7-chome, Bunkyo-ku, Tokyo 133-0033, Japan, ishikawa@

Abstract

This paper proposes methods for extracting loanwords from Cyrillic Mongolian corpora and producing a Japanese-Mongolian bilingual dictionary. We extract loanwords from Mongolian corpora using our own handcrafted rules. To complement the rule-based extraction, we also extract, as loanwords, words in Mongolian corpora that are phonetically similar to Japanese Katakana words. In addition, we map the extracted loanwords to Japanese words and produce a bilingual dictionary. We propose a stemming method for Mongolian to extract loanwords correctly. We verify the effectiveness of our methods experimentally.

1 Introduction

Reflecting the rapid growth in science and technology, new words and technical terms are being progressively created, and these words and terms are often transliterated when imported as loanwords in another language.
Loanwords are often not included in dictionaries, and decrease the quality of natural language processing, information retrieval, machine translation, and speech recognition. At the same time, compiling dictionaries is expensive, because it relies on human introspection and supervision. Thus, a number of automatic methods have been proposed to extract loanwords and their translations from corpora, targeting various languages. In this paper, we focus on extracting loanwords in Mongolian. The Mongolian language is divided into Traditional Mongolian, written using the Mongolian alphabet, and Modern
