Scientific report: "Learning to Find Translations and Transliterations on the Web"

Joseph Z. Chang, Department of Computer Science, National Tsing Hua University, 101 Kuangfu Road, Hsinchu 300, Taiwan, j
Jason S. Chang, Department of Computer Science, National Tsing Hua University, 101 Kuangfu Road, Hsinchu 300, Taiwan, j schang@
Jyh-Shing Roger Jang, Department of Computer Science, National Tsing Hua University, 101 Kuangfu Road, Hsinchu 300, Taiwan, jang@

Abstract

In this paper, we present a new method for learning to find translations and transliterations on the Web for a given term. The approach involves using a small set of terms and translations to obtain mixed-code snippets from a search engine, and automatically annotating the snippets with tags and features for training a conditional random field model. At runtime, the model is used to extract translation candidates for a given term. Preliminary experiments and evaluation show that our method cleanly combines various features, resulting in a system that outperforms previous work.

1 Introduction

The phrase translation problem is critical to machine translation, cross-lingual information retrieval, and multilingual terminology (Bian and Chen, 2000; Kupiec, 1993). Such systems typically use a parallel corpus. However, the out-of-vocabulary (OOV) problem is hard to overcome even with a very large training corpus, due to the Zipfian nature of word distribution and ever-growing new terminology and named entities. Luckily, there is an abundance of webpages consisting of mixed-code text, typically written in one language but interspersed with some sentential or phrasal translations in another language.
By retrieving and identifying such translation counterparts on the Web, we can cope with the OOV problem. Consider the technical term named-entity recognition. The best places to find the Chinese translations for named-entity recognition are probably not parallel corpora or dictionaries, but rather mixed-code webpages. The
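
The automatic annotation step mentioned in the abstract (tagging mixed-code snippets using a known term and its translation) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the tag names (B, I, E, O), the tokenizer, the function names, and the example Chinese rendering are all hypothetical and are not taken from the paper, and the features that would accompany each token are omitted.

```python
import re

# Assumption: Latin-script words (with internal hyphens) are single tokens;
# every other non-space character (e.g. a Chinese character) is its own token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|\S")


def tokenize(text):
    """Split mixed-code text into word and character tokens."""
    return TOKEN_RE.findall(text)


def find_span(tokens, pattern):
    """Return the start index of the first occurrence of pattern, or -1."""
    n = len(pattern)
    if n == 0:
        return -1
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == pattern:
            return i
    return -1


def annotate_snippet(snippet, term, translation):
    """Label each token of a snippet for sequence-model training.

    Hypothetical tag scheme: B/I = beginning/inside of the known
    translation, E = part of the source term, O = anything else.
    """
    tokens = tokenize(snippet)
    tags = ["O"] * len(tokens)

    # Mark the known translation with B/I tags.
    trans_tokens = tokenize(translation)
    start = find_span(tokens, trans_tokens)
    if start >= 0:
        tags[start] = "B"
        for j in range(start + 1, start + len(trans_tokens)):
            tags[j] = "I"

    # Mark the source term, matching case-insensitively.
    term_tokens = [t.lower() for t in tokenize(term)]
    start = find_span([t.lower() for t in tokens], term_tokens)
    if start >= 0:
        for j in range(start, start + len(term_tokens)):
            tags[j] = "E"

    return list(zip(tokens, tags))


if __name__ == "__main__":
    # Made-up snippet for illustration; the Chinese rendering is an assumption.
    snippet = "命名實體識別 (named-entity recognition)"
    for token, tag in annotate_snippet(
        snippet, "named-entity recognition", "命名實體識別"
    ):
        print(token, tag)
```

Token and tag sequences produced this way could then be paired with per-token features and passed to any conditional random field toolkit. At runtime, tokens predicted as B or I would be read off as translation candidates.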
