Scientific report: "A Method for Effective and Scalable Mining of Named Entity Transliterations from Large Comparable Corpora"

News stories are typically rich in NEs, and comparable news corpora can therefore be expected to contain NETEs (Klementiev and Roth, 2006; Tao et al., 2006). The large quantity and perpetual availability of news corpora in many of the world’s languages make mining of NETEs a viable alternative to traditional approaches. It is this opportunity that we address in our work. In this paper, we detail an effective and scalable mining method, called MINT (MIning Named-entity Transliteration equivalents), for mining NETEs from large comparable corpora.

MINT: A Method for Effective and Scalable Mining of Named Entity Transliterations from Large Comparable Corpora

Raghavendra Udupa, K Saravanan, A Kumaran, Jagadeesh Jagarlamudi
Microsoft Research India, Bangalore 560080, INDIA
raghavu v-sarak kumarana jags @

Abstract

In this paper, we address the problem of mining transliterations of Named Entities (NEs) from large comparable corpora. We leverage the empirical fact that multilingual news articles with similar news content are rich in Named Entity Transliteration Equivalents (NETEs). Our mining algorithm, MINT, uses a cross-language document similarity model to align multilingual news articles and then mines NETEs from the aligned articles using a transliteration similarity model. We show that our approach is highly effective on 6 different comparable corpora between English and 4 languages from 3 different language families. Furthermore, it performs substantially better than a state-of-the-art competitor.

1 Introduction

Named Entities (NEs) play a critical role in many Natural Language Processing and Information Retrieval (IR) tasks. In Cross-Language Information Retrieval (CLIR) systems, they play an even more important role, as the accuracy of their transliterations is shown to correlate highly with the performance of the CLIR systems (Mandl and Womser-Hacker, 2005; Xu and Weischedel, 2005). Traditional methods for transliterations have not proven to be very effective in CLIR. Machine Transliteration systems (AbdulJaleel and Larkey, 2003; Al-Onaizan and Knight, 2002; Virga and Khudanpur, 2003) usually produce incorrect transliterations, and translation lexicons such as hand-crafted or statistical dictionaries are too static to have good coverage of NEs [1] occurring in the current news events. Hence there is a critical need for creating and continually updat…

Currently with University of Utah.

[1] New NEs are introduced to the vocabulary of a language every day. On average, 260 and 452 new NEs appeared daily in the XIE and AFE segments…
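
As a rough illustration of the two-stage procedure outlined in the abstract (align articles using cross-language document similarity, then mine NETEs from the aligned articles using transliteration similarity), the following Python sketch may help. It is not the MINT implementation. The names `doc_similarity`, `translit_similarity`, and `mine_netes`, the toy bilingual `lexicon`, the romanized target-language tokens, and both thresholds are assumptions made purely for illustration. Simple set overlap and `difflib.SequenceMatcher` stand in for the two similarity models, which this excerpt does not specify.

```python
from difflib import SequenceMatcher


def doc_similarity(src_tokens, tgt_tokens, lexicon):
    """Stand-in cross-language document similarity (an assumption):
    Jaccard overlap after mapping source tokens through a toy lexicon."""
    mapped = {lexicon[t] for t in src_tokens if t in lexicon}
    tgt = set(tgt_tokens)
    union = mapped | tgt
    return len(mapped & tgt) / len(union) if union else 0.0


def translit_similarity(src_word, tgt_word):
    """Stand-in transliteration similarity (an assumption):
    character-level match ratio between a source word and a
    romanized target word."""
    return SequenceMatcher(None, src_word.lower(), tgt_word.lower()).ratio()


def mine_netes(src_docs, tgt_docs, lexicon,
               doc_threshold=0.3, pair_threshold=0.8):
    """Two-stage mining sketch: align each source article with its most
    similar target article, then keep word pairs that look like
    transliterations. Both thresholds are hypothetical values."""
    mined = set()
    for src in src_docs:
        # Stage 1: pick the best-matching target article, if any.
        scored = [(doc_similarity(src, tgt, lexicon), tgt) for tgt in tgt_docs]
        if not scored:
            continue
        best_score, best_tgt = max(scored, key=lambda pair: pair[0])
        if best_score < doc_threshold:
            continue
        # Stage 2: compare every source word with every target word.
        for src_word in src:
            for tgt_word in best_tgt:
                if translit_similarity(src_word, tgt_word) >= pair_threshold:
                    mined.add((src_word, tgt_word))
    return mined


# Toy data: tokenized English articles and romanized target-language articles.
src_docs = [["president", "obama", "visited", "delhi"]]
tgt_docs = [
    ["rashtrapati", "obaama", "dilli", "yatra"],
    ["cricket", "match", "tendulkar"],
]
lexicon = {"president": "rashtrapati", "visited": "yatra"}

pairs = mine_netes(src_docs, tgt_docs, lexicon)
print(pairs)  # {('obama', 'obaama')}
```

On this toy data the first target article is selected in the alignment stage and only the pair ("obama", "obaama") clears the pair threshold. The pair ("delhi", "dilli") scores 0.6 and is missed, which suggests why a surface string measure is only a crude stand-in for a dedicated transliteration similarity model.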
