Scientific report: "Active Sample Selection for Named Entity Transliteration"

Dan Goldwasser, Dan Roth
Department of Computer Science, University of Illinois, Urbana, IL 61801
goldwas1 danr @

Abstract

This paper introduces a new method for identifying named-entity (NE) transliterations within bilingual corpora. Current state-of-the-art approaches usually require annotated data and relevant linguistic knowledge, which may not be available for all languages. We show how to effectively train an accurate transliteration classifier using very little data, obtained automatically. To perform this task, we introduce a new active sampling paradigm for guiding and adapting the sample selection process. We also investigate how to improve the classifier by identifying repeated patterns in the training data. We evaluated our approach using English, Russian and Hebrew corpora.

1 Introduction

This paper presents a new approach for constructing a discriminative transliteration model. Our approach is fully automated and requires little knowledge of the source and target languages. Named entity (NE) transliteration is the process of transcribing an NE from a source language to a target language based on phonetic similarity between the entities. Figure 1 provides examples of NE transliterations in English, Russian and Hebrew.
Figure 1: NE in English, Russian and Hebrew (example: English "Saint Petersburg", Russian "Санкт Петербург").

Identifying transliteration pairs is an important component in many linguistic applications, such as machine translation and information retrieval, which require identifying out-of-vocabulary words. In our settings we have access to source language NE and the ability to label the data upon request. We introduce a new active sampling paradigm that aims to guide the learner toward informative samples, allowing learning from a small number of representative examples. After the data is obtained, it is analyzed to identify repeating patterns which can be used to focus the .