Scientific report: "Weakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora"
Alexandre Klementiev, Dan Roth
Dept. of Computer Science, University of Illinois, Urbana, IL 61801
{klementi,danr}@uiuc.edu

Abstract

Named Entity recognition (NER) is an important part of many natural language processing tasks. Current approaches often employ machine learning techniques and require supervised data. However, many languages lack such resources. This paper presents an (almost) unsupervised learning algorithm for automatic discovery of Named Entities (NEs) in a resource-free language, given a bilingual corpus in which it is weakly temporally aligned with a resource-rich language. NEs have similar time distributions across such corpora, and often some of the tokens in a multi-word NE are transliterated. We develop an algorithm that exploits both observations iteratively. The algorithm makes use of a new frequency-based metric for time distributions and a resource-free discriminative approach to transliteration. Seeded with a small number of transliteration pairs, our algorithm discovers multi-word NEs and takes advantage of a dictionary (if one exists) to account for translated or partially translated NEs. We evaluate the algorithm on an English-Russian corpus and show a high level of NE discovery in Russian.
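To make the time-distribution idea concrete, the following is a minimal sketch of one way to compare the temporal profiles of two words in the frequency domain. It is an illustration only: the abstract does not define the metric, so the choices here (normalizing counts to sum to 1, a naive discrete Fourier transform, Euclidean distance over the first few coefficients) and all names (`normalize`, `dft`, `frequency_distance`, `num_coeffs`) are assumptions, not the authors' method. The sample counts are made up.

```python
import cmath
import math


def normalize(counts):
    """Scale per-period occurrence counts so they sum to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def dft(signal):
    """Naive discrete Fourier transform, O(n^2); adequate for short series."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def frequency_distance(counts_a, counts_b, num_coeffs=4):
    """Euclidean distance between the first num_coeffs DFT coefficients
    of two normalized count series. Smaller means more similar.
    (Hypothetical metric, chosen for illustration.)"""
    if len(counts_a) != len(counts_b):
        raise ValueError("both series must cover the same time periods")
    fa = dft(normalize(counts_a))[:num_coeffs]
    fb = dft(normalize(counts_b))[:num_coeffs]
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))


# Made-up counts per time period for an English NE and two Russian candidates.
english_ne = [0, 1, 5, 9, 2, 0, 0, 1]
same_shape = [0, 2, 10, 18, 4, 0, 0, 2]   # same temporal profile, different volume
flat_word = [3, 3, 2, 3, 3, 2, 3, 3]      # no burst, as for a common word

print(frequency_distance(english_ne, same_shape))
print(frequency_distance(english_ne, flat_word))
```

In this sketch, a candidate whose mentions rise and fall in step with the English NE gets a distance near zero regardless of overall volume, while a word spread evenly over time scores much farther away. Keeping only low-frequency coefficients is one simple way to tolerate the weak alignment between the two news streams.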
1 Introduction

Named Entity recognition has been getting much attention in NLP research in recent years, since it is seen as a significant component of higher-level NLP tasks such as information distillation and question answering. Most successful approaches to NER employ machine learning techniques, which require supervised training data. However, for many languages, these resources do not exist. Moreover, it is often difficult to find experts in these languages, both for the expensive annotation effort and even for language-specific clues. On the other hand, comparable multilingual data (such as multilingual news streams) are becoming increasingly available (see Section 4). In .