Recall-Oriented Learning of Named Entities in Arabic Wikipedia

Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, Noah A. Smith
School of Computer Science, Carnegie Mellon University
P.O. Box 24866, Doha, Qatar; Pittsburgh, PA 15213, USA
behrang@ nschneid@cs. rishavb@qatar. ko@cs. nasmith@cs.

Abstract

We consider the problem of NER in Arabic Wikipedia, a semisupervised domain adaptation setting for which we have no labeled training data in the target domain. To facilitate evaluation, we obtain annotations for articles in four topical groups, allowing annotators to identify domain-specific entity types in addition to standard categories. Standard supervised learning on newswire text leads to poor target-domain recall. We train a sequence model and show that a simple modification to the online learner (a loss function encouraging it to "arrogantly" favor recall over precision) substantially improves recall and F1. We then adapt our model with self-training on unlabeled target-domain data; enforcing the same recall-oriented bias in the self-training stage yields marginal gains.

1 Introduction

This paper considers named entity recognition (NER) in text that is different from most past research on NER. Specifically, we consider Arabic Wikipedia articles with diverse topics beyond the commonly used news domain.
These data challenge past approaches in two ways. First, Arabic is a morphologically rich language (Habash, 2010). Named entities are referenced using complex syntactic constructions (cf. English NEs, which are primarily sequences of proper nouns). The Arabic script suppresses most vowels, increasing lexical ambiguity, and lacks capitalization, a key clue for English NER. Second, much research has focused on the use of news text for system building and evaluation. Wikipedia articles are not news, belonging instead to a wide range of domains that are not clearly delineated.[1]

[1] The annotated dataset and a supplementary document with additional details of this work can be found at http .
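The recall-oriented modification described in the abstract amounts to making recall-losing mistakes costlier than precision-losing ones during training. A minimal sketch of such an asymmetric cost for token-level BIO tags is shown below; the function name, the specific cost weights, and the token-level formulation are illustrative assumptions, not the paper's exact cost-augmented learner.

```python
def recall_oriented_cost(gold_tags, pred_tags, fn_cost=2.0, fp_cost=0.5):
    """Asymmetric Hamming-style cost over BIO tag sequences.

    Missing a gold entity token (false negative) is penalized more
    heavily than hallucinating one (false positive), biasing a
    cost-augmented online learner toward recall. Weights are
    illustrative, not the paper's.
    """
    total = 0.0
    for g, p in zip(gold_tags, pred_tags):
        if g == p:
            continue
        if g != "O" and p == "O":    # missed entity token: false negative
            total += fn_cost
        elif g == "O" and p != "O":  # spurious entity token: false positive
            total += fp_cost
        else:                        # wrong entity type or boundary
            total += 1.0
    return total

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["O",     "I-PER", "O", "B-ORG"]
print(recall_oriented_cost(gold, pred))  # 2.0 (missed B-PER) + 1.0 (LOC/ORG) = 3.0
```

In a cost-augmented update (e.g., MIRA or a structured perceptron with margin scaled by cost), adding this term to the decoder's objective makes the learner prefer hypotheses that over-predict entities rather than miss them, which is the recall-favoring behavior the abstract describes.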
