Analysing Wikipedia and Gold-Standard Corpora for NER Training

Joel Nothman and Tara Murphy and James R. Curran
School of Information Technologies
University of Sydney
NSW 2006, Australia
{jnot4610,tm,james}@

Abstract

Named entity recognition (NER) for English typically involves one of three gold standards: MUC, CoNLL, or BBN, all created by costly manual annotation. Recent work has used Wikipedia to automatically create a massive corpus of named entity annotated text. We present the first comprehensive cross-corpus evaluation of NER. We identify the causes of poor cross-corpus performance and demonstrate ways of making them more compatible. Using our process, we develop a Wikipedia corpus which outperforms gold-standard corpora on cross-corpus evaluation by up to 11%.

1 Introduction

Named Entity Recognition (NER), the task of identifying and classifying the names of people, organisations and other entities within text, is central to many NLP systems. NER developed from information extraction in the Message Understanding Conferences (MUC) of the 1990s. By MUC-6 and MUC-7, NER had become a distinct task: tagging proper names and temporal and numerical expressions (Chinchor, 1998).

Statistical machine learning systems have proven successful for NER. These learn patterns associated with individual entity classes, making use of many contextual, orthographic, linguistic and external knowledge features. However, they rely heavily on large annotated training corpora. This need for costly expert annotation hinders the creation of more task-adaptable, high-performance named entity recognisers.

In acquiring new sources for annotated corpora, we require an analysis of training data as a variable in NER. This paper compares the three main gold-standard corpora. We found that tagging models built on each corpus perform relatively poorly when tested on the others. We therefore present three methods for analysing internal and inter-corpus inconsistencies. Our analysis demonstrates that seemingly …
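To make the annotation formats at issue concrete, here is a minimal Python sketch (ours, not the paper's) that converts token-span entity annotations into CoNLL-style IOB tags. The example sentence, the PER/ORG/LOC labels, and the to_iob helper are illustrative assumptions rather than anything specified by the authors.

    def to_iob(tokens, spans):
        """Map (start, end, label) token spans to IOB2 tags."""
        tags = ["O"] * len(tokens)
        for start, end, label in spans:
            tags[start] = "B-" + label          # first token of the entity
            for i in range(start + 1, end):
                tags[i] = "I-" + label          # continuation tokens
        return tags

    tokens = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."]
    spans = [(0, 1, "ORG"), (2, 3, "PER"), (5, 6, "LOC")]
    print(list(zip(tokens, to_iob(tokens, spans))))

MUC instead marks the same names inline with SGML ENAMEX elements (e.g. <ENAMEX TYPE="PERSON">Ekeus</ENAMEX>), and BBN uses a considerably finer-grained type inventory, so identical text receives structurally different annotations across the three gold standards; this is precisely the kind of incompatibility a cross-corpus evaluation exposes.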
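The cross-corpus evaluation design itself can be sketched in a few lines. In the hypothetical sketch below, train_tagger and f1_score are stubs standing in for a real statistical NER learner and the standard entity-level F-score; neither the stubs nor the corpus labels reflect the authors' actual implementation.

    CORPORA = ["MUC-7", "CoNLL-03", "BBN", "Wikipedia"]

    def train_tagger(corpus_name):
        # Stub: fit a statistical NER model on the named training corpus.
        def tagger(sentence_tokens):
            return ["O"] * len(sentence_tokens)
        return tagger

    def f1_score(tagger, corpus_name):
        # Stub: entity-level F-score of `tagger` on the named test set.
        return 0.0

    # Train on each corpus in turn and evaluate on all the others.
    for train in CORPORA:
        tagger = train_tagger(train)
        scores = {test: f1_score(tagger, test)
                  for test in CORPORA if test != train}
        print(train, "->", scores)

Filling in real components for the two stubs yields the train/test matrix of F-scores on which the paper's claim of poor cross-corpus performance rests.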