tailieunhanh - Scientific report: "Automatic Acquisition of Named Entity Tagged Corpus from World Wide Web"

Automatic Acquisition of Named Entity Tagged Corpus from World Wide Web

Joohui An, Dept. of CSE, POSTECH, Pohang, Korea 790-784, minnie@
Seungwoo Lee, Dept. of CSE, POSTECH, Pohang, Korea 790-784, pinesnow@
Gary Geunbae Lee, Dept. of CSE, POSTECH, Pohang, Korea 790-784, gblee@

Abstract

In this paper, we present a method that automatically constructs a Named Entity (NE) tagged corpus from the web to be used for training Named Entity Recognition systems. We use an NE list and a web search engine to collect web documents that contain the NE instances. The documents are refined through sentence separation and text refinement procedures, and the NE instances are finally tagged with the appropriate NE categories. Our experiments demonstrate that the suggested method can acquire, without any human intervention, a sufficiently large NE tagged corpus that is as useful as a manually tagged one.

1 Introduction

The current trend in Named Entity Recognition (NER) is to apply machine learning approaches, which are attractive because they are trainable and adaptable; consequently, porting a machine learning system to another domain is much easier than porting a rule-based one. Various supervised learning methods for Named Entity (NE) tasks have been applied successfully and have shown reasonably satisfactory performance (Zhou and Su, 2002; Borthwick et al., 1998; Sassano and Utsuro, 2000). However, most of these systems rely heavily on a tagged corpus for training.
For a machine learning approach, a large corpus is required to circumvent the data sparseness problem, but the dilemma is that the cost of annotating a large training corpus is non-trivial. In this paper, we suggest a method that automatically constructs an NE tagged corpus from the web to be used for training NER systems. We use an NE list and a web search engine to collect web documents that contain the NE instances. The documents are refined through sentence separation and text refinement procedures and NE .
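
The pipeline described above (collect documents for entries in an NE list, separate sentences, refine the text, tag NE instances with their categories) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The NE list entries, the regex-based sentence splitter, the whitespace-only refinement rule, and the inline `<CATEGORY>...</CATEGORY>` tag format are all assumptions made for illustration, and the web search step is replaced by a list of already-fetched document strings.

```python
import re

# Hypothetical NE list (surface form -> category); not taken from the paper.
NE_LIST = {
    "POSTECH": "ORGANIZATION",
    "Pohang": "LOCATION",
}

def split_sentences(text):
    """Naive sentence separation: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def refine(sentence):
    """Assumed refinement rule: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", sentence).strip()

def tag_sentence(sentence, ne_list):
    """Wrap whole-word NE instances in inline tags; return (tagged_text, count)."""
    # Longest names first, so a longer entry wins over a shorter one it contains.
    names = sorted(ne_list, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def wrap(match):
        name = match.group(1)
        category = ne_list[name]
        return f"<{category}>{name}</{category}>"

    return pattern.subn(wrap, sentence)

def build_corpus(documents, ne_list):
    """Keep only sentences that contain at least one NE instance, tagged."""
    corpus = []
    for doc in documents:
        for sentence in split_sentences(doc):
            tagged, count = tag_sentence(refine(sentence), ne_list)
            if count:
                corpus.append(tagged)
    return corpus

# Stand-in for documents returned by a web search on the NE list entries.
docs = ["POSTECH is located   in Pohang. The weather is mild today."]
for line in build_corpus(docs, NE_LIST):
    print(line)
```

Sentences without any NE instance are dropped here, which is one plausible reading of the refinement step; a real system would also need to handle ambiguous names that belong to more than one category, which this sketch ignores.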
