Bootstrapping Named Entity Recognition with Automatically Generated Gazetteer Lists

Zornitsa Kozareva
Dept. de Lenguajes y Sistemas Informaticos, University of Alicante, Alicante, Spain
zkozareva@

Abstract

Current Named Entity Recognition systems suffer from a lack of hand-tagged data as well as from degradation when moving to other domains. This paper explores two aspects: the automatic generation of gazetteer lists from unlabeled data, and the building of a Named Entity Recognition system with labeled and unlabeled data.

1 Introduction

Automatic information extraction and information retrieval concerning a particular person, location, organization, or title of a movie or book are closely related to the Named Entity Recognition (NER) task. NER consists in detecting the most salient and informative elements in a text, such as names of people, company names, locations, monetary currencies, and dates. Early NER systems (Fisher et al., 1997; Black et al., 1998; etc.) participating in the Message Understanding Conferences (MUC) used linguistic tools and gazetteer lists. However, these are difficult to develop and are domain sensitive.

To surmount these obstacles, the application of machine learning approaches to NER became a research subject. Various state-of-the-art machine learning algorithms, such as Maximum Entropy (Borthwick, 1999), AdaBoost (Carreras et al., 2002), Hidden Markov Models (Bikel et al.), and Memory-based learning (Tjong Kim Sang, 2002b), have been used.¹ Klein et al. (2003), Mayfield et al. (2003), Wu et al. (2003), and Kozareva et al. (2005c), among others, combined several classifiers to obtain a better named entity coverage rate.
¹ For other machine learning methods, consult http conll2002 ner http conll2003 ner

Nevertheless, all these machine learning algorithms rely on previously hand-labeled training data. Obtaining such data is labor-intensive and time-consuming, and such data might not even be available for languages with limited funding. Resource limitation directed NER research (Collins and Singer, 1999) .
