Scientific report: "Updating a Name Tagger Using Contemporary Unlabeled Data"
Updating a Name Tagger Using Contemporary Unlabeled Data

Cristina Mota
L2F INESC-ID, IST, NYU
Rua Alves Redol 9, 1000-029 Lisboa, Portugal
cmota@ist.utl.pt

Ralph Grishman
New York University, Computer Science Department
New York, NY 10003, USA
grishman@cs.nyu.edu

Abstract

For many NLP tasks, including named entity tagging, semi-supervised learning has been proposed as a reasonable alternative to methods that require annotating large amounts of training data. In this paper, we address the problem of analyzing new data given a semi-supervised NE tagger trained on data from an earlier time period. We will show that updating the unlabeled data is sufficient to maintain quality over time, and outperforms updating the labeled data. Furthermore, we will also show that augmenting the unlabeled data with older data in most cases does not result in better performance than simply using a smaller amount of current unlabeled data.

1 Introduction

Brill (2003) observed large gains in performance for different NLP tasks solely by increasing the size of unlabeled data, but stressed that for other NLP tasks, such as named entity recognition (NER), we still need to focus on developing tools that help to increase the size of annotated data. This problem is particularly crucial when processing languages, such as Portuguese, for which labeled data is scarce.
For instance, in the first NER evaluation for Portuguese, HAREM (Santos and Cardoso, 2007), only two of the nine participants presented systems based on machine learning, and both argued that they could have achieved significantly better results with larger training sets. Semi-supervised methods are commonly chosen as an alternative to overcome the lack of annotated resources, because they offer a good trade-off between the amount of labeled data needed and the performance achieved. Co-training is one of those methods, and has been extensively studied in NLP (Nigam and Ghani, 2000; Pierce and Cardie, 2001; Ng and Cardie, 2003; Mota and Grishman, 2008). In .