Scientific paper: "Semi-Supervised Sequential Labeling and Segmentation using Giga-word Scale Unlabeled Data"

Semi-Supervised Sequential Labeling and Segmentation using Giga-word Scale Unlabeled Data

Jun Suzuki and Hideki Isozaki
NTT Communication Science Laboratories, NTT Corp.
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan

Abstract

This paper provides evidence that the use of more unlabeled data in semi-supervised learning can improve the performance of Natural Language Processing (NLP) tasks, such as part-of-speech tagging, syntactic chunking, and named entity recognition. We first propose a simple yet powerful semi-supervised discriminative model appropriate for handling large scale unlabeled data. Then, we describe experiments performed on widely used test collections, namely, PTB III data, and CoNLL'00 and '03 shared task data for the above three NLP tasks, respectively. We incorporate up to 1G-words (one billion tokens) of unlabeled data, which is the largest amount of unlabeled data ever used for these tasks, to investigate the performance improvement. In addition, our results are superior to the best reported results for all of the above test collections.

1 Introduction

Today, we can easily find a large amount of unlabeled data for many supervised learning applications in Natural Language Processing (NLP). Therefore, to improve performance, the development of an effective framework for semi-supervised learning (SSL) that uses both labeled and unlabeled data is attractive for both the machine learning and NLP communities. We expect that such SSL will replace most supervised learning in real world applications.

In this paper, we focus on traditional and important NLP tasks, namely part-of-speech (POS) tagging, syntactic chunking, and named entity recognition (NER). These are also typical supervised learning applications in NLP, and are referred to as sequential labeling and segmentation problems. In some cases, these tasks have relatively large amounts of labeled training data. In this situation, supervised learning can provide competitive results, and it is …
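To make the task formulation in the excerpt above concrete, the following is a minimal illustrative sketch (not part of the original paper, and not the authors' model) of how segmentation tasks such as NER and chunking are cast as per-token sequential labeling with IOB-style tags. The example sentence, function name, and label set are assumptions chosen in the style of the CoNLL'03 NER data.

# Illustrative sketch only: segmentation (NER/chunking) viewed as sequential
# labeling. Each token receives one IOB-style label, and entity segments are
# recovered from the label sequence afterwards.

def iob_to_spans(tokens, labels):
    """Convert per-token IOB labels (e.g. B-PER, I-PER, O) into labeled spans."""
    spans, start, ent_type = [], None, None
    for i, label in enumerate(labels):
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            # A B- tag (or a stray I- tag with no open segment) starts a new span.
            if start is not None:
                spans.append((ent_type, " ".join(tokens[start:i])))
            start, ent_type = i, label[2:]
        elif label == "O" and start is not None:
            # An O tag closes the currently open span.
            spans.append((ent_type, " ".join(tokens[start:i])))
            start, ent_type = None, None
        # An I- tag matching an open span simply continues it (type mismatches
        # are ignored in this sketch).
    if start is not None:
        spans.append((ent_type, " ".join(tokens[start:])))
    return spans

# Hypothetical example in the style of the CoNLL'03 NER data.
tokens = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."]
labels = ["B-ORG", "O", "B-PER", "O", "O", "B-LOC", "O"]
print(iob_to_spans(tokens, labels))
# [('ORG', 'U.N.'), ('PER', 'Ekeus'), ('LOC', 'Baghdad')]

A discriminative sequence model (a CRF, for instance) would be trained to predict one such label per token; the paper's contribution, as described in the abstract, is a semi-supervised discriminative model of this general kind that can additionally exploit giga-word scale unlabeled text.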
