Scientific report: "Self-Training for Enhancement and Domain Adaptation of Statistical Parsers Trained on Small Datasets"

Roi Reichart, ICNC, Hebrew University of Jerusalem, roiri@
Ari Rappoport, Institute of Computer Science, Hebrew University of Jerusalem, arir@

Abstract

Creating large amounts of annotated data to train statistical PCFG parsers is expensive, and the performance of such parsers declines when training and test data are taken from different domains. In this paper we use self-training to improve the quality of a parser and to adapt it to a different domain, using only small amounts of manually annotated seed data. We report significant improvement both when the seed and test data are in the same domain and in the out-of-domain adaptation scenario. In particular, we achieve a 50% reduction in annotation cost for the in-domain case, yielding an improvement of 66% over previous work, and a 20-33% reduction for the domain adaptation case. This is the first time that self-training with small labeled datasets is applied successfully to these tasks. We were also able to formulate a characterization of when self-training is valuable.

1 Introduction

State-of-the-art statistical parsers (Collins, 1999; Charniak, 2000; Koo and Collins, 2005; Charniak and Johnson, 2005) are trained on manually annotated treebanks that are highly expensive to create. Furthermore, the performance of these parsers decreases as the distance between the genres of their training and test data increases.
Therefore, enhancing the performance of parsers when trained on small manually annotated datasets is of great importance, both when the seed and test data are taken from the same domain (the in-domain scenario) and when they are taken from different domains (the out-of-domain or parser adaptation scenario). Since the problem is the expense in manual annotation, we define small to be sentences which are the sizes of sentence sets that can be manually annotated by constituent structure in a
