The Effect of Corpus Size in Combining Supervised and Unsupervised Training for Disambiguation

Michaela Atterer
Institute for NLP
University of Stuttgart
atterer@ims.uni-stuttgart.de

Hinrich Schütze
Institute for NLP
University of Stuttgart
hinrich@hotmail.com

Abstract

We investigate the effect of corpus size in combining supervised and unsupervised learning for two types of attachment decisions: relative clause attachment and prepositional phrase attachment. The supervised component is Collins’ parser, trained on the Wall Street Journal. The unsupervised component gathers lexical statistics from an unannotated corpus of newswire text. We find that the combined system only improves the performance of the parser for small training sets. Surprisingly, the size of the unannotated corpus has little effect due to the noisiness of the lexical statistics acquired by unsupervised learning.

1 Introduction

The best performing systems for many tasks in natural language processing are based on supervised training on annotated corpora such as the Penn Treebank (Marcus et al., 1993) and the prepositional phrase data set first described in (Ratnaparkhi et al., 1994). However, the production of training sets is expensive. They are not available for many domains and languages.
This motivates research on combining supervised with unsupervised learning, since unannotated text is in ample supply for most domains in the major languages of the world. The question arises how much annotated and unannotated data is necessary in combination learning strategies. We investigate this question for two attachment ambiguity problems: relative clause (RC) attachment and prepositional phrase (PP) attachment. The supervised component is Collins’ parser (Collins, 1997), trained on the Wall Street Journal. The unsupervised component gathers lexical statistics from an unannotated corpus of newswire text. The sizes of both types of corpora, annotated and unannotated, are of interest. We would expect that large annotated ...