Scientific report: "Unsupervised Learning of Field Segmentation Models for Information Extraction"
Trond Grenager, Computer Science Department, Stanford University, Stanford, CA 94305, grenager@cs.
Dan Klein, Computer Science Division . Berkeley, Berkeley, CA 94709, klein@
Christopher D. Manning, Computer Science Department, Stanford University, Stanford, CA 94305, manning@

Abstract

The applicability of many current information extraction techniques is severely limited by the need for supervised training data. We demonstrate that for certain field structured extraction tasks, such as classified advertisements and bibliographic citations, small amounts of prior knowledge can be used to learn effective models in a primarily unsupervised fashion. Although hidden Markov models (HMMs) provide a suitable generative model for field structured text, general unsupervised HMM learning fails to learn useful structure in either of our domains. However, one can dramatically improve the quality of the learned structure by exploiting simple prior knowledge of the desired solutions.
In both domains, we found that unsupervised methods can attain accuracies with 400 unlabeled examples comparable to those attained by supervised methods on 50 labeled examples, and that semi-supervised methods can make good use of small amounts of labeled data.

1 Introduction

Information extraction is potentially one of the most useful applications enabled by current natural language processing technology. However, unlike general tools like parsers or taggers, which generalize reasonably beyond their training domains, extraction systems must be entirely retrained for each application. As an example, consider the task of turning a set of diverse classified advertisements into a queryable database; each type of ad would require tailored training data for a supervised system. Approaches which required little or no training data would therefore provide substantial resource savings and extend the practicality