Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling

Feng Jiao, University of Waterloo

Abstract

We present a new semi-supervised training procedure for conditional random fields (CRFs) that can be used to train sequence segmentors and labelers from a combination of labeled and unlabeled training data. Our approach is based on extending the minimum entropy regularization framework to the structured prediction case, yielding a training objective that combines unlabeled conditional entropy with labeled conditional likelihood. Although the training objective is no longer concave, it can still be used to improve an initial model obtained from supervised training by iterative ascent. We apply our new training algorithm to the problem of identifying gene and protein mentions in biological texts, and show that incorporating unlabeled data improves the performance of the supervised CRF in this case.

1 Introduction

Semi-supervised learning is often touted as one of the most natural forms of training for language processing tasks, since unlabeled data is so plentiful whereas labeled data is usually quite limited or expensive to obtain. The attractiveness of semi-supervised learning for language tasks is further heightened by the fact that the models learned are large and complex, and generally even thousands of labeled examples can only sparsely cover the parameter space. Moreover, in complex structured prediction tasks such as parsing or sequence modeling (part-of-speech tagging, word segmentation, named entity recognition, and so on), it is considerably more difficult to obtain labeled training data than for classification tasks such as document classification, since hand-labeling individual words and word boundaries is much harder than assigning text-level class labels. Many approaches have been proposed for semi-supervised learning in the past, including generative models (Castelli and Cover, 1996; Cohen and Cozman, 2006; Nigam et al., 2000), self-learning
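To make the training objective described in the abstract concrete, the following is a minimal sketch of a combined objective of that kind: labeled conditional likelihood plus a negative conditional-entropy term on the unlabeled data. The notation here (labeled pairs (x^(i), y^(i)) for i = 1..N, unlabeled inputs x^(j) for j = N+1..M, model p_theta, and the trade-off weight gamma) is assumed for illustration and is not taken from this excerpt.

  \mathcal{L}(\theta) \;=\; \sum_{i=1}^{N} \log p_\theta\!\left(y^{(i)} \mid x^{(i)}\right) \;+\; \gamma \sum_{j=N+1}^{M} \sum_{y} p_\theta\!\left(y \mid x^{(j)}\right) \log p_\theta\!\left(y \mid x^{(j)}\right)

The second summation is the negative conditional entropy of the model on the unlabeled inputs, so maximizing the objective trades off fitting the labeled data against making confident, low-entropy predictions on the unlabeled sequences. As the abstract notes, the added term makes the objective non-concave, which is why it is used to improve an initial supervised model by iterative ascent rather than optimized from scratch.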
