Scientific report: "Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling"

Dmitriy Dligach, Department of Computer Science, University of Colorado at Boulder
Martha Palmer, Department of Linguistics, University of Colorado at Boulder

Abstract

Active Learning (AL) is typically initialized with a small seed of examples selected randomly. However, when the distribution of classes in the data is skewed, some classes may be missed, resulting in slow learning progress. Our contribution is twofold: (1) we show that an unsupervised technique based on language modeling is effective in selecting rare class examples, and (2) we use this technique for seeding AL and demonstrate that it leads to a higher learning rate. The evaluation is conducted in the context of word sense disambiguation.

1 Introduction

Active learning (AL) (Settles, 2009) has become a popular research field due to its potential benefits: it can lead to drastic reductions in the amount of annotation that is necessary for training a highly accurate statistical classifier. Unlike in a random sampling approach, where unlabeled data is selected for annotation randomly, AL delegates the selection of unlabeled data to the classifier. In a typical AL setup, a classifier is trained on a small sample of the data (usually selected randomly), known as the seed examples.
The classifier is subsequently applied to a pool of unlabeled data with the purpose of selecting additional examples that the classifier views as informative. The selected data is annotated and the cycle is repeated, allowing the learner to quickly refine the decision boundary between the classes. Unfortunately, AL is susceptible to a shortcoming known as the missed cluster effect (Schütze et al., 2006) and its special case called the missed class effect (Tomanek et al., 2009). The missed cluster effect is a consequence of the fact that seed examples influence the direction the learner takes in its exploration of the ...
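
The pool-based cycle described above (random seed, train, select the most informative example, annotate, repeat) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the one-dimensional features, the nearest-centroid classifier, the margin-based notion of "informative", and the names `train_centroids`, `margin`, `active_learning`, and `miss_probability` are all assumptions made for the example. The `oracle` callable stands in for the human annotator.

```python
import random


def train_centroids(labeled):
    """Toy classifier (assumption): mean feature value per observed class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def margin(centroids, x):
    """Gap between the two nearest centroids; a small gap means an uncertain example."""
    dists = sorted(abs(x - c) for c in centroids.values())
    # With a single known class there is no boundary, so every example looks
    # equally uninformative. This is how a missed class stalls the learner.
    return dists[1] - dists[0] if len(dists) > 1 else float("inf")


def active_learning(pool, oracle, seed_size=5, rounds=10, rng=None):
    """Run a pool-based AL loop and return (labeled examples, final model)."""
    rng = rng or random.Random(0)
    pool = list(pool)
    rng.shuffle(pool)
    # Seed: a small random sample that is annotated up front.
    labeled = [(x, oracle(x)) for x in pool[:seed_size]]
    unlabeled = pool[seed_size:]
    for _ in range(rounds):
        if not unlabeled:
            break
        model = train_centroids(labeled)
        # Query the example the current model is least certain about.
        query = min(unlabeled, key=lambda x: margin(model, x))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
    return labeled, train_centroids(labeled)


def miss_probability(class_prior, seed_size):
    """Chance that an i.i.d. random seed contains no example of a given class."""
    return (1.0 - class_prior) ** seed_size
```

Under the i.i.d. sampling assumption, `miss_probability(0.05, 10)` is about 0.60: a class covering 5% of the data is more likely than not to be absent from a random seed of ten examples, which is the kind of skewed setting the paper targets.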
