
Phrase Clustering for Discriminative Learning

Dekang Lin and Xiaoyun Wu
Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA
{lindek, xiaoyunwu}@

Abstract

We present a simple and scalable algorithm for clustering tens of millions of phrases and use the resulting clusters as features in discriminative classifiers. To demonstrate the power and generality of this approach, we apply the method in two very different applications: named entity recognition and query classification. Our results show that phrase clusters offer significant improvements over word clusters. Our NER system achieves the best current result on the widely used CoNLL benchmark. Our query classifier is on par with the best system in KDDCUP 2005 without resorting to labor-intensive knowledge engineering efforts.

1 Introduction

Over the past decade, supervised learning algorithms have gained widespread acceptance in natural language processing (NLP). They have become the workhorse in almost all sub-areas and components of NLP, including part-of-speech tagging, chunking, named entity recognition, and parsing. To apply supervised learning to an NLP problem, one first represents the problem as a vector of features. The learning algorithm then optimizes a regularized, convex objective function that is expressed in terms of these features.
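As a concrete, hypothetical instance of the setup just described, the sketch below trains a binary classifier over sparse feature vectors by gradient descent on an L2-regularized logistic loss. The feature names and examples are invented for illustration and are not from the paper's experiments.

```python
import math

# Toy labeled examples: sparse feature vectors (feature name -> value)
# with binary labels. All features and data here are illustrative.
data = [
    ({"word=Paris": 1.0, "cap": 1.0}, 1),
    ({"word=paris": 1.0}, 0),
    ({"word=London": 1.0, "cap": 1.0}, 1),
    ({"word=runs": 1.0}, 0),
]

def dot(w, x):
    """Sparse dot product between weight dict and feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def train(data, epochs=100, lr=0.1, l2=0.01):
    """Minimize an L2-regularized logistic loss by stochastic gradient descent."""
    w = {}
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-dot(w, x)))  # predicted P(y=1 | x)
            for f, v in x.items():
                # gradient of the per-example loss plus the L2 penalty term
                w[f] = w.get(f, 0.0) - lr * ((p - y) * v + l2 * w.get(f, 0.0))
    return w

w = train(data)
```

In this toy run the capitalization feature appears only in positive examples, so it receives a positive weight, which is the sense in which the classifier's performance depends on having informative features.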
The performance of such learning-based solutions thus crucially depends on the informativeness of the features. The majority of the features in these supervised classifiers are predicated on lexical information, such as word identities. The long-tailed distribution of natural language words implies that most word types will be either unseen or seen only a few times in the labeled training data, even if the data set is a relatively large one (e.g., the Penn Treebank). While labeled data is generally very costly to obtain, there is a vast amount of unlabeled textual data freely available on the web. One way to alleviate the sparsity problem is to adopt a two-stage .
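The long-tail claim is easy to see even on a toy corpus: counting word types in a few invented sentences (the corpus below is made up for illustration) shows that well over half of the types occur exactly once, so a classifier trained only on labeled data of this size would have no lexical statistics for them.

```python
from collections import Counter

# A tiny illustrative corpus; any real corpus shows the same long-tailed shape,
# only far more extreme.
text = ("the cat sat on the mat . the dog saw the cat . "
        "a cat and a dog ran . colorless green ideas sleep furiously .")
counts = Counter(text.split())

# Hapax legomena: word types seen exactly once.
hapax = [w for w, c in counts.items() if c == 1]
print(f"{len(counts)} word types, {len(hapax)} seen only once")
```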