Scientific report: "SenseClusters: Unsupervised Clustering and Labeling of Similar Contexts"

Anagha Kulkarni and Ted Pedersen
Department of Computer Science, University of Minnesota, Duluth, MN 55812
kulka020 tpederse @ http

Abstract

SenseClusters is a freely available system that identifies similar contexts in text. It relies on lexical features to build first- and second-order representations of contexts, which are then clustered using unsupervised methods. It was originally developed to discriminate among contexts centered around a given target word, but can now be applied more generally. It also supports methods that create descriptive and discriminating labels for the discovered clusters.

1 Introduction

SenseClusters seeks to group together units of text (referred to as contexts) that are similar to each other, using lexical features and unsupervised clustering. Our initial work (Purandare and Pedersen, 2004) focused on word sense discrimination, which takes as input contexts that each contain a given target word, and produces as output clusters that are presumed to correspond to the different senses of the word. This follows the hypothesis of Miller and Charles (1991) that words that occur in similar contexts will have similar meanings. We have shown that these methods can be extended to proper name discrimination (Pedersen et al., 2005). People, places, or companies often share the same name, and this can cause a considerable amount of confusion when carrying out Web search or other information retrieval applications.
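The word sense discrimination setup described above (contexts that each contain a target word go in, clusters presumed to correspond to senses come out) can be illustrated with a minimal sketch. This is not the SenseClusters implementation: the function names, the stopword list, the toy contexts, and the choice of unigram counts with average-link agglomerative clustering are all assumptions made for illustration, and only a first-order representation is shown.

```python
import math
from collections import Counter

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "on", "in", "at", "to", "and", "i", "we"}

def first_order_vector(context, target):
    """Count the lexical features (here: unigrams) occurring in one context."""
    tokens = context.lower().split()
    return Counter(t for t in tokens if t != target and t not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def cluster_contexts(contexts, target, k):
    """Average-link agglomerative clustering of contexts down to k clusters."""
    vectors = [first_order_vector(c, target) for c in contexts]
    clusters = [[i] for i in range(len(contexts))]
    while len(clusters) > k:
        best = None
        # Find the pair of clusters with the highest average pairwise similarity.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    return [sorted(c) for c in clusters]

# Toy contexts for the ambiguous target word "bank" (invented for this sketch).
contexts = [
    "I deposited money at the bank and opened a savings account",
    "the bank approved the loan and the savings account",
    "we walked along the river bank watching the water",
    "the boat drifted to the muddy bank of the river",
]
clusters = cluster_contexts(contexts, target="bank", k=2)
print(clusters)
```

In this toy run the two financial contexts share the features "savings" and "account" and the two river contexts share "river", so each pair ends up in its own cluster. The number of clusters `k` is supplied by hand here, which is a simplification of the sketch.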
Name discrimination seeks to group together the contexts that refer to a unique underlying individual, and allow the user to recognize that the same name is being used to refer to multiple entities. We have also extended SenseClusters to cluster contexts that are not centered around any target word, which we refer to as headless clustering. Automatic email categorization is an example of a headless clustering task, since each message can be considered