Scientific report: "Efficient Unsupervised Discovery of Word Categories Using Symmetric Patterns and High Frequency Words"

Dmitry Davidov
ICNC, The Hebrew University, Jerusalem 91904, Israel
dmitry@

Ari Rappoport
Institute of Computer Science, The Hebrew University, Jerusalem 91904, Israel
arir

Abstract

We present a novel approach for discovering word categories, sets of words sharing a significant aspect of their meaning. We utilize meta-patterns of high-frequency words and content words in order to discover pattern candidates. Symmetric patterns are then identified using graph-based measures, and word categories are created based on graph clique sets. Our method is the first pattern-based method that requires no corpus annotation or manually provided seed patterns or words. We evaluate our algorithm on very large corpora in two languages, using both human judgments and WordNet-based evaluation. Our fully unsupervised results are superior to previous work that used a POS-tagged corpus, and computation time for huge corpora is orders of magnitude shorter than previously reported.

1 Introduction

Lexical resources are crucial in most NLP tasks and are extensively used by people. Manual compilation of lexical resources is labor intensive, error prone, and susceptible to arbitrary human decisions.
Hence there is a need for automatic authoring that would be as unsupervised and language-independent as possible. An important type of lexical resource is that given by grouping words into categories. In general, the notion of a category is a fundamental one in cognitive psychology (Matlin, 2005). A lexical category is a set of words that share a significant aspect of their meaning, e.g., sets of words denoting vehicles, types of food, tool names, etc. A word can obviously belong to more than a single category. We will use "category" instead of "lexical category" for brevity¹. Grouping of words into categories is useful in itself, e.g., for the construction of thesauri, and can serve as