Scientific report: "Chinese Verb Sense Discrimination Using an EM Clustering Model with Rich Linguistic Features"

The EM clustering algorithm (Hofmann and Puzicha, 1998) used here is an unsupervised machine learning algorithm that has been applied to many NLP tasks, such as inducing a semantically labeled lexicon and determining lexical choice in machine translation (Rooth et al., 1998), automatic acquisition of verb semantic classes (Schulte im Walde, 2000), and automatic semantic labeling (Gildea and Jurafsky, 2002).

Jinying Chen, Martha Palmer
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104
jinying mpalmer @

Abstract

This paper discusses the application of the Expectation-Maximization (EM) clustering algorithm to the task of Chinese verb sense discrimination. The model utilizes rich linguistic features that capture predicate-argument structure information of the target verbs. A semantic taxonomy for Chinese nouns, built semi-automatically from two electronic Chinese semantic dictionaries, provides semantic features for the model. Purity and normalized mutual information are used to evaluate the clustering performance on 12 Chinese verbs. The experimental results show that the EM clustering model can successfully learn sense or sense group distinctions for most of the verbs. We further enhanced the model with certain fine-grained semantic categories called lexical sets. Our results indicate that these lexical sets improve the model's performance for the three most challenging verbs chosen from the first set of experiments.

1 Introduction

Highly ambiguous words may lead to irrelevant document retrieval and inaccurate lexical choice in machine translation (Palmer et al., 2000), which suggests that word sense disambiguation (WSD) is beneficial and sometimes even necessary in such NLP tasks.
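The two evaluation measures named in the abstract, purity and normalized mutual information (NMI), can be illustrated with a short sketch. This is not the authors' code. The function names are hypothetical, and the normalization of mutual information by the geometric mean of the two entropies is an assumption: it is one common convention, and the exact definition used in the paper does not appear in this excerpt.

```python
import math
from collections import Counter


def purity(clusters, labels):
    """Share of instances that carry the majority gold label of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)


def entropy(assignments):
    """Shannon entropy (natural log) of a list of discrete assignments."""
    n = len(assignments)
    return -sum((k / n) * math.log(k / n) for k in Counter(assignments).values())


def nmi(clusters, labels):
    """Mutual information between clusters and gold labels, normalized by
    sqrt(H(clusters) * H(labels)). The normalizer is an assumed convention."""
    n = len(labels)
    joint = Counter(zip(clusters, labels))
    pc, py = Counter(clusters), Counter(labels)
    mi = sum((k / n) * math.log(k * n / (pc[c] * py[y]))
             for (c, y), k in joint.items())
    hc, hy = entropy(clusters), entropy(labels)
    if hc == 0 or hy == 0:
        # One side has a single class, so there is no information to share.
        return 0.0
    return mi / math.sqrt(hc * hy)
```

Both functions take two parallel lists: the cluster id assigned to each instance and its gold sense label. Purity rewards clusters dominated by a single sense, while NMI also penalizes splitting one sense across many clusters.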
This paper addresses WSD in Chinese by developing an Expectation-Maximization (EM) clustering model to learn Chinese verb sense distinctions. The major goal is sense discrimination rather than sense labeling, similar to Schütze (1998). The basic idea is to divide instances of a word into several clusters that have no sense labels. The instances in the same cluster are regarded as having the same meaning. Word sense discrimination can be applied to document retrieval and similar tasks in information access and to .
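The idea of dividing instances of a verb into unlabeled clusters can be sketched with a small EM procedure. The sketch below assumes a generic mixture model in which features are conditionally independent given the cluster. The paper's actual model and feature set are not shown in this excerpt, so the function name `em_cluster`, the smoothing constant, and the feature strings (such as `obj=FOOD`) are all hypothetical.

```python
import math
import random
from collections import defaultdict


def em_cluster(instances, n_clusters, n_iter=50, seed=0, smooth=0.1):
    """Soft-cluster lists of categorical features; returns P(cluster | instance)."""
    rng = random.Random(seed)
    vocab = sorted({f for inst in instances for f in inst})
    # Random soft initialization of the posteriors breaks cluster symmetry.
    post = []
    for _ in instances:
        w = [rng.random() + 1e-3 for _ in range(n_clusters)]
        post.append([x / sum(w) for x in w])

    for _ in range(n_iter):
        # M-step: smoothed estimates of P(cluster) and P(feature | cluster).
        denom = len(instances) + smooth * n_clusters
        prior = [(sum(p[k] for p in post) + smooth) / denom
                 for k in range(n_clusters)]
        cond = []
        for k in range(n_clusters):
            counts = defaultdict(float)
            for inst, p in zip(instances, post):
                for f in inst:
                    counts[f] += p[k]
            total = sum(counts.values()) + smooth * len(vocab)
            cond.append({f: (counts[f] + smooth) / total for f in vocab})
        # E-step: recompute posteriors in log space for numerical stability.
        post = []
        for inst in instances:
            logp = [math.log(prior[k]) + sum(math.log(cond[k][f]) for f in inst)
                    for k in range(n_clusters)]
            m = max(logp)
            w = [math.exp(x - m) for x in logp]
            post.append([x / sum(w) for x in w])
    return post


# Hypothetical instances of one verb, each described by argument categories.
instances = [
    ["subj=HUMAN", "obj=FOOD"],
    ["subj=HUMAN", "obj=FOOD"],
    ["subj=ANIMAL", "obj=FOOD"],
    ["subj=ORG", "obj=MONEY"],
    ["subj=ORG", "obj=MONEY"],
    ["subj=HUMAN", "obj=MONEY"],
]
post = em_cluster(instances, n_clusters=2)
# Hard assignment: each instance goes to its most probable cluster.
hard = [max(range(2), key=p.__getitem__) for p in post]
```

The clusters carry no sense labels: the output only says which instances are grouped together, which matches the discrimination setting described above. Results depend on the random initialization, so in practice several restarts would typically be compared.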