Scientific report: "Word Clustering and Disambiguation Based on Co-occurrence Data"
Hang Li and Naoki Abe
Theory NEC Laboratory, Real World Computing Partnership
c/o C&C Media Research Laboratories, NEC
4-1-1 Miyazaki, Miyamae-ku, Kawasaki 216-8555, Japan
{lihang,abe}@ccm.cl.nec.co.jp

Abstract

We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and using the acquired word classes to improve the accuracy of syntactic disambiguation. We view this problem as that of estimating a joint probability distribution specifying the joint probabilities of word pairs, such as noun verb pairs. We propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a probability distribution. Our method is a natural extension of those proposed in (Brown et al., 1992) and (Li and Abe, 1996), and overcomes their drawbacks while retaining their advantages. We then combined this clustering method with the disambiguation method of (Li and Abe, 1995) to derive a disambiguation method that makes use of both automatically constructed thesauruses and a hand-made thesaurus. The overall disambiguation accuracy achieved by our method is 85.2%, which compares favorably against the accuracy (82.4%) obtained by the state-of-the-art disambiguation method of (Brill and Resnik, 1994).
1 Introduction

We address the problem of clustering words, or that of constructing a thesaurus, based on co-occurrence data. We view this problem as that of estimating a joint probability distribution over word pairs, specifying the joint probabilities of word pairs such as noun verb pairs. In this paper, we assume that the joint distribution can be expressed in the following manner, which is stated for noun verb pairs for the sake of readability: the joint probability of a noun and a verb is expressed as the product of the joint probability of the noun class and the verb class which the noun and the verb respectively belong to, and the conditional probabilities of the noun and the ...
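
The factorization described in the introduction can be illustrated with a small sketch. It assumes a hard clustering in which each word belongs to exactly one class, and it assumes that the truncated sentence refers to the conditional probabilities of the noun and the verb given their classes, that is, P(n, v) = P(C_n, C_v) P(n | C_n) P(v | C_v). The toy data, the class labels, and the function name `class_based_joint` are hypothetical, and the probabilities are plain relative frequencies rather than the MDL-based estimates the paper proposes.

```python
from collections import Counter


def class_based_joint(pairs, noun_class, verb_class):
    """Return a function prob(n, v) = P(Cn, Cv) * P(n | Cn) * P(v | Cv).

    All three factors are relative frequencies computed from `pairs`.
    This is an illustrative assumption, not the paper's MDL estimator.
    """
    total = len(pairs)
    # Counts of class pairs, words, and classes in the observed data.
    class_pair = Counter((noun_class[n], verb_class[v]) for n, v in pairs)
    noun = Counter(n for n, _ in pairs)
    verb = Counter(v for _, v in pairs)
    noun_cls = Counter(noun_class[n] for n, _ in pairs)
    verb_cls = Counter(verb_class[v] for _, v in pairs)

    def prob(n, v):
        cn, cv = noun_class[n], verb_class[v]
        if noun_cls[cn] == 0 or verb_cls[cv] == 0:
            return 0.0  # class never observed in the data
        p_classes = class_pair[(cn, cv)] / total  # P(Cn, Cv)
        p_noun = noun[n] / noun_cls[cn]           # P(n | Cn)
        p_verb = verb[v] / verb_cls[cv]           # P(v | Cv)
        return p_classes * p_noun * p_verb

    return prob


# Hypothetical toy co-occurrence data: (noun, verb) pairs.
pairs = [("wine", "drink"), ("beer", "drink"), ("wine", "sip"), ("bread", "eat")]
noun_class = {"wine": "BEVERAGE", "beer": "BEVERAGE", "bread": "FOOD"}
verb_class = {"drink": "DRINK", "sip": "DRINK", "eat": "EAT"}

prob = class_based_joint(pairs, noun_class, verb_class)

# ("beer", "sip") never occurs in the data, yet it receives a nonzero
# probability through its classes: 3/4 * 1/3 * 1/3, about 0.083.
p_beer_sip = prob("beer", "sip")

# The estimates over all noun-verb combinations sum to 1.
total_mass = sum(prob(n, v) for n in noun_class for v in verb_class)
```

The unseen pair ("beer", "sip") illustrates why word classes can help disambiguation: probability mass is shared among the members of a class, so sparse co-occurrence counts are smoothed.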