Scientific report: "Using Bilingual Comparable Corpora and Semi-supervised Clustering for Topic Tracking"
Fumiyo Fukumoto
Interdisciplinary Graduate School of Medicine and Engineering, Univ. of Yamanashi
fukumoto@

Yoshimi Suzuki
Interdisciplinary Graduate School of Medicine and Engineering, Univ. of Yamanashi
ysuzuki@

Abstract

We address the problem of dealing with skewed data and propose a method for estimating effective training stories for the topic tracking task. For the small number of labelled positive stories, we extract story pairs, each consisting of a positive story and its associated stories, from bilingual comparable corpora. To overcome the problem of a large number of labelled negative stories, we group them into clusters using k-means with EM. The results on the TDT corpora show the effectiveness of the method.

1 Introduction

With the exponential growth of information on the Internet, it is becoming increasingly difficult to find and organize relevant materials. Topic tracking, as defined by the TDT project, is a research area that addresses this problem. It starts from a few sample stories and finds all subsequent stories that discuss the target topic. Here, a topic in the TDT context is something that happens at a specific place and time, associated with some specific actions. A wide range of statistical and ML techniques have been applied to topic tracking (Carbonell et al., 1999; Oard, 1999; Franz, 2001; Larkey, 2004).
The main task of these techniques is to tune the parameters or the threshold to produce optimal results. However, parameter tuning is a tricky issue for tracking (Yang, 2000) because the number of initial positive training stories is very small (one to four) and topics are localized in space and time. For example, "Taipei Mayoral Elections" and "U.S. Mid-term Elections" are topics, but "Elections" is not a topic. Therefore, the system needs to estimate whether or not the test stories belong to the same topic with little information about the topic. Moreover, the ...
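
To make the thresholding problem described above concrete, the following is a minimal sketch of a generic similarity-based tracker. It is not the method proposed in the paper: the bag-of-words representation, the centroid of the positive stories, the cosine measure, and the default threshold of 0.3 are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Return a lowercased bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty story is similar to nothing
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average a list of term vectors into one topic representation."""
    total = Counter()
    for v in vectors:
        total.update(v)
    n = len(vectors)
    return {term: count / n for term, count in total.items()}


def track(positive_stories, test_stories, threshold=0.3):
    """Flag each test story as on-topic (True) or off-topic (False).

    The threshold is an illustrative assumption; in practice it is the
    parameter that has to be tuned from very few positive stories.
    """
    if not positive_stories:
        raise ValueError("at least one positive training story is required")
    topic = centroid([term_vector(s) for s in positive_stories])
    return [cosine(topic, term_vector(s)) >= threshold for s in test_stories]


if __name__ == "__main__":
    positives = [
        "taipei mayoral elections candidate",
        "taipei mayor elections vote",
    ]
    tests = ["taipei elections vote count", "stock market falls sharply"]
    print(track(positives, tests))
```

With only one to four positive stories, the centroid in a sketch like this rests on very few terms, so a small change in the threshold can flip many decisions. That sensitivity is one way to read why parameter tuning is described as tricky for tracking.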