Novel Association Measures Using Web Search with Double Checking

Hsin-Hsi Chen, Ming-Shun Lin, Yu-Chuan Wei
Department of Computer Science and Information Engineering
National Taiwan University, Taipei, Taiwan
hhchen@ mslin ycwei @

Abstract

A web search with double checking model is proposed to explore the web as a live corpus. Five association measures are presented: variants of Dice, Overlap Ratio, Jaccard, and Cosine, as well as Co-Occurrence Double Check (CODC). In experiments on the Rubenstein-Goodenough benchmark data set, the CODC measure achieves a correlation coefficient that competes with the performance of the model using WordNet. Experiments on link detection of named entities, using the strategies of direct association, association matrix, and scalar association matrix, verify that the double-check frequencies are reliable. Further study on named entity clustering shows that the five measures are quite useful. In particular, the CODC measure is very stable in both word-word and name-name experiments. Applying the CODC measure to expand community chains for personal name disambiguation achieves an improvement over the system without community expansion. All the experiments illustrate that the novel model of web search with double checking is feasible for mining associations from the web.

1 Introduction

In statistical natural language processing, the resources used to compute the statistics are indispensable. Different kinds of corpora have been made available, and many language models have been experimented with.
One major issue with corpus-based approaches is whether the corpora adopted reflect up-to-date usage. Languages are alive: new terms and phrases enter daily use all the time, and how to capture these new usages is an important research topic. The Web is a heterogeneous document collection, characterized by its huge scale and dynamic nature. Regarding the Web as a live corpus becomes an .
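
The abstract names Dice, Overlap Ratio, Jaccard, and Cosine without defining them. As a reminder of the classic count-based forms of these four measures, here is a minimal Python sketch. The function names and the arguments `fx`, `fy`, and `fxy` (counts for term X, term Y, and their co-occurrence) are illustrative assumptions. The paper's double-checking variants and the CODC measure are not defined in this excerpt, so they are not reproduced here.

```python
import math


def dice(fx, fy, fxy):
    """Classic Dice coefficient: 2 * f(X,Y) / (f(X) + f(Y))."""
    total = fx + fy
    return 2 * fxy / total if total else 0.0


def overlap(fx, fy, fxy):
    """Classic overlap coefficient: f(X,Y) / min(f(X), f(Y))."""
    smaller = min(fx, fy)
    return fxy / smaller if smaller else 0.0


def jaccard(fx, fy, fxy):
    """Classic Jaccard coefficient: f(X,Y) / (f(X) + f(Y) - f(X,Y))."""
    union = fx + fy - fxy
    return fxy / union if union else 0.0


def cosine(fx, fy, fxy):
    """Classic cosine coefficient on counts: f(X,Y) / sqrt(f(X) * f(Y))."""
    denom = math.sqrt(fx * fy)
    return fxy / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the functions.
    fx, fy, fxy = 100, 50, 25
    for measure in (dice, overlap, jaccard, cosine):
        print(measure.__name__, round(measure(fx, fy, fxy), 4))
```

Each function returns 0.0 when its denominator is zero, so a term with no hits yields no association instead of raising an error. That guard is a design choice of this sketch, not something taken from the paper.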