Scientific report: "Insights from Network Structure for Text Mining"

Zornitsa Kozareva and Eduard Hovy
USC Information Sciences Institute
4676 Admiralty Way, Marina del Rey, CA 90292-6695
kozareva hovy @

Abstract

Text mining and data harvesting algorithms have become popular in the computational linguistics community. They employ patterns that specify the kind of information to be harvested, and usually bootstrap either the pattern learning or the term harvesting process (or both) in a recursive cycle, using data learned in one step to generate more seeds for the next. They therefore treat the source text corpus as a network, in which words are the nodes and relations linking them are the edges. The results of computational network analysis, especially from the world wide web, are thus applicable. Surprisingly, these results have not yet been broadly introduced into the computational linguistics community. In this paper we show how various results apply to text mining, how they explain some previously observed phenomena, and how they can be helpful for computational linguistics applications.

1 Introduction

Text mining / harvesting algorithms have been applied in recent years for various uses, including learning of semantic constraints for verb participants (Lin and Pantel, 2002), related pairs in various relations, such as part-whole (Girju et al., 2003), cause (Pantel and Pennacchiotti, 2006), and other typical information extraction relations, large collections of entities (Soderland et al., 1999; Etzioni et al., 2005), features of objects (Pasca, 2004), and ontologies (Carlson et al., 2010). They generally start with one or more seed terms and employ patterns that specify the desired information as it relates to the seed(s). Several approaches have been developed specifically for learning patterns, including guided pattern collection with manual filtering (Riloff and Shepherd, 1997), automated surface-level pattern induction (Agichtein and Gravano, 2000; Ravichandran and Hovy, 2002), probabilistic methods for taxonomy ...
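
The bootstrapping cycle and the network view described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the single surface pattern "such as <seed> and <term>", and the function names harvest, bootstrap and degrees are all assumptions made for illustration. Each harvested term is fed back as a new seed, and every (seed, harvested term) pair is recorded as a directed edge, so the harvested vocabulary becomes a graph whose node degrees can be inspected.

```python
import re
from collections import defaultdict, deque

# Hypothetical toy corpus; a real harvester would query a large text collection.
CORPUS = [
    "animals such as lions and tigers",
    "animals such as tigers and leopards",
    "animals such as leopards and lions",
    "animals such as tigers and bears",
]


def harvest(seed, corpus):
    """Return the terms matched by the assumed pattern 'such as <seed> and <term>'."""
    pattern = re.compile(r"such as " + re.escape(seed) + r" and (\w+)")
    found = []
    for sentence in corpus:
        found.extend(pattern.findall(sentence))
    return found


def bootstrap(seeds, corpus):
    """Repeatedly harvest terms, reusing each newly found term as a seed.

    Returns the set of all terms seen and a dict mapping each seed to the
    set of terms it harvested (the directed edges of the term network).
    """
    edges = defaultdict(set)
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        term = queue.popleft()
        for new_term in harvest(term, corpus):
            edges[term].add(new_term)
            if new_term not in seen:  # only unseen terms become new seeds
                seen.add(new_term)
                queue.append(new_term)
    return seen, dict(edges)


def degrees(edges):
    """Compute in-degree and out-degree for the nodes that have any edges."""
    out_deg = {node: len(targets) for node, targets in edges.items()}
    in_deg = defaultdict(int)
    for targets in edges.values():
        for target in targets:
            in_deg[target] += 1
    return dict(in_deg), out_deg


if __name__ == "__main__":
    terms, edges = bootstrap(["lions"], CORPUS)
    in_deg, out_deg = degrees(edges)
    print(sorted(terms))
    print(in_deg, out_deg)
```

Starting from the single seed "lions", the loop stops once no unseen term is harvested. The in-degree of a node counts how many different seeds led to that term, and the out-degree counts how many terms it produced when used as a seed; these are the kinds of quantities that network analysis examines.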
