Scientific report: "Finding document topics for improving topic segmentation"

Olivier Ferret
CEA LIST, LIC2M
18 route du Panorama, BP6, Fontenay aux Roses, F-92265 France
ferreto@

Abstract

Topic segmentation and identification are often tackled as separate problems, whereas they are both part of topic analysis. In this article, we study how topic identification can help to improve a topic segmenter based on word reiteration. We first present an unsupervised method for discovering the topics of a text. Then, we detail how these topics are used by segmentation for finding topical similarities between text segments. Finally, we show the value of the proposed method through the results of an evaluation done for both French and English.

1 Introduction

In this article, we address the problem of linear topic segmentation, which consists in segmenting documents into topically homogeneous segments that do not overlap. This part of the Discourse Analysis field has received constant interest since the initial work in this domain, such as (Hearst, 1994). One criterion for classifying topic segmentation systems is the kind of knowledge they depend on. Most of them rely only on surface features of documents: word reiteration in (Hearst, 1994; Choi, 2000; Utiyama and Isahara, 2001; Galley et al., 2003) or discourse cues in (Passonneau and Litman, 1997; Galley et al., 2003).
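To make the notion of word reiteration concrete, the following is a minimal sketch of a reiteration-based cohesion measure, in the spirit of the surface-feature systems cited above. It is not the segmenter studied in this article: the function names, the block size, and the threshold are assumptions chosen for illustration only. Each gap between two sentences is scored by the cosine similarity of the bags of words on either side, and low-scoring gaps are proposed as topic boundaries.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two bags of words (Counter objects)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(sentences, block_size=2):
    """Score every gap between sentences by the lexical overlap of the
    blocks of up to `block_size` sentences on each side of the gap.
    sims[i] is the score of the gap just before sentence i + 1 (0-based)."""
    bags = [Counter(tokenize(s)) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block_size):gap], Counter())
        right = sum(bags[gap:gap + block_size], Counter())
        sims.append(cosine(left, right))
    return sims


def candidate_boundaries(sims, threshold=0.1):
    """Return the 0-based indices of sentences that would start a new
    segment, i.e. gaps whose cohesion score falls below `threshold`.
    The threshold value is an arbitrary assumption for this sketch."""
    return [i + 1 for i, s in enumerate(sims) if s < threshold]


sentences = [
    "cats chase mice",
    "mice fear cats",
    "stocks fell sharply",
    "markets and stocks fell",
]
sims = gap_similarities(sentences)
boundaries = candidate_boundaries(sims)
```

On this toy input, the only gap with no shared vocabulary is the one before the third sentence, so it is the single proposed boundary. The same sketch also shows the weakness of pure reiteration: two sentences about the same topic that use different words, such as "car" and "automobile", share no tokens and would score zero.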
As such systems do not require external knowledge, they are not sensitive to domains, but they are limited by the type of documents they can be applied to: lexical reiteration is reliable only if concepts are not too frequently expressed by several means (synonyms, etc.), and discourse cues are often rare and corpus-specific. To overcome these difficulties, some systems make use of domain-independent knowledge about lexical cohesion: a lexical network built from a dictionary in (Kozima, 1993); a thesaurus in (Morris and Hirst, 1991); a large set of lexical cooccurrences collected from a corpus in (Choi et al., 2001). To a certain extent .