tailieunhanh - Scientific report: "How to thematically segment texts by using lexical cohesion?"

How to thematically segment texts by using lexical cohesion?

Olivier Ferret
LIMSI-CNRS, BP 133, F-91403 Orsay Cedex, France
ferret@

Abstract

This article outlines a quantitative method for segmenting texts into thematically coherent units. This method relies on a network of lexical collocations to compute the thematic coherence of the different parts of a text from the lexical cohesiveness of their words. We also present the results of an experiment on locating the boundaries between a series of concatenated texts.

1 Introduction

Several quantitative methods exist for thematically segmenting texts. Most of them are based on the following assumption: the thematic coherence of a text segment finds expression at the lexical level. Hearst (1997) and Nomoto and Nitta (1994) detect this coherence through patterns of lexical cooccurrence. Morris and Hirst (1991) and Kozima (1993) find topic boundaries in texts by using lexical cohesion. The first methods are applied to texts, such as expository texts, whose vocabulary is often very specific. As a concept is always expressed by the same word, word repetitions are thematically significant in these texts. The use of lexical cohesion makes it possible to bypass the problem posed by texts, such as narratives, in which a concept is often expressed by different means. However, this second approach requires knowledge about the cohesion between words. Morris and Hirst (1991) extract this knowledge from a thesaurus. Kozima (1993) exploits a lexical network built from a machine-readable dictionary (MRD).

This article presents a method for thematically segmenting texts by using knowledge about lexical cohesion that has been built automatically. This knowledge takes the form of a network of lexical collocations. We claim that this network is as suitable as a thesaurus or an MRD for segmenting texts. Moreover, building it for a specific domain or for another language is quick.

2 Method

The segmentation algorithm we propose includes two steps. First, a computation
