Scientific report: "Optimal Multi-Paragraph Text Segmentation by Dynamic Programming"

Oskari Heinonen
University of Helsinki, Department of Computer Science
P.O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland

Abstract

There exist several methods of calculating a similarity curve, or a sequence of similarity values, representing the lexical cohesion of successive text constituents, e.g., paragraphs. Methods for deciding the locations of fragment boundaries are, however, scarce. We propose a fragmentation method based on dynamic programming. The method is theoretically sound and guaranteed to provide an optimal splitting on the basis of a similarity curve, a preferred fragment length, and a cost function defined. The method is especially useful when control on fragment size is of importance.

1 Introduction

Electronic full-text documents and digital libraries make the utilization of texts much more effective than before; yet, they pose new problems and requirements. For example, document retrieval based on string searches typically returns either the whole document or just the occurrences of the searched words. What the user is often after, however, is a microdocument: a part of the document that contains the occurrences and is reasonably self-contained. Microdocuments can be created by utilizing lexical cohesion (term repetition and semantic relations) present in the text.
There exist several methods of calculating a similarity curve, or a sequence of similarity values, representing the lexical cohesion of successive constituents (such as paragraphs) of text (see, e.g., Hearst, 1994; Hearst, 1997; Kozima, 1993; Morris and Hirst, 1991; Yaari, 1997; Youmans, 1991). Methods for deciding the locations of fragment boundaries are, however, not that common, and those that exist are often rather heuristic in nature.

To evaluate our fragmentation method, to be explained in Section 2, we calculate the paragraph similarities as follows. We employ stemming, remove stopwords, and count the ...
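
The two ingredients introduced so far, a similarity curve over adjacent paragraphs and an optimal choice of boundaries given a preferred fragment length, can be illustrated with a short Python sketch. It is not the paper's exact formulation: the excerpt does not give the similarity formula or the cost function, so the sketch assumes cosine similarity over term counts, a deliberately crude suffix-stripping "stemmer", a tiny stopword list, and a cost that adds a quadratic length penalty to the similarity at each cut. All function and parameter names (`term_counts`, `similarity_curve`, `segment`, `length_weight`) are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on", "by"}


def term_counts(paragraph):
    """Lowercase, drop stopwords, strip a few suffixes, and count terms."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    stems = [re.sub(r"(ing|ed|es|s)$", "", w) for w in words if w not in STOPWORDS]
    return Counter(s for s in stems if s)


def cosine(a, b):
    """Cosine similarity of two term-count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_curve(paragraphs):
    """sims[i] is the similarity between paragraph i and paragraph i + 1."""
    counts = [term_counts(p) for p in paragraphs]
    return [cosine(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]


def segment(sims, lengths, preferred_len, length_weight=1.0):
    """Dynamic programming over boundary positions.

    Assumed cost: for every fragment, a quadratic penalty on its relative
    deviation from preferred_len, plus the similarity value at every cut
    (cutting between similar paragraphs is expensive).
    Returns (boundaries, total_cost); a boundary b means a new fragment
    starts at paragraph index b.
    """
    n = len(lengths)
    prefix = [0]
    for length in lengths:
        prefix.append(prefix[-1] + length)

    best = [math.inf] * (n + 1)  # best[j]: minimal cost for paragraphs 0..j-1
    back = [0] * (n + 1)         # back[j]: start index of the last fragment
    best[0] = 0.0
    for j in range(1, n + 1):
        # A cut after paragraph j-1 costs sims[j-1]; the text end is free.
        boundary_cost = sims[j - 1] if j < n else 0.0
        for i in range(j):
            frag_len = prefix[j] - prefix[i]
            length_cost = length_weight * ((frag_len - preferred_len) / preferred_len) ** 2
            cost = best[i] + length_cost + boundary_cost
            if cost < best[j]:
                best[j] = cost
                back[j] = i

    # Walk the back-pointers to recover the fragment start positions.
    boundaries = []
    j = n
    while j > 0:
        j = back[j]
        if j > 0:
            boundaries.append(j)
    return sorted(boundaries), best[n]
```

With four paragraphs of 100 words each, `sims = [0.9, 0.1, 0.8]`, and a preferred fragment length of 200, the sketch places a single boundary before paragraph index 2: that is where the similarity dips, and both resulting fragments match the preferred length. Because every split is evaluated through the recurrence, the result is optimal with respect to the assumed cost, at a running time quadratic in the number of paragraphs.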
