Scientific report: "Linear Text Segmentation using a Dynamic Programming Algorithm"

Athanasios Kehagias
Dept. of Math., Phys. and Comp. Sciences, Aristotle Univ. of Thessaloniki, GREECE
kehagias@

Fragkou Pavlina, Vassilios Petridis
Dept. of Elect. and Computer Eng., Aristotle Univ. of Thessaloniki, GREECE
fragou@ petridis@

Abstract

In this paper we introduce a dynamic programming algorithm that performs linear text segmentation by global minimization of a segmentation cost function consisting of (a) within-segment word similarity and (b) prior information about segment length. Evaluated on Choi's text collection, the algorithm achieves the best segmentation accuracy so far reported in the literature.

Keywords: Text Segmentation, Document Retrieval, Information Retrieval, Machine Learning.

1 Introduction

Text segmentation is an important problem in information retrieval. Its goal is the division of a text into homogeneous ("lexically coherent") segments, i.e. segments exhibiting the following properties: (a) each segment deals with a particular subject and (b) contiguous segments deal with different subjects. Those segments can be retrieved from a large database of unformatted (or loosely formatted) text as being relevant to a query.

This paper presents a dynamic programming algorithm which performs linear segmentation (as opposed to hierarchical segmentation (Yaari, 1997)) by global minimization of a segmentation cost. The segmentation cost is defined by a function consisting of two factors: (a) within-segment word similarity and (b) prior information about segment length. Our algorithm can be applied either to large texts, to segment them into their constituent parts (e.g. to segment an article into sections), or to a stream of independent, concatenated texts (e.g. to segment a transcript of news into separate stories). For the calculation of the segment homogeneity or alternatively ...
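The idea described in the introduction, globally minimizing a cost that trades within-segment similarity against a segment-length prior, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' exact formulation: the excerpt does not give the cost function, so the binary sentence similarity, the quadratic length penalty, and the parameter names `mu`, `sigma` and `gamma` are all hypothetical choices made for this sketch.

```python
from typing import List, Sequence, Set


def segment(sentences: Sequence[Set[str]], mu: float, sigma: float,
            gamma: float = 0.5) -> List[int]:
    """Return the exclusive end index of each segment.

    ASSUMPTIONS (not taken from the paper excerpt):
      - two sentences are "similar" (1) if they share at least one word;
      - the length prior is a quadratic penalty around expected length `mu`
        with spread `sigma`;
      - `gamma` weights the length prior against the similarity term.
    """
    n = len(sentences)
    sim = [[1 if sentences[i] & sentences[j] else 0 for j in range(n)]
           for i in range(n)]

    # 2D prefix sums so the similarity mass of any segment is an O(1) lookup.
    pre = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            pre[i + 1][j + 1] = (sim[i][j] + pre[i][j + 1]
                                 + pre[i + 1][j] - pre[i][j])

    def cost(s: int, t: int) -> float:
        """Cost of one segment covering sentences s..t-1 (lower is better)."""
        length = t - s
        block = pre[t][t] - pre[s][t] - pre[t][s] + pre[s][s]
        length_penalty = (length - mu) ** 2 / (2 * sigma ** 2)
        density = block / (length * length)  # fraction of similar pairs
        return gamma * length_penalty - (1 - gamma) * density

    # best[t] = minimal total cost of segmenting the first t sentences.
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for t in range(1, n + 1):
        for s in range(t):
            candidate = best[s] + cost(s, t)
            if candidate < best[t]:
                best[t], back[t] = candidate, s

    # Follow the back-pointers from the end to recover the boundaries.
    bounds: List[int] = []
    t = n
    while t > 0:
        bounds.append(t)
        t = back[t]
    return bounds[::-1]
```

Because every prefix is solved once and every candidate last segment is scored in constant time, the search over all segmentations takes O(n^2) steps for n sentences. For instance, with three sentences about pets followed by three about markets, `mu=3` and `sigma=1`, the sketch places its single internal boundary after the third sentence.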
