Scientific report: "MULTI-PARAGRAPH SEGMENTATION OF EXPOSITORY TEXT"
MULTI-PARAGRAPH SEGMENTATION OF EXPOSITORY TEXT

Marti A. Hearst
Computer Science Division, 571 Evans Hall
University of California, Berkeley
Berkeley, CA 94720
and
Xerox Palo Alto Research Center
marti@

Abstract

This paper describes TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. The algorithm uses domain-independent lexical frequency and distribution information to recognize the interactions of multiple simultaneous themes. Two fully-implemented versions of the algorithm are described and shown to produce segmentation that corresponds well to human judgments of the major subtopic boundaries of thirteen lengthy texts.

INTRODUCTION

The structure of expository texts can be characterized as a sequence of subtopical discussions that occur in the context of a few main topic discussions.
For example, a popular science text called Stargazers, whose main topic is the existence of life on earth and other planets, can be described as consisting of the following subdiscussions (numbers indicate paragraph numbers):

1-3    Intro - the search for life in space
4-5    The moon's chemical composition
6-8    How early proximity of the moon shaped it
9-12   How the moon helped life evolve on earth
13     Improbability of the earth-moon system
14-16  Binary/trinary star systems make life unlikely
17-18  The low probability of non-binary/trinary systems
19-20  Properties of our sun that facilitate life
21     Summary

Subtopic structure is sometimes marked in technical texts by headings and subheadings which divide the text into coherent segments; Brown & Yule (1983: 140) state that this kind of division is one of the most basic in discourse. However, many expository texts consist of long sequences of paragraphs with very little structural demarcation. This paper presents fully-implemented algorithms that use lexical cohesion relations to partition expository texts into multi-paragraph segments that .
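This excerpt breaks off before the algorithms themselves are described. The following is therefore only a minimal sketch of the general idea of using lexical cohesion to find subtopic boundaries, not the paper's method. It assumes that cohesion can be approximated by the cosine similarity of raw word counts in adjacent paragraphs, and that a boundary is hypothesized wherever that similarity falls below a fixed threshold. The function names, the threshold value, and the four toy paragraphs are all hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercased word counts for one block of text (no stemming, no stop list)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(paragraphs):
    """Lexical similarity across each gap between adjacent paragraphs."""
    counts = [term_counts(p) for p in paragraphs]
    return [cosine(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]


def boundaries(paragraphs, threshold=0.1):
    """Indices i where a boundary is hypothesized after paragraph i.

    The fixed threshold is an assumption made for this sketch only.
    """
    return [i for i, s in enumerate(gap_scores(paragraphs)) if s < threshold]


# Toy paragraphs (invented for illustration, not taken from Stargazers).
paragraphs = [
    "The moon has a chemical composition unlike the earth.",
    "The chemical composition of the moon suggests the moon formed near the earth.",
    "Binary star systems make stable orbits unlikely.",
    "Stable orbits in binary star systems are rare.",
]

print(boundaries(paragraphs))  # [1]: a boundary after the second paragraph
```

In this toy input the first two paragraphs share vocabulary about the moon and the last two share vocabulary about binary stars, so the only gap with no shared words is the one between them. Real expository text would need at least a stop list and a less brittle boundary criterion than a fixed threshold.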