Scientific report: "Text Segmentation with Multiple Surface Linguistic Cues"

MOCHIZUKI Hajime, HONDA Takeo, and OKUMURA Manabu
School of Information Science, Japan Advanced Institute of Science and Technology
Tatsunokuchi, Ishikawa 923-1292, Japan
Tel: +81-761-51-1216  Fax: +81-761-51-1149
motizuki honda oku j

Abstract

In general, a certain range of sentences in a text is widely assumed to form a coherent unit, which is called a discourse segment. Identifying the segment boundaries is a first step toward recognizing the structure of a text. In this paper, we describe a method for identifying segment boundaries in Japanese text with the aid of multiple surface linguistic cues, although our experiments are admittedly small in scale. We also present a method for automatically training the weights of multiple linguistic cues without overfitting.

1 Introduction

A text consists of multiple sentences that have semantic relations with each other. They form semantic units, which are usually called discourse segments. The global discourse structure of a text can be constructed by relating the discourse segments to each other. Therefore, identifying segment boundaries in a text is considered a first step in constructing the discourse structure (Grosz and Sidner, 1986). The use of surface linguistic cues in a text to identify segment boundaries has been extensively researched, since it is impractical to assume the use of world knowledge for discourse analysis of real texts.
Among a variety of surface cues, lexical cohesion (Halliday and Hasan, 1976), the surface relationship among words that are semantically similar, has recently received much attention and has been widely used for text segmentation (Morris and Hirst, 1991; Kozima, 1993; Hearst, 1994; Okumura and Honda, 1994). Okumura and Honda (1994) found that lexical cohesion information alone is not enough and that incorporating other surface information may improve accuracy. In this paper we describe a method for identifying .