A Statistical Model for Domain-Independent Text Segmentation

Masao Utiyama and Hitoshi Isahara
Communications Research Laboratory
2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0289, Japan
mutiyama@crl.go.jp and isahara@crl.go.jp

Abstract

We propose a statistical method that finds the maximum-probability segmentation of a given text. This method does not require training data because it estimates probabilities from the given text. Therefore, it can be applied to any text in any domain. An experiment showed that the method is more accurate than or at least as accurate as a state-of-the-art text segmentation system.

1 Introduction

Documents usually include various topics. Identifying and isolating topics by dividing documents, which is called text segmentation, is important for many natural language processing tasks, including information retrieval (Hearst and Plaunt, 1993; Salton et al., 1996) and summarization (Kan et al., 1998; Nakao, 2000). In information retrieval, users are often interested in particular topics (parts) of retrieved documents instead of the documents themselves. To meet such needs, documents should be segmented into coherent topics. Summarization is often used for a long document that includes multiple topics. A summary of such a document can be composed of summaries of the component topics. Identification of topics is the task of text segmentation.

A lot of research has been done on text segmentation (Kozima, 1993; Hearst, 1994; Okumura and Honda, 1994; Salton et al., 1996; Yaari, 1997; Kan et al., 1998; Choi, 2000; Nakao, 2000).
A major characteristic of the methods used in this research is that they do not require training data to segment given texts. Hearst (1994), for example, used only the similarity of word distributions in a given text to segment the text. Consequently, these methods can be applied to any text in any domain, even if training data do not exist. This property is important when text segmentation is applied to information retrieval or summarization because both .