Thematic segmentation of texts: two methods for two kinds of texts

Olivier FERRET, LIMSI-CNRS, Bât. 508 - BP 133, F-91403 Orsay Cedex, France, ferret@
Brigitte GRAU, LIMSI-CNRS, Bât. 508 - BP 133, F-91403 Orsay Cedex, France, grau@
Nicolas MASSON, LIMSI-CNRS, Bât. 508 - BP 133, F-91403 Orsay Cedex, France, masson@

Abstract

To segment texts into thematic units, we present here how a basic principle relying on word distribution can be applied to different kinds of texts. We start from an existing method well suited to scientific texts, and we propose its adaptation to other kinds of texts by using semantic links between words. These relations are found in a lexical network automatically built from a large corpus. We compare the results of the two methods and give criteria for choosing the more suitable one according to text characteristics.

1. Introduction

Text segmentation according to a topical criterion is a useful process in many applications, such as text summarization or information extraction. Approaches that address this problem can be classified into knowledge-based approaches and word-based approaches. Knowledge-based systems such as Grosz and Sidner's (1986) require an extensive manual knowledge engineering effort to create the knowledge base (semantic network and/or frames), and this is only possible in very limited and well-known domains. To overcome this limitation and to process large amounts of text, word-based approaches have been developed.
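The word-distribution principle behind word-based approaches can be illustrated with a minimal sketch. It is not the method of this paper nor that of any cited system: it simply compares the vocabulary on either side of each sentence gap and proposes a boundary where the two sides share almost no words. The function names, the tiny stop-word list, the window size and the threshold are all assumptions chosen for the illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "on", "after", "of", "and", "in", "to"}


def bag_of_words(sentences):
    """Count lowercase content words over a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(sentences, window=2):
    """Similarity of the `window` sentences before and after each gap."""
    sims = []
    for gap in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, gap - window):gap])
        right = bag_of_words(sentences[gap:gap + window])
        sims.append(cosine(left, right))
    return sims


def boundaries(sentences, window=2, threshold=0.1):
    """Indices i such that a new segment is proposed to start at sentences[i]."""
    sims = gap_similarities(sentences, window)
    # sims[k] describes the gap just before sentence k + 1.
    return [k + 1 for k, sim in enumerate(sims) if sim < threshold]
```

For example, on the four sentences "The parser builds a syntax tree.", "The syntax tree feeds the parser.", "Rain fell on the valley." and "The valley flooded after the rain.", `boundaries` proposes a single boundary before the third sentence. A sketch like this depends entirely on word repetition, which is why it suits texts with a specific, recurring vocabulary.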
Hearst (1997) and Masson (1995) make use of the word distribution in a text to find a thematic segmentation. These works are well suited to technical or scientific texts, which are characterized by a specific vocabulary. To process narrative or expository texts such as newspaper articles, Kozima's (1993) and Morris and Hirst's (1991) approaches are based on lexical cohesion computed from a lexical network. These methods depend on the presence of the text vocabulary inside their network. So, to avoid any restriction about domains in such kinds of ...