Scientific report: "A Noisy-Channel Model for Document Compression"
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 449-456.

A Noisy-Channel Model for Document Compression

Hal Daumé III and Daniel Marcu
Information Sciences Institute, University of Southern California
4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292
hdaume marcu @

Abstract

We present a document compression system that uses a hierarchical noisy-channel model of text production. Our compression system first automatically derives the syntactic structure of each sentence and the overall discourse structure of the text given as input. The system then uses a statistical hierarchical model of text production in order to drop non-important syntactic and discourse constituents so as to generate coherent, grammatical document compressions of arbitrary length. The system outperforms both a baseline and a sentence-based compression system that operates by simplifying sequentially all sentences in a text. Our results support the claim that discourse knowledge plays an important role in document summarization.
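The general noisy-channel idea named in the abstract can be illustrated with a small sketch. In a noisy-channel formulation, the observed long text d is treated as a "noisy" expansion of a short text c, and decoding picks the c that maximizes P(c) * P(d | c). The Python below is only a toy, flat illustration of that decision rule and is not the hierarchical syntax and discourse model of the paper: the units, the probability tables SOURCE_P and INSERT_P, and the function names are all hypothetical and invented for this example.

```python
import math
from itertools import combinations

# Hypothetical toy probabilities, invented purely for illustration.
# SOURCE_P[u]: how likely unit u is to appear in a short (compressed) text.
# INSERT_P[u]: how likely the channel is to have added unit u during expansion.
SOURCE_P = {
    "the system": 0.5,
    "which was built last year": 0.05,
    "compresses documents": 0.4,
}
INSERT_P = {
    "the system": 0.01,
    "which was built last year": 0.3,
    "compresses documents": 0.02,
}


def log_source(kept):
    """log P(c): a toy unigram source model over the kept units."""
    return sum(math.log(SOURCE_P[u]) for u in kept)


def log_channel(original, kept_idx):
    """log P(d | c): every unit of d that is absent from c is treated
    as having been inserted by the channel."""
    return sum(
        math.log(INSERT_P[u])
        for i, u in enumerate(original)
        if i not in kept_idx
    )


def best_compression(original):
    """Exhaustively score every non-empty, order-preserving subset of
    units and return the one maximizing log P(c) + log P(d | c)."""
    best, best_score = None, -math.inf
    n = len(original)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            kept = [original[i] for i in idx]
            score = log_source(kept) + log_channel(original, set(idx))
            if score > best_score:
                best, best_score = kept, score
    return best, best_score


document = ["the system", "which was built last year", "compresses documents"]
compressed, score = best_compression(document)
print(compressed)
```

With these invented numbers, the relative clause is cheaper to explain as a channel insertion than as part of the source text, so it is the unit that gets dropped. Exhaustive search is used only because the example is tiny; it would not scale to real documents.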
1 Introduction

Single document summarization systems proposed to date fall within one of the following three classes:

Extractive summarizers simply select and present to the user the most important sentences in a text (see Mani and Maybury 1999; Marcu 2000; Mani 2001 for comprehensive overviews of the methods and algorithms used to accomplish this).

Headline generators are noisy-channel probabilistic systems that are trained on large corpora of (Headline, Text) pairs (Banko et al. 2000; Berger and Mittal 2000). These systems produce short sequences of words that are indicative of the content of the text given as input.

Sentence simplification systems (Chandrasekar et al. 1996; Mahesh 1997; Carroll et al. 1998; Grefenstette 1998; Jing 2000; Knight and Marcu 2000) are capable of compressing long sentences by deleting unimportant words and phrases.

Extraction-based summarizers often produce outputs that contain ...