tailieunhanh - Scientific report: "Probabilistic Document Modeling for Syntax Removal in Text Summarization"

William M. Darling, School of Computer Science, University of Guelph, 50 Stone Rd E, Guelph, ON N1G 2W1, Canada, wdarling@
Fei Song, School of Computer Science, University of Guelph, 50 Stone Rd E, Guelph, ON N1G 2W1, Canada, fsong@

Abstract

Statistical approaches to automatic text summarization based on term frequency continue to perform on par with more complex summarization methods. To compute useful frequency statistics, however, the semantically important words must be separated from the low-content function words. The standard approach of using an a priori stopword list tends to result in both undercoverage, where syntactical words are seen as semantically relevant, and overcoverage, where words related to content are ignored. We present a generative probabilistic modeling approach to building content distributions for use with statistical multi-document summarization, where the syntax words are learned directly from the data with a Hidden Markov Model and are thereby deemphasized in the term frequency statistics. This approach is compared to both a stopword-list and a POS-tagging approach, and our method demonstrates improved coverage on the DUC 2006 and TAC 2010 datasets using the ROUGE metric.
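To make the stopword-list baseline described in the abstract concrete, the following is a minimal sketch of frequency-based sentence scoring. It is not the authors' Hidden Markov Model approach; it only illustrates the a priori stopword filtering that the paper compares against. The stopword list, the function names (`tokenize`, `content_distribution`, `score_sentence`, `summarize`), and the example sentences are all assumptions made for illustration.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stopword list (an assumption for illustration).
STOPWORDS = {"the", "a", "of", "to", "and", "in", "is", "on", "it"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def content_distribution(sentences, stopwords=STOPWORDS):
    """Estimate a unigram content distribution after dropping stopwords."""
    counts = Counter(
        tok for s in sentences for tok in tokenize(s) if tok not in stopwords
    )
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def score_sentence(sentence, dist, stopwords=STOPWORDS):
    """Average content probability of the non-stopword tokens in a sentence."""
    toks = [t for t in tokenize(sentence) if t not in stopwords]
    if not toks:
        return 0.0
    return sum(dist.get(t, 0.0) for t in toks) / len(toks)

def summarize(sentences, k=1):
    """Return the k highest-scoring sentences, kept in document order."""
    dist = content_distribution(sentences)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(sentences[i], dist),
        reverse=True,
    )[:k]
    return [sentences[i] for i in sorted(ranked)]

# Invented example sentences, not taken from DUC or TAC data.
docs = [
    "The storm flooded the coastal town.",
    "However, officials said the storm damage is severe.",
    "It is on the news.",
]
print(summarize(docs, k=1))
```

In this sketch the word "however" survives filtering because it is missing from the fixed list, which is the kind of undercoverage the abstract mentions; adding more words to the list risks the opposite problem of discarding content words.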
1 Introduction

While the dominant problem in Information Retrieval in the first part of the century was finding relevant information within a datastream that is exponentially growing, the problem has arguably transitioned from finding what we are looking for to sifting through it. We can now be quite confident that search engines like Google will return several pages relevant to our queries, but rarely does one have time to go through the enormous amount of data that is supplied. Therefore, automatic text summarization, which aims at providing a shorter representation of the salient parts of a large amount of information, has been steadily growing in both importance and popularity over the last .
