tailieunhanh - Scientific report: "Fast and accurate query-based multi-document summarization"

FastSum: Fast and accurate query-based multi-document summarization

Frank Schilder and Ravikumar Kondadadi
Research & Development, Thomson Corp.
610 Opperman Drive, Eagan, MN 55123, USA

Abstract

We present a fast query-based multi-document summarizer called FastSum based solely on word-frequency features of clusters, documents and topics. Summary sentences are ranked by a regression SVM. The summarizer does not use any expensive NLP techniques such as parsing, tagging of names or even part of speech information. Still, the achieved accuracy is comparable to the best systems presented in recent academic competitions (Document Understanding Conference, DUC). Because of a detailed feature analysis using Least Angle Regression (LARS), FastSum can rely on a minimal set of features, leading to fast processing times: 1250 news documents in 60 seconds.

1 Introduction

In this paper we propose a simple method for effectively generating query-based multi-document summaries without any complex processing steps. It only involves sentence splitting, filtering candidate sentences and computing the word frequencies in the documents of a cluster, the topic description and the topic title. We use a machine learning technique called regression SVM, as proposed by Li et al. (2007). For the feature selection we use a new model selection technique called Least Angle Regression (LARS) (Efron et al., 2004).

Even though machine learning approaches dominated the field of summarization systems in recent DUC competitions, not much effort has been spent in finding simple but effective features. An exception is the SumBasic system, which achieves reasonable results with only one feature (word frequency in document clusters) (Nenkova and Vanderwende, 2005). Our approach goes beyond SumBasic by proposing an even more powerful feature that proves to be the best predictor in all three recent DUC corpora. In order to prove that our feature is more predictive than other features we provide a ...
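
The processing steps named in the introduction (sentence splitting, candidate filtering, and word-frequency computation over the cluster, the topic description and the topic title) can be illustrated with a minimal Python sketch. This is not the FastSum implementation: the function names, the naive regular-expression sentence splitter, the `min_words` filter and the exact feature definitions are all assumptions made for illustration, and the final score is an unweighted sum of the features standing in for the regression SVM that the paper uses for ranking.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (no tagging or parsing)."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_features(sentence, cluster_freq, title_words, desc_words):
    """Return three hypothetical word-frequency features for one sentence.

    cluster_score: mean relative cluster frequency of the sentence's words
    title_score:   fraction of words that occur in the topic title
    desc_score:    fraction of words that occur in the topic description
    """
    words = tokenize(sentence)
    if not words:
        return (0.0, 0.0, 0.0)
    n = len(words)
    total = sum(cluster_freq.values()) or 1  # avoid division by zero
    cluster_score = sum(cluster_freq[w] / total for w in words) / n
    title_score = sum(w in title_words for w in words) / n
    desc_score = sum(w in desc_words for w in words) / n
    return (cluster_score, title_score, desc_score)


def rank_sentences(documents, title, description, min_words=4):
    """Score every candidate sentence in the cluster and sort best-first.

    The score is a plain sum of the features (assumption); the paper
    instead learns the ranking with a regression SVM.
    """
    cluster_freq = Counter(w for doc in documents for w in tokenize(doc))
    title_words = set(tokenize(title))
    desc_words = set(tokenize(description))
    scored = []
    for doc in documents:
        for sent in split_sentences(doc):
            if len(tokenize(sent)) < min_words:  # filter short candidates
                continue
            feats = sentence_features(sent, cluster_freq, title_words, desc_words)
            scored.append((sum(feats), sent))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Calling `rank_sentences(docs, "river floods", "What damage did the floods cause?")` on a list of document strings returns `(score, sentence)` pairs, best first. Everything here reduces to counting and set membership, which is the kind of lightweight computation that makes the reported processing speed plausible.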
