Scientific report: "An Unsupervised Approach to Biography Production using Wikipedia"

Fadi Biadsy, Julia Hirschberg, and Elena Filatova
Department of Computer Science, Columbia University, New York, NY 10027, USA (fadi julia @)
InforSense LLC, Cambridge, MA 02141, USA (efilatova@)

Abstract

We describe an unsupervised approach to multi-document sentence-extraction based summarization for the task of producing biographies. We utilize Wikipedia to automatically construct a corpus of biographical sentences and TDT4 to construct a corpus of non-biographical sentences. We build a biographical-sentence classifier from these corpora and an SVM regression model for sentence ordering from the Wikipedia corpus. We evaluate our work on the DUC2004 evaluation data and with human judges. Overall, our system significantly outperforms all systems that participated in DUC2004, according to the ROUGE-L metric, and is preferred by human subjects.

1 Introduction

Producing biographies by hand is a labor-intensive task, generally done only for famous individuals. The process is particularly difficult when persons of interest are not well known and when information must be gathered from a wide variety of sources. We present an automatic, unsupervised, multi-document summarization (MDS) approach based on extractive techniques to producing biographies, answering the question "Who is X?" There is growing interest in automatic MDS in general, due in part to the explosion of multilingual and multimedia data available online.
The goal of MDS is to automatically produce a concise, well-organized, and fluent summary of a set of documents on the same topic. MDS strategies have been employed to produce both generic summaries and query-focused summaries. Due to the complexity of text generation, most summarization systems employ sentence-extraction techniques, in which the most relevant sentences from one or more documents are selected to represent the summary. This approach is guaranteed to produce grammatical .
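
To make the sentence-extraction idea concrete, the following is a minimal Python sketch of generic extractive selection over several documents. It is not the system described in this paper: the word-frequency relevance score is a simple stand-in assumed here for illustration (the paper instead relies on a biographical-sentence classifier), and the names split_sentences, tokenize, and extract_summary are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def extract_summary(documents, max_sentences=2):
    """Select the highest-scoring sentences across all input documents."""
    sentences = []
    for doc in documents:
        for s in split_sentences(doc):
            if s not in sentences:  # drop verbatim duplicates across documents
                sentences.append(s)

    # Relevance stand-in: mean frequency, over the unique sentences,
    # of the words a sentence contains.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(s):
        words = tokenize(s)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore first-seen order
    return [sentences[i] for i in chosen]
```

Because every output sentence is copied unchanged from an input document, the summary inherits the grammaticality of its sources; what the sketch does not address is how the selected sentences should be ordered or made to cohere.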