Scientific report: "Towards Robust Context-Sensitive Sentence Alignment for Monolingual Corpora"

Rani Nelken and Stuart M. Shieber
Division of Engineering and Applied Sciences, Harvard University
33 Oxford St., Cambridge, MA 02138
{nelken,shieber}@deas.harvard.edu

Abstract
Aligning sentences belonging to comparable monolingual corpora has been suggested as a first step towards training text rewriting algorithms, for tasks such as summarization or paraphrasing. We present here a new monolingual sentence alignment algorithm, combining a sentence-based TF*IDF score, turned into a probability distribution using logistic regression, with a global alignment dynamic programming algorithm. Our approach provides a simpler and more robust solution, achieving a substantial improvement in accuracy over existing systems.

1 Introduction

Sentence-aligned bilingual corpora are a crucial resource for training statistical machine translation systems. Several authors have suggested that large-scale aligned monolingual corpora could be similarly used to advance the performance of monolingual text-to-text rewriting systems, for tasks including summarization (Knight and Marcu, 2000; Jing, 2002) and paraphrasing (Barzilay and Elhadad, 2003; Quirk et al., 2004). Unlike bilingual corpora such as the Canadian Hansard corpus, which are relatively rare, it is now fairly easy to amass corpora of related monolingual documents.
For instance, with the advent of news aggregator services such as Google News, one can readily collect multiple news stories covering the same news item (Dolan et al., 2004). Utilizing such a resource requires aligning related documents at a finer level of resolution, identifying which sentences from one document align with which sentences from the other. Previous work has shown that aligning related monolingual documents is quite different from the well-studied multi-lingual alignment task. Whereas documents in a bilingual corpus are typically very closely aligned, monolingual corpora exhibit a much looser level of ...
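
To make the three ingredients named in the abstract concrete (a sentence-level TF*IDF score, a logistic mapping from score to probability, and a global alignment by dynamic programming), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the logistic coefficients a and b, the 0.5 threshold, and the restriction to one-to-one matches and skips are all hypothetical choices made for this example. In practice the coefficients would be fitted by logistic regression on labelled sentence pairs.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Build sparse TF*IDF vectors, treating each sentence as a 'document'."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    return [
        {w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_probability(sim, a=-3.0, b=8.0):
    """Logistic mapping from similarity to a match probability.

    a and b are made-up values for illustration only; they are not
    taken from the paper.
    """
    return 1.0 / (1.0 + math.exp(-(a + b * sim)))


def align(doc1, doc2, threshold=0.5):
    """Global monotone alignment allowing 1-1 matches and skips only.

    A match contributes (probability - threshold), a skip contributes 0,
    so a pair is aligned only when its probability exceeds the threshold.
    """
    vecs = tfidf_vectors(doc1 + doc2)  # IDF is computed over both documents
    v1, v2 = vecs[:len(doc1)], vecs[len(doc1):]
    n, m = len(v1), len(v2)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0:
                options.append((best[i - 1][j], "skip1"))
            if j > 0:
                options.append((best[i][j - 1], "skip2"))
            if i > 0 and j > 0:
                p = match_probability(cosine(v1[i - 1], v2[j - 1]))
                options.append((best[i - 1][j - 1] + p - threshold, "match"))
            best[i][j], back[i][j] = max(options)
    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip1":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

Calling align(doc1, doc2) on two lists of sentence strings returns a list of (index in doc1, index in doc2) pairs in document order. Sentences with no sufficiently probable counterpart are left unaligned.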