Minimum Cut Model for Spoken Lecture Segmentation
Igor Malioutov and Regina Barzilay
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
igorm regina @

Abstract

We consider the task of unsupervised lecture segmentation. We formalize segmentation as a graph-partitioning task that optimizes the normalized cut criterion. Our approach moves beyond localized comparisons and takes into account long-range cohesion dependencies. Our results demonstrate that global analysis improves segmentation accuracy and is robust in the presence of speech recognition errors.

1 Introduction

The development of computational models of text structure is a central concern in natural language processing. Text segmentation is an important instance of such work. The task is to partition a text into a linear sequence of topically coherent segments and thereby induce a content structure of the text. The applications of the derived representation are broad, encompassing information retrieval, question answering, and summarization. Not surprisingly, text segmentation has been extensively investigated over the last decade. Following the first unsupervised segmentation approach by Hearst (1994), most algorithms assume that variations in lexical distribution indicate topic changes. When documents exhibit sharp variations in lexical distribution, these algorithms are likely to detect segment boundaries accurately. For example, most algorithms achieve high performance on synthetic collections generated by concatenating random text blocks (Choi, 2000).
The difficulty arises, however, when transitions between topics are smooth and distributional variations are subtle. This is evident in the performance of existing unsupervised algorithms on less structured datasets, such as spoken meeting transcripts (Galley et al., 2003). Therefore, a more refined analysis of lexical distribution is needed. Our work addresses this challenge by casting text segmentation in a graph-partitioning framework.
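To make the graph-partitioning formulation concrete, the sketch below builds a sentence-similarity graph from word-count vectors and scores candidate topic boundaries with the standard two-way normalized cut criterion (Shi and Malik, 2000), which the abstract names as the objective. This is a simplified illustration, not the paper's algorithm: the function names and the toy data are invented, and the actual model handles multi-way partitions under a linearity constraint.

```python
# Minimal sketch of normalized-cut scoring on a sentence-similarity graph.
# All names and the toy data are illustrative assumptions, not from the paper.
import numpy as np

def cosine_similarity_matrix(counts):
    """Pairwise cosine similarity between sentence word-count vectors."""
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    unit = counts / np.where(norms == 0, 1, norms)
    return unit @ unit.T

def normalized_cut(sim, boundary):
    """Ncut for splitting sentences [0..boundary) vs. [boundary..n)."""
    a = slice(0, boundary)
    b = slice(boundary, sim.shape[0])
    cut = sim[a, b].sum()          # total edge weight crossing the boundary
    assoc_a = sim[a, :].sum()      # weight from segment A to all sentences
    assoc_b = sim[b, :].sum()      # weight from segment B to all sentences
    return cut / assoc_a + cut / assoc_b

# Toy "lecture": two topics with disjoint vocabularies (4 sentences, 4 words).
counts = np.array([
    [2, 1, 0, 0],  # topic 1
    [1, 2, 0, 0],
    [0, 0, 2, 1],  # topic 2
    [0, 0, 1, 2],
])
sim = cosine_similarity_matrix(counts)
scores = {k: normalized_cut(sim, k) for k in range(1, 4)}
best = min(scores, key=scores.get)  # boundary with the lowest normalized cut
```

Because the normalized cut divides the crossing weight by each segment's total association, it penalizes both strong cross-boundary cohesion and degenerate tiny segments, which is what lets the criterion capture long-range dependencies rather than only local sentence-to-sentence comparisons.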