tailieunhanh - Scientific report: "Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure"

Minwoo Jeong and Ivan Titov
Saarland University, Saarbrücken, Germany
titov @

Abstract

Documents often have inherently parallel structure: they may consist of a text and commentaries, or an abstract and a body, or parts presenting alternative views on the same problem. Revealing relations between the parts by jointly segmenting them and predicting links between the segments would help to visualize such documents and construct friendlier user interfaces. To address this problem, we propose an unsupervised Bayesian model for joint discourse segmentation and alignment. We apply our method to the "English as a second language" podcast dataset, where each episode is composed of two parallel parts: a story and an explanatory lecture. The predicted topical links uncover hidden relations between the stories and the lectures. In this domain, our method achieves competitive results, rivaling those of a previously proposed supervised technique.

1 Introduction

Many documents consist of parts exhibiting a high degree of parallelism, e.g., the abstract and body of academic publications, summaries and detailed news stories, etc. This is especially common with the emergence of Web technologies: many texts on the web are now accompanied by comments and discussions. Segmentation of these parallel parts into coherent fragments and discovery of hidden relations between them would facilitate the development of better user interfaces and improve the performance of summarization and information retrieval systems.
Discourse segmentation of documents composed of parallel parts is a novel and challenging problem, as previous research has mostly focused on the linear segmentation of isolated texts (e.g., Hearst, 1994). The most straightforward approach would be to use a pipeline strategy, where an existing segmentation algorithm finds discourse boundaries of each part independently, and then the .
