Scientific report: "Generating Usable Formats for Metadata and Annotations in a Large Meeting Corpus"
Andrei Popescu-Belis and Paula Estrella
ISSCO / TIM / ETI, University of Geneva
40 bd. du Pont-d'Arve, 1211 Geneva 4, Switzerland
{andrei.popescu-belis, paula.estrella}@issco.unige.ch

Abstract

The AMI Meeting Corpus is now publicly available, including manual annotation files generated in the NXT XML format, but lacking explicit metadata for the 171 meetings of the corpus. To increase the usability of this important resource, a representation format based on relational databases is proposed, which maximizes informativeness, simplicity and reusability of the metadata and annotations. The annotation files are converted to a tabular format using an easily adaptable XSLT-based mechanism, and their consistency is verified in the process. Metadata files are generated directly in the IMDI XML format from implicit information, and converted to tabular format using a similar procedure. The results and tools will be freely available with the AMI Corpus. Sharing the metadata using the Open Archives network will contribute to increasing the visibility of the AMI Corpus.

1 Introduction

The AMI Meeting Corpus (Carletta et al., 2006) is one of the largest and most extensively annotated data sets of multimodal recordings of human interaction. The corpus contains 171 meetings, in English, for a total duration of ca. 100 hours. The meetings either follow the remote control design scenario or are naturally occurring meetings. In both cases, they have between 3 and 5 participants.
Perhaps the most valuable resources in this corpus are the high-quality annotations, which can be used to train and test NLP tools. The existing annotation dimensions include, besides transcripts, forced temporal alignment, named entities, topic segmentation, dialogue acts, abstractive and extractive summaries, as well as hand and head movement and posture. However, these dimensions, as well as the implicit metadata for the corpus, are difficult to exploit.
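To make the idea of a tabular, database-friendly representation concrete, the following is a minimal sketch of flattening an XML annotation file into rows, checking consistency along the way, and loading the rows into a relational table. It is an illustration under stated assumptions, not the mechanism described in the paper: the paper relies on XSLT, whereas this sketch swaps in Python's standard xml.etree.ElementTree and sqlite3 modules, and the element names, attributes, table schema and sample data are all hypothetical rather than taken from the NXT format or the AMI Corpus.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one annotation file. Real NXT files
# use their own element names, namespaces and cross-file pointers.
SAMPLE_XML = """
<annotations meeting="M001">
  <segment id="s1" speaker="A" start="0.00" end="1.25" label="inform">okay let us start</segment>
  <segment id="s2" speaker="B" start="1.30" end="2.10" label="assess">sounds good</segment>
  <segment id="s3" speaker="A" start="2.00" end="1.50" label="inform">bad timing</segment>
</annotations>
"""


def xml_to_rows(xml_text):
    """Flatten the XML into (rows, problems); rows are tuples ready for SQL."""
    root = ET.fromstring(xml_text)
    meeting = root.get("meeting")
    rows, problems = [], []
    for seg in root.iter("segment"):
        start, end = float(seg.get("start")), float(seg.get("end"))
        # Simple consistency check performed during conversion:
        # a segment must not end before it starts.
        if end < start:
            problems.append(seg.get("id"))
            continue
        rows.append((meeting, seg.get("id"), seg.get("speaker"),
                     start, end, seg.get("label"), (seg.text or "").strip()))
    return rows, problems


def load_rows(conn, rows):
    """Create the (assumed) table if needed and insert the flattened rows."""
    conn.execute("""CREATE TABLE IF NOT EXISTS segment (
        meeting TEXT, id TEXT, speaker TEXT,
        start_time REAL, end_time REAL, label TEXT, text TEXT,
        PRIMARY KEY (meeting, id))""")
    conn.executemany("INSERT INTO segment VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


rows, problems = xml_to_rows(SAMPLE_XML)
conn = sqlite3.connect(":memory:")
load_rows(conn, rows)

# Once the data is tabular, ordinary SQL is enough to query it.
per_speaker = conn.execute(
    "SELECT speaker, COUNT(*) FROM segment GROUP BY speaker ORDER BY speaker"
).fetchall()
print(per_speaker)  # [('A', 1), ('B', 1)]
print(problems)     # ['s3']
```

The point of the sketch is the design choice rather than the details: once annotations sit in flat tables keyed by meeting and segment identifier, questions that would otherwise require traversing XML structures reduce to plain SQL queries.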