Scientific report: "The Manually Annotated Sub-Corpus: A Community Resource For and By the People"

Nancy Ide, Department of Computer Science, Vassar College, Poughkeepsie, NY, USA, ide@
Christiane Fellbaum, Princeton University, Princeton, New Jersey, USA, fellbaum@
Collin Baker, International Computer Science Institute, Berkeley, California, USA, collinb@
Rebecca Passonneau, Columbia University, New York, New York, USA, becky@

Abstract

The Manually Annotated Sub-Corpus (MASC) project provides data and annotations to serve as the base for a community-wide annotation effort of a subset of the American National Corpus. The MASC infrastructure enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or ported to any of a variety of other formats. MASC includes data from a much wider variety of genres than existing multiply-annotated corpora of English, and the project is committed to a fully open model of distribution, without restriction, for all data and annotations produced or contributed. As such, MASC is the first large-scale, open, community-based effort to create much needed language resources for NLP. This paper describes the MASC project, its corpus and annotations, and serves as a call for contributions of data and annotations from the language processing community.

1 Introduction

The need for corpora annotated for multiple phenomena across a variety of linguistic layers is keenly recognized in the computational linguistics community. Several multiply-annotated corpora exist, especially for Western European languages and for spoken data, but, interestingly, broad-based English language corpora with robust annotation for diverse linguistic phenomena are relatively rare. The most widely-used corpus of English, the British National Corpus, contains only part-of-speech annotation; and although it contains a wider range of annotation types, the fifteen million word Open American National Corpus annotations ...
