Scientific report: "Parsing, Projecting & Prototypes: Repurposing Linguistic Data on the Web"
William D. Lewis
Microsoft Research, Redmond, WA 98052
wilewis@

Fei Xia
University of Washington, Seattle, WA 98195
fxia@

1 Introduction

Until very recently, most NLP tasks (e.g., parsing, tagging) have been confined to a very limited number of languages, the so-called majority languages. Now, as the field moves into the era of developing tools for Resource Poor Languages (RPLs), a category that covers the vast majority of the world's 7,000 languages, the discipline is confronted not only with the algorithmic challenges of limited data, but also with the sheer difficulty of locating data in the first place. In this demo, we present a resource which taps the large body of linguistically annotated data on the Web, data which can be repurposed for NLP tasks.

Because the field of linguistics has as its mandate the study of human language (in fact, the study of all human languages) and has wholeheartedly embraced the Web as a means for disseminating linguistic knowledge, a large quantity of analyzed language data can be found on the Web. In many cases, the data is richly annotated and exists for many languages for which there would otherwise be very limited annotated data.
The resource, the Online Database of INterlinear text (ODIN), makes this data available and provides additional annotation and structure, making the resource useful to the Computational Linguistic audience. In this paper, after a brief discussion of the previous work on ODIN, we report our recent work on extending ODIN by applying machine learning methods to the task of data extraction and language identification, and on using ODIN to discover linguistic knowledge. Then we outline a plan for the demo presentation.

2 Background and Previous Work on ODIN

ODIN is a collection of Interlinear Glossed Text (IGT) harvested from scholarly documents. In this section, we describe the original ODIN system (Lewis, 2006) and the IGT .