Scientific report: "A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality"

Sarah Alkuhlani and Nizar Habash
Center for Computational Learning Systems, Columbia University
{salkuhlani,habash}@ccls.columbia.edu

Abstract

We present an enriched version of the Penn Arabic Treebank (Maamouri et al., 2004), where latent features necessary for modeling morpho-syntactic agreement in Arabic are manually annotated. We describe our process for efficient annotation, and present the first quantitative analysis of Arabic morpho-syntactic phenomena.

1 Introduction

Arabic morphology is complex, partly because of its richness, and partly because of its complex morpho-syntactic agreement rules, which depend on features not necessarily expressed in word forms, such as lexical rationality and functional gender and number. In this paper, we present an enriched version of the Penn Arabic Treebank (PATB, part 3) (Maamouri et al., 2004) that we manually annotated for these features.¹ We describe a process for how to do the annotation efficiently, and furthermore present the first quantitative analysis of morpho-syntactic phenomena in Arabic.

This resource is important for building computational models of Arabic morphology and syntax that account for morpho-syntactic agreement patterns. It has already been used to demonstrate added value for Arabic dependency parsing (Marton et al., 2011).

This paper is structured as follows: Sections 2 and 3 present relevant linguistic facts and related work, respectively. Section 4 describes our annotation process, and Section 5 presents an analysis of the annotated corpus.

¹ The annotations are publicly available for research purposes. Please contact authors. The PATB must be acquired through the Linguistic Data Consortium (LDC): http://www.ldc.upenn.edu.

2 Linguistic Facts

Arabic has a rich and complex morphology. In addition to being both templatic (root/pattern) and concatenative (stems/affixes/clitics), Arabic's optional diacritics add to the degree of word ambiguity.