Scientific report: "The Columbia Arabic Treebank"

CATiB: The Columbia Arabic Treebank

Nizar Habash and Ryan M. Roth
Center for Computational Learning Systems, Columbia University, New York, USA
habash ryanr @

Abstract

The Columbia Arabic Treebank (CATiB) is a database of syntactic analyses of Arabic sentences. CATiB contrasts with previous approaches to Arabic treebanking in its emphasis on speed with some constraints on linguistic richness. Two basic ideas inspire the CATiB approach: avoiding annotation of redundant information, and using representations and terminology inspired by traditional Arabic syntax. We describe CATiB's representation and annotation procedure, and report on interannotator agreement and speed.

1 Introduction and Motivation

Treebanks are collections of manually annotated syntactic analyses of sentences. They are primarily intended for building models for statistical parsing; however, they are often enriched for general natural language processing purposes. For Arabic, two important treebanking efforts exist: the Penn Arabic Treebank (PATB) (Maamouri et al., 2004) and the Prague Arabic Dependency Treebank (PADT) (Smrž and Hajič, 2006). In addition to syntactic annotations, both resources are annotated with rich morphological and semantic information such as full part-of-speech (POS) tags, lemmas, semantic roles, and diacritizations.
This allows these treebanks to be used for training a variety of applications other than parsing, such as tokenization, diacritization, POS tagging, morphological disambiguation, base phrase chunking, and semantic role labeling.

In this paper, we describe a new Arabic treebanking effort, the Columbia Arabic Treebank (CATiB).¹ CATiB is motivated by the following three observations. First, as far as Arabic parsing research is concerned, much of the rich non-syntactic annotation is not used. For example, PATB has over 400 tags, but they are typically reduced to around 36 tags in training and testing parsers (Kulick et

¹ This work was supported by Defense Advanced Research Projects Agency Contract No. .