A Single Generative Model for Joint Morphological Segmentation and Syntactic Parsing

Yoav Goldberg
Ben Gurion University of the Negev
Department of Computer Science
POB 653 Be'er Sheva, 84105, Israel
yoavg@

Reut Tsarfaty
Institute for Logic, Language and Computation
University of Amsterdam
Plantage Muidergracht 24, Amsterdam, NL
rtsarfat@

Abstract

Morphological processes in Semitic languages deliver space-delimited words which introduce multiple, distinct, syntactic units into the structure of the input sentence. These words are in turn highly ambiguous, breaking the assumption underlying most parsers that the yield of a tree for a given sentence is known in advance. Here we propose a single joint model for performing both morphological segmentation and syntactic disambiguation which bypasses the associated circularity. Using a treebank grammar, a data-driven lexicon, and a linguistically motivated unknown-tokens handling technique, our model outperforms previous pipelined, integrated or factorized systems for Hebrew morphological and syntactic processing, yielding an error reduction of 12% over the best published results so far.

1 Introduction

Current state-of-the-art broad-coverage parsers assume a direct correspondence between the lexical items ingrained in the proposed syntactic analyses (the yields of syntactic parse-trees) and the space-delimited tokens (henceforth, "tokens") that constitute the unanalyzed surface forms (utterances). In Semitic languages the situation is very different. In Modern Hebrew (Hebrew), a Semitic language with very rich morphology, particles marking conjunctions, prepositions, complementizers and relativizers are bound elements prefixed to the word (Glinert, 1989). The Hebrew token bcl [1], for example, stands for the complete prepositional phrase "in the shadow". [Footnote 1: We adopt here the transliteration of Sima'an et al. (2001).] This token may further embed into a larger utterance, bcl hneim, literally "in-the-shadow the-pleasant", meaning .