Scientific report: "Creating a CCGbank and a wide-coverage CCG lexicon for German"

Julia Hockenmaier
Institute for Research in Cognitive Science, University of Pennsylvania, Philadelphia, PA 19104, USA
juliahr@

Abstract

We present an algorithm which creates a German CCGbank by translating the syntax graphs in the German Tiger corpus into CCG derivation trees. The resulting corpus contains 46,628 derivations, covering 95% of all complete sentences in Tiger. Lexicons extracted from this corpus contain correct lexical entries for 94% of all known tokens in unseen text.

1 Introduction

A number of wide-coverage TAG, CCG, LFG and HPSG grammars (Xia, 1999; Chen et al., 2005; Hockenmaier and Steedman, 2002a; O'Donovan et al., 2005; Miyao et al., 2004) have been extracted from the Penn Treebank (Marcus et al., 1993), and have enabled the creation of wide-coverage parsers for English which recover local and non-local dependencies that approximate the underlying predicate-argument structure (Hockenmaier and Steedman, 2002b; Clark and Curran, 2004; Miyao and Tsujii, 2005; Shen and Joshi, 2005). However, many corpora (Böhmová et al., 2003; Skut et al., 1997; Brants et al., 2002) use dependency graphs or other representations, and the extraction algorithms that have been developed for Penn Treebank style corpora may not be immediately applicable to these representations. As a consequence, research on statistical parsing with deep grammars has largely been confined to English.
Free word order languages typically pose greater challenges for syntactic theories (Rambow, 1994), and the richer inflectional morphology of these languages creates additional problems both for the coverage of lexicalized formalisms such as CCG or TAG and for the usefulness of dependency counts extracted from the training data. On the other hand, formalisms such as CCG and TAG are particularly suited to capture the crossing dependencies that arise in languages such as Dutch or German, and by choosing an appropriate linguistic representation, some of these .