tailieunhanh - Báo cáo khoa học: "Parsing with Treebank Grammars: Empirical Bounds, Theoretical Models, and the Structure of the Penn Treebank"

This paper presents empirical studies and closely corresponding theoretical models of the performance of a chart parser exhaustively parsing the Penn Treebank with the Treebank’s own CFG grammar. We show how performance is dramatically affected by rule representation and tree transformations, but little by top-down vs. bottom-up strategies. | Parsing with Treebank Grammars Empirical Bounds Theoretical Models and the Structure of the Penn Treebank Dan Klein and Christopher D. Manning Computer Science Department Stanford University Stanford CA 94305-9040 Klein manning @ Abstract This paper presents empirical studies and closely corresponding theoretical models of the performance of a chart parser exhaustively parsing the Penn Treebank with the Treebank s own CFG grammar. We show how performance is dramatically affected by rule representation and tree transformations but little by top-down vs. bottom-up strategies. We discuss grammatical saturation including analysis of the strongly connected components of the phrasal nonterminals in the Treebank and model how as sentence length increases the effective grammar rule size increases as regions of the grammar are unlocked yielding super-cubic observed time behavior in some configurations. 1 Introduction This paper originated from examining the empirical performance of an exhaustive active chart parser using an untransformed treebank grammar over the Penn Treebank. Our initial experiments yielded the surprising result that for many configurations empirical parsing speed was super-cubic in the sentence length. This led us to look more closely at the structure of the treebank grammar. The resulting analysis builds on the presentation of Charniak 1996 but extends it by elucidating the structure of non-terminal interrelationships in the Penn Treebank grammar. On the basis of these studies we build simple theoretical models which closely predict observed parser performance and in particular explain the originally observed super-cubic behavior. We used treebank grammars induced directly from the local trees of the entire WSJ section of the Penn Treebank Marcus et al. 1993 release 3 . For each length and parameter setting 25 sentences evenly distributed through the treebank were parsed. Since we were parsing sentences from among those from which our .