Scientific report: "Ad Hoc Treebank Structures"

Markus Dickinson
Department of Linguistics
Indiana University
md7@

Abstract

We outline the problem of ad hoc rules in treebanks, rules used for specific constructions in one data set and unlikely to be used again. These include ungeneralizable rules, erroneous rules, rules for ungrammatical text, and rules which are not consistent with the rest of the annotation scheme. Based on a simple notion of rule equivalence and on the idea of finding rules unlike any others, we develop two methods for detecting ad hoc rules in flat treebanks and show they are successful in detecting such rules. This is done by examining evidence across the grammar and without making any reference to context.

1 Introduction and Motivation

When extracting rules from constituency-based treebanks employing flat structures, grammars often limit the set of rules (e.g., Charniak, 1996), due to the large number of rules (Krotov et al., 1998) and leaky rules that can lead to mis-analysis (Foth and Menzel, 2006). Although frequency-based criteria are often used, these are not without problems, because low-frequency rules can be valid and potentially useful rules (see, e.g., Daelemans et al., 1999) and high-frequency rules can be erroneous (see, e.g., Dickinson and Meurers, 2005).
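To make the frequency-based filtering discussed above concrete, the following is a minimal sketch of extracting rules from a flat bracketed treebank and applying a frequency cutoff. It is an illustration only, not the method developed in this paper: the toy treebank, the function names (`parse`, `rules`, `count_rules`, `split_by_frequency`), and the `min_count` threshold are all assumptions introduced here.

```python
from collections import Counter

# Hypothetical toy treebank in bracketed notation (not taken from the paper).
TOY_TREEBANK = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT a) (NN cat)) (VP (VBD slept)))",
    "(S (NP (DT the) (JJ old) (NN dog)) (VP (VBD slept)))",
]


def parse(s):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()


def rules(tree):
    """Yield (mother, daughters) for each phrasal local tree.

    Preterminal nodes (POS tag over a word) have no tuple children
    and are skipped.
    """
    label, children = tree
    daughters = tuple(c[0] for c in children if isinstance(c, tuple))
    if daughters:
        yield (label, daughters)
    for child in children:
        if isinstance(child, tuple):
            yield from rules(child)


def count_rules(treebank):
    """Count how often each rule occurs across all trees."""
    counts = Counter()
    for line in treebank:
        counts.update(rules(parse(line)))
    return counts


def split_by_frequency(counts, min_count):
    """Split rules into those kept and those dropped by a frequency cutoff."""
    kept = {r: n for r, n in counts.items() if n >= min_count}
    dropped = {r: n for r, n in counts.items() if n < min_count}
    return kept, dropped


counts = count_rules(TOY_TREEBANK)
kept, dropped = split_by_frequency(counts, min_count=2)
```

With this toy data, a cutoff of two discards the rule NP → DT JJ NN, which is a perfectly ordinary noun phrase rule. This is the weakness noted above: frequency alone cannot tell a rare but valid rule from an ad hoc one.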
A key issue in determining the rule set is rule generalizability: will these rules be needed to analyze new data? This issue is of even more importance when considering the task of porting a parser trained on one genre to another genre (e.g., Gildea, 2001). Infrequent rules in one genre may be quite frequent in another (Sekine, 1997), and their frequency may be unrelated to their usefulness for parsing (Foth and Menzel, 2006). Thus, we need to carefully consider the applicability of rules in a treebank to new text. Specifically, we need to examine ad hoc rules, rules used for particular constructions specific to one data set and unlikely to be used on new data. This is why low-frequency rules often do not extend to new data if