Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach

Eric Brill
Department of Computer and Information Science
University of Pennsylvania
brill@

Abstract

In this paper we describe a new technique for parsing free text: a transformational grammar¹ is automatically learned that is capable of accurately parsing text into binary-branching syntactic trees with nonterminals unlabelled. The algorithm works by beginning in a very naive state of knowledge about phrase structure. By repeatedly comparing the results of bracketing in the current state to proper bracketing provided in the training corpus, the system learns a set of simple structural transformations that can be applied to reduce error. After describing the algorithm, we present results and compare these results to other recent results in automatic grammar induction.

INTRODUCTION

There has been a great deal of interest of late in the automatic induction of natural language grammar. Given the difficulty inherent in manually building a robust parser, along with the availability of large amounts of training material, automatic grammar induction seems like a path worth pursuing. A number of systems have been built that can be trained automatically to bracket text into syntactic constituents.
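The error-driven loop outlined in the abstract (start from a naive bracketing, compare it with the training bracketing, keep whichever transformation reduces error) can be illustrated with a small Python sketch. This is not the paper's implementation. The right-linear starting structure, the crossing-bracket error count, the greedy selection, and the toy transformation `group_first_two` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) over word positions, end exclusive
Transform = Callable[[Set[Span], int], Set[Span]]


def right_linear_spans(n: int) -> Set[Span]:
    """Assumed naive start: ( w0 ( w1 ( w2 ... ) ) ), stored as multi-word spans."""
    return {(i, n) for i in range(n - 1)}


def crossing(a: Span, b: Span) -> bool:
    """True if the two spans overlap without one containing the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def crossing_errors(guess: Set[Span], gold: Set[Span]) -> int:
    """Assumed error measure: guessed spans that cross some gold span."""
    return sum(1 for g in guess if any(crossing(g, t) for t in gold))


def group_first_two(spans: Set[Span], n: int) -> Set[Span]:
    """Toy transformation: ( w0 ( w1 X ) ) becomes ( ( w0 w1 ) X )."""
    if (1, n) in spans and (n == 3 or (2, n) in spans):
        out = set(spans)
        out.discard((1, n))
        out.add((0, 2))
        return out
    return set(spans)


def total_errors(guesses: List[Set[Span]], corpus: List[Tuple[int, Set[Span]]]) -> int:
    return sum(crossing_errors(g, gold) for g, (_, gold) in zip(guesses, corpus))


def learn(corpus: List[Tuple[int, Set[Span]]],
          candidates: Dict[str, Transform],
          max_rules: int = 10) -> Tuple[List[str], List[Set[Span]]]:
    """Greedily keep the candidate that most reduces error; stop when none helps."""
    guesses = [right_linear_spans(n) for n, _ in corpus]
    learned: List[str] = []
    for _ in range(max_rules):
        best_name, best_err, best_guesses = None, total_errors(guesses, corpus), None
        for name, fn in candidates.items():
            trial = [fn(g, n) for g, (n, _) in zip(guesses, corpus)]
            err = total_errors(trial, corpus)
            if err < best_err:
                best_name, best_err, best_guesses = name, err, trial
        if best_name is None:
            break
        learned.append(best_name)
        guesses = best_guesses
    return learned, guesses
```

For a three-word sentence whose training bracketing is ((w0 w1) w2), calling `learn([(3, {(0, 3), (0, 2)})], {"group_first_two": group_first_two})` begins with one crossing span and selects the toy transformation once, after which no candidate lowers the error further and the loop stops.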
In [MM90], mutual information statistics are extracted from a corpus of text and this information is then used to parse new text. [Sam86] defines a function to score the quality of parse trees, and then uses simulated annealing to heuristically explore the entire space of possible parses for a given sentence. In [BM92a], distributional analysis techniques are applied to a large corpus to learn a context-free grammar. The most promising results to date have been

The author would like to thank Mark Liberman, Meiting Lu, David Magerman, Mitch Marcus, Rich Pito, Giorgio Satta, Yves Schabes and Tom Veatch. This work was supported by DARPA and AFOSR jointly under grant No. AFOSR-90-0066, and by ARO grant No.