Scientific report: "Adapting a WSJ-Trained Parser to Grammatically Noisy Text"

Jennifer Foster, Joachim Wagner and Josef van Genabith
National Centre for Language Technology, Dublin City University, Ireland
jfoster jwagner josef@

Abstract

We present a robust parser which is trained on a treebank of ungrammatical sentences. The treebank is created automatically by modifying Penn Treebank sentences so that they contain one or more syntactic errors. We evaluate an existing Penn-Treebank-trained parser on the ungrammatical treebank to see how it reacts to noise in the form of grammatical errors. We re-train this parser on the training section of the ungrammatical treebank, leading to significantly improved performance on the ungrammatical test sets. We show how a classifier can be used to prevent performance degradation on the original grammatical data.

1 Introduction

The focus in English parsing research in recent years has moved from Wall Street Journal parsing to improving performance on other domains. Our research aim is to improve parsing performance on text which is mildly ungrammatical, i.e. text which is well-formed enough to be understood by people yet which contains the kind of grammatical errors that are routinely produced by both native and non-native speakers of a language. The intention is not to detect and correct the error but rather to ignore it.

Our approach is to introduce grammatical noise into WSJ sentences while retaining as much of the structure of the original trees as possible. These sentences and their associated trees are then used as training material for a statistical parser. It is important that parsing on grammatical sentences is not harmed, and we introduce a parse-probability-based classifier which allows both grammatical and ungrammatical sentences to be accurately parsed.

2 Background

Various strategies exist to build robustness into the parsing process: grammar constraints can be relaxed (Fouvry, 2003), partial parses can be concatenated to form a