XML-Based Data Preparation for Robust Deep Parsing

Claire Grover and Alex Lascarides
Division of Informatics, The University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, UK
@

Abstract

We describe the use of XML tokenisation, tagging and mark-up tools to prepare a corpus for parsing. Our techniques are generally applicable, but here we focus on parsing Medline abstracts with the ANLT wide-coverage grammar. Hand-crafted grammars inevitably lack coverage, but many coverage failures are due to inadequacies of their lexicons. We describe a method of gaining a degree of robustness by interfacing POS tag information with the existing lexicon. We also show that XML tools provide a sophisticated approach to pre-processing, helping to ameliorate the ‘messiness’ in real language data and improve parse performance.

1 Introduction

The field of parsing technology currently has two distinct strands of research, with few points of contact between them. On the one hand there is thriving research on shallow parsing, chunking and the induction of statistical syntactic analysers from treebanks; on the other hand there are systems which use hand-crafted grammars providing both syntactic and semantic coverage. Shallow approaches have good coverage on corpus data, but extensions to semantic analysis are still in their relative infancy.
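The XML tokenisation and mark-up mentioned above can be illustrated with a minimal sketch using Python's standard `xml.etree.ElementTree`. The tag names (`TEXT`, `S`, `W`) and the `pos` attribute are illustrative assumptions, not the authors' actual mark-up scheme; the tokens are an invented Medline-style example:

```python
import xml.etree.ElementTree as ET

# Hypothetical tokenised, POS-tagged sentence from a Medline-style abstract.
tokens = [("Protein", "NN"), ("kinases", "NNS"), ("phosphorylate", "VBP"),
          ("substrates", "NNS"), (".", ".")]

# Wrap the tokens in XML so successive tools (taggers, lemmatisers, the
# parser's lexical look-up) can read and enrich the same document in stages.
text = ET.Element("TEXT")
s = ET.SubElement(text, "S")
for word, tag in tokens:
    w = ET.SubElement(s, "W")
    w.set("pos", tag)
    w.text = word

xml_string = ET.tostring(text, encoding="unicode")
print(xml_string)
```

The point of the design is that each pre-processing step reads and writes the same XML document, adding attributes or elements incrementally rather than inventing its own intermediate format.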
The deep strand of research has two main problems: inadequate coverage and a lack of reliable techniques to select the correct parse. In this paper we describe ongoing research which uses hybrid technologies to address the problem of inadequate coverage of a deep parsing system. In Section 2 we describe how we have modified an existing hand-crafted grammar's look-up procedure to utilise part-of-speech (POS) tag information, thereby ameliorating the lexical information shortfall. In Section 3 we describe how we combine a variety of existing NLP tools to pre-process real data up to the point where a hand-crafted grammar can start to be …
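The POS-backed look-up described above can be sketched as follows. This is a hedged illustration of the general idea, not the ANLT implementation: the lexicon contents, the tag-to-category mapping, and the entry format are all invented for the example.

```python
# Hand-crafted lexicon: word -> rich lexical entry (invented examples).
lexicon = {
    "phosphorylate": {"cat": "V", "subcat": "transitive"},
    "kinase": {"cat": "N", "count": True},
}

# Generic, underspecified entries keyed by POS tag, used as a fall-back
# when the hand-crafted lexicon has no entry for a word.
pos_defaults = {
    "NN":  {"cat": "N", "count": True},
    "NNS": {"cat": "N", "count": True, "plural": True},
    "VBP": {"cat": "V", "subcat": "unknown"},
}

def lookup(word, pos_tag):
    """Prefer the hand-crafted entry; otherwise derive one from the POS tag."""
    if word in lexicon:
        return lexicon[word]
    return pos_defaults.get(pos_tag, {"cat": "UNKNOWN"})

entry_known = lookup("phosphorylate", "VBP")   # rich entry from the lexicon
entry_unknown = lookup("ubiquitinates", "VBP") # generic entry from the POS tag
```

The pay-off is robustness: parsing no longer fails outright on out-of-lexicon words, at the cost of less precise (e.g. subcategorisation-free) lexical information for those words.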
