Scientific report: "Creating a Corpus of Parse-Annotated Questions"

QuestionBank: Creating a Corpus of Parse-Annotated Questions

John Judge¹, Aoife Cahill¹ and Josef van Genabith¹,²
¹National Centre for Language Technology and School of Computing, Dublin City University, Dublin, Ireland
²IBM Dublin Center for Advanced Studies, IBM Dublin, Ireland
{jjudge, acahill, josef}@

Abstract

This paper describes the development of QuestionBank, a corpus of 4000 parse-annotated questions for (i) use in training parsers employed in QA, and (ii) evaluation of question parsing. We present a series of experiments to investigate the effectiveness of QuestionBank as both an exclusive and supplementary training resource for a state-of-the-art parser in parsing both question and non-question test sets. We introduce a new method for recovering empty nodes and their antecedents (capturing long distance dependencies) from parser output in CFG trees using LFG f-structure reentrancies.

Our main findings are: (i) using QuestionBank training data improves the parser's labelled bracketing f-score by almost 11% over the baseline; (ii) back-testing experiments on non-question data (Penn-II WSJ Section 23) show that the retrained parser does not suffer a performance drop on non-question material; (iii) ablation experiments show that the size of training material provided by QuestionBank is sufficient to achieve optimal results; (iv) our method for recovering empty nodes captures long distance dependencies in questions from the ATIS corpus with high precision and low recall. In summary, QuestionBank provides a useful new resource in parser-based QA research.

1 Introduction

Parse-annotated corpora (treebanks) are crucial for developing machine learning and statistics-based parsing resources for a given language or task. Large treebanks are available for major languages; however, these are often based on a specific text type or genre, e.g. financial newspaper text (the Penn-II Treebank (Marcus et al., 1993)). This can limit the applicability of grammatical resources induced from treebanks, in that such resources underperform when used on a different type of text or for a specific task. In this paper we present work on creating QuestionBank, a treebank of parse-annotated questions, which can be used as a supplementary training resource to allow parsers to accurately parse questions (as well as other text).
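The labelled bracketing f-score mentioned in the abstract can be illustrated with a small sketch. This is a simplified PARSEVAL-style computation, not the evaluation software used by the authors. The tree encoding (nested tuples), the function names and the decision to skip part-of-speech nodes are all assumptions made for illustration, and the example trees are invented.

```python
from collections import Counter


def labelled_spans(tree):
    """Collect (label, start, end) for every phrasal constituent in a tree.

    Assumed encoding: a tree is a nested tuple (label, child, ...) and a
    leaf is a plain string token. Preterminal (part-of-speech) nodes are
    skipped, which is a simplifying assumption of this sketch.
    """
    spans = Counter()

    def walk(node, start):
        if isinstance(node, str):          # a token occupies one position
            return start + 1
        label, *children = node
        end = start
        for child in children:
            end = walk(child, end)
        is_preterminal = all(isinstance(c, str) for c in children)
        if not is_preterminal:
            spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def bracketing_scores(gold, test):
    """Return (precision, recall, f-score) over labelled brackets."""
    gold_spans = labelled_spans(gold)
    test_spans = labelled_spans(test)
    matched = sum((gold_spans & test_spans).values())   # multiset intersection
    n_test = sum(test_spans.values())
    n_gold = sum(gold_spans.values())
    precision = matched / n_test if n_test else 0.0
    recall = matched / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Invented example: the parser output adds a VP bracket the gold tree lacks.
gold = ("SBARQ",
        ("WHNP", ("WP", "What")),
        ("SQ", ("VBZ", "is"), ("NP", ("DT", "the"), ("NN", "capital"))))
test = ("SBARQ",
        ("WHNP", ("WP", "What")),
        ("SQ", ("VP", ("VBZ", "is"),
                ("NP", ("DT", "the"), ("NN", "capital")))))

precision, recall, f_score = bracketing_scores(gold, test)
print(f"precision={precision:.3f} recall={recall:.3f} f-score={f_score:.3f}")
```

In this toy case all four gold brackets are found, but the parser proposes five, so recall is perfect while precision is lower. Standard evaluation tools apply further normalisations (for example, for punctuation) that this sketch omits.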