Parsing and Subcategorization Data

Jianguo Li
Department of Linguistics
The Ohio State University
Columbus, OH, USA
jianguo@

Abstract

In this paper, we compare the performance of a state-of-the-art statistical parser (Bikel, 2004) in parsing written and spoken language and in generating subcategorization cues from written and spoken language. Although Bikel's parser achieves a higher accuracy for parsing written language, it achieves a higher accuracy when extracting subcategorization cues from spoken language. Additionally, we explore the utility of punctuation in helping parsing and extraction of subcategorization cues. Our experiments show that punctuation is of little help in parsing spoken language and extracting subcategorization cues from spoken language. This indicates that there is no need to add punctuation in transcribing spoken corpora simply in order to help parsers.

1 Introduction

Robust statistical syntactic parsers, made possible by new statistical techniques (Collins, 1999; Charniak, 2000; Bikel, 2004) and by the availability of large, hand-annotated training corpora such as WSJ (Marcus et al., 1993) and Switchboard (Godfrey et al., 1992), have had a major impact on the field of natural language processing. There are many ways to make use of parsers' output. One particular form of data that can be extracted from parses is information about subcategorization. Subcategorization data comes in two forms: subcategorization frame (SCF) and subcategorization cue (SCC).
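As a rough illustration of the two forms, the following minimal Python sketch builds both for a single clause. Everything in it is a hypothetical choice made for illustration and is not taken from the paper: the `Sister` class, the helper names `cue` and `frame`, the hyphen-joined label notation, and the example clause. It also assumes that each constituent following the verb has already been marked as an argument or an adjunct.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sister:
    """A constituent that follows the verb inside its verb phrase (hypothetical type)."""
    label: str         # phrase label, e.g. "NP" or "PP"
    is_argument: bool  # True for an argument, False for an adjunct


def cue(sisters: List[Sister]) -> str:
    """Subcategorization cue: keeps arguments and adjuncts alike."""
    return "-".join(s.label for s in sisters)


def frame(sisters: List[Sister]) -> str:
    """Subcategorization frame: keeps arguments only."""
    return "-".join(s.label for s in sisters if s.is_argument)


# Hypothetical clause: "He ate the apple in the kitchen"
# "the apple" is an NP argument of "ate"; "in the kitchen" is a PP adjunct.
sisters = [Sister("NP", True), Sister("PP", False)]

print(cue(sisters))
print(frame(sisters))
```

The argument/adjunct marking is supplied by hand here; deciding it automatically is a separate problem that the sketch does not attempt.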
SCFs differ from SCCs in that SCFs contain only arguments, while SCCs contain both arguments and adjuncts. Both SCFs and SCCs have been crucial to NLP tasks. For example, SCFs have been used for verb disambiguation and classification (Schulte im Walde, 2000; Merlo and Stevenson, 2001; Lapata and Brew, 2004; Merlo et al., 2005) and SCCs for semantic role labeling (Xue and Palmer, 2004; Punyakanok et al., 2005). Current technology for automatically acquiring subcategorization data from corpora usually relies on .