Scientific report: "Scaling Context Space"

Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 231-238.

James R. Curran and Marc Moens
Institute for Communicating and Collaborative Systems, University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, United Kingdom
jamesc marc @

Abstract

Context is used in many NLP systems as an indicator of a term’s syntactic and semantic function. The accuracy of the system is dependent on the quality and quantity of contextual information available to describe each term. However, the quantity variable is no longer fixed by limited corpus resources. Given fixed training time and computational resources, it makes sense for systems to invest time in extracting high quality contextual information from a fixed corpus. However, with an effectively limitless quantity of text available, extraction rate and representation size need to be considered. We use thesaurus extraction with a range of context extracting tools to demonstrate the interaction between context quantity, time and size on a corpus of 300 million words.

1 Introduction

Context plays an important role in many natural language tasks. For example, the accuracy of part of speech taggers or word sense disambiguation systems depends on the quality and quantity of contextual information these systems can extract from the training data.
When predicting the sense of a word, for instance, the immediately preceding word is likely to be more important than the tenth previous word; similar observations can be made about POS taggers or chunkers. A crucial part of training these systems lies in extracting high-quality contextual information from the data, in the sense of defining contexts that are both accurate and correlated with the information (the POS tags, the word senses, the chunks) the system is trying to extract. The quality of contextual information is often determined by the size of the training corpus, with less data available ...
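To make the notion of context concrete, here is a minimal sketch of window-based context extraction. It is an illustration only, not one of the extraction tools evaluated in the paper: the function name `window_contexts`, the `window` parameter and the toy sentence are all assumptions. Each context records the relative position of the neighbouring word, so the immediately preceding word stays distinct from more distant ones.

```python
from collections import Counter, defaultdict


def window_contexts(tokens, window=2):
    """Map each term to a Counter of (relative_offset, neighbour) contexts.

    Hypothetical helper for illustration; not the paper's extractor.
    """
    contexts = defaultdict(Counter)
    for i, term in enumerate(tokens):
        # Clamp the window to the sentence boundaries.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # Keep the offset so position -1 differs from position -10.
                contexts[term][(j - i, tokens[j])] += 1
    return contexts


# Toy input (assumed example sentence, whitespace tokenised).
tokens = "the cat sat on the mat".split()
ctx = window_contexts(tokens, window=1)
```

A larger `window` yields more contexts per term but also a larger representation, which is the kind of quantity versus size trade-off the abstract raises.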
