Scientific report: "Instance Weighting for Domain Adaptation in NLP"

Jing Jiang and ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
jiang4 czhai @

Abstract

Domain adaptation is an important problem in natural language processing (NLP) due to the lack of labeled data in novel domains. In this paper, we study the domain adaptation problem from the instance weighting perspective. We formally analyze and characterize the domain adaptation problem from a distributional view, and show that there are two distinct needs for adaptation, corresponding to the different distributions of instances and classification functions in the source and the target domains. We then propose a general instance weighting framework for domain adaptation. Our empirical results on three NLP tasks show that incorporating and exploiting more information from the target domain through instance weighting is effective.

1 Introduction

Many natural language processing (NLP) problems, such as part-of-speech (POS) tagging, named entity (NE) recognition, relation extraction, and semantic role labeling, are currently solved by supervised learning from manually labeled data. A bottleneck problem with this supervised learning approach is the lack of annotated data. As a special case, we often face the situation where we have a sufficient amount of labeled data in one domain, but have little or no labeled data in another related domain which we are interested in.
We thus face the domain adaptation problem. Following Blitzer et al. (2006), we call the first the source domain and the second the target domain. The domain adaptation problem is commonly encountered in NLP. For example, in POS tagging, the source domain may be tagged WSJ articles, and the target domain may be scientific literature that contains scientific terminology. In NE recognition, the source domain may be annotated news articles, and the target domain may be personal blogs. Another example is personalized spam .