Scientific report: "The Sentimental Factor: Improving Review Classification via Human-Provided Information"

Philip Beineke and Trevor Hastie
Dept. of Statistics, Stanford University, Stanford, CA 94305

Shivakumar Vaithyanathan
IBM Almaden Research Center, 650 Harry Rd., San Jose, CA 95120-6099

Abstract

Sentiment classification is the task of labeling a review document according to the polarity of its prevailing opinion (favorable or unfavorable). In approaching this problem, a model builder often has three sources of information available: a small collection of labeled documents, a large collection of unlabeled documents, and human understanding of language. Ideally, a learning method will utilize all three sources. To accomplish this goal, we generalize an existing procedure that uses the latter two. We extend this procedure by re-interpreting it as a Naive Bayes model for document sentiment. Viewed as such, it can also be seen to extract a pair of derived features that are linearly combined to predict sentiment. This perspective allows us to improve upon previous methods, primarily through two strategies: incorporating additional derived features into the model and, where possible, using labeled data to estimate their relative influence.

1 Introduction

Text documents are available in ever-increasing numbers, making automated techniques for information extraction increasingly useful.
Traditionally, most research effort has been directed towards objective information, such as classification according to topic; however, interest is growing in producing information about the opinions that a document contains (for instance, Morinaga et al., 2002). In March 2004, the American Association for Artificial Intelligence held a symposium in this area, entitled "Exploring Affect and Attitude in Text." One task in opinion extraction is to label a review document $d$ according to its prevailing sentiment $s \in \{-1, 1\}$ (unfavorable or favorable). Several previous papers have addressed this problem by building models that rely exclusively