Scientific report: "User Edits Classification Using Document Revision Histories"

Amit Bronner, Informatics Institute, University of Amsterdam
Christof Monz, Informatics Institute, University of Amsterdam

Abstract

Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabeled user edits. Applied to contiguous edit segments, our method achieves statistically significant improvements over a simple yet effective edit-distance baseline. It reaches high classification accuracy (88%) and is shown to generalize to additional sets of unseen data.

1 Introduction

Many online collaborative editing projects, such as Wikipedia, keep track of complete revision histories. These contain valuable information about the evolution of documents in terms of content as well as language, style, and form. Such data is publicly available in large volumes and is constantly growing.
According to Wikipedia statistics, in August 2011 the English Wikipedia contained [...] million articles with an average of [...] revisions per article. The average number of revision edits per month is about 4 million in English and almost 11 million in total for all languages.²

² Average for the 5-year period between August 2006 and August 2011. The count includes edits by registered [...]

Exploiting document revision histories has proven useful for a variety of natural language processing (NLP) tasks, including sentence compression (Nelken and Yamangil, 2008; Yamangil and Nelken, [...]).
