Scientific report: "Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis"

Manoj Harpalani, Michael Hart, Sandesh Singh, Rob Johnson, and Yejin Choi
Department of Computer Science, Stony Brook University, NY 11794, USA
{mharpalani, mhart, sssingh, rob, ychoi}@[...]

Abstract

Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexico-syntactic patterns based on n-grams.

1 Introduction

Wikipedia, the "free encyclopedia" (Wikipedia, 2011), ranks among the top 200 most visited websites worldwide (Alexa, 2011). This editable encyclopedia has amassed over 15 million articles across hundreds of languages. The English language encyclopedia alone has over [...] million articles and receives over [...] million edits (and sometimes upwards of 3 million) daily (Wikipedia, 2010). But allowing anonymous edits is a double-edged sword: nearly 7% (Potthast, 2010) of edits are vandalism, i.e., revisions to articles that undermine the quality and veracity of the content. As Wikipedia continues to grow, it will become increasingly infeasible for Wikipedia users and administrators to manually police articles. This pressing issue has spawned recent research activities to understand and counteract vandalism (e.g., Geiger and Ribes, 2010). Much of previous work relies on ...
