Authorship Attribution Using Probabilistic Context-Free Grammars

Sindhu Raghavan, Adriana Kovashka, Raymond Mooney
Department of Computer Science, The University of Texas at Austin
1 University Station C0500, Austin, TX 78712-0233, USA
{sindhu, adriana, mooney}@

Abstract

In this paper, we present a novel approach for authorship attribution, the task of identifying the author of a document, using probabilistic context-free grammars. Our approach involves building a probabilistic context-free grammar for each author and using this grammar as a language model for classification. We evaluate the performance of our method on a wide range of datasets to demonstrate its efficacy.

1 Introduction

Natural language processing allows us to build language models, and these models can be used to distinguish between languages. In the context of written text, such as newspaper articles or short stories, the author's style could be considered a distinct language. Authorship attribution, also referred to as authorship identification or prediction, studies strategies for discriminating between the styles of different authors. These strategies have numerous applications, including settling disputes regarding the authorship of old and historically important documents (Mosteller and Wallace, 1984), automatic plagiarism detection, determination of document authenticity in court (Juola and Sofko, 2004), cyber crime investigation (Zheng et al., 2009), and forensics (Luyckx and Daelemans, 2008).
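The idea stated in the abstract, one probabilistic context-free grammar per author used as a language model, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that parse trees are already available from some external parser and are written as nested tuples, that rule probabilities are relative frequencies with additive smoothing, and that a test document is scored from its given trees rather than re-parsed under each grammar. The function names, the smoothing constant `alpha`, and the assumed rule-space size `num_rules` are all hypothetical.

```python
import math
from collections import Counter


def productions(tree):
    """Yield (lhs, rhs) rules from a tree written as (label, child, ...)."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield label, (children[0],)  # lexical rule: tag -> word
        return
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from productions(child)


def train_pcfg(trees):
    """Count rules and left-hand sides over one author's trees."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in trees:
        for lhs, rhs in productions(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return rule_counts, lhs_counts


def tree_log_prob(tree, grammar, alpha=0.1, num_rules=1000):
    """Log-probability of a tree; alpha and num_rules are assumed smoothing settings."""
    rule_counts, lhs_counts = grammar
    total = 0.0
    for lhs, rhs in productions(tree):
        numerator = rule_counts[(lhs, rhs)] + alpha
        denominator = lhs_counts[lhs] + alpha * num_rules
        total += math.log(numerator / denominator)
    return total


def attribute_by_grammar(trees, grammars):
    """Pick the author whose grammar gives the document's trees the highest score."""
    def score(author):
        return sum(tree_log_prob(t, grammars[author]) for t in trees)
    return max(grammars, key=score)


# Toy treebanks: author_a writes subject-first, author_b writes verb-first.
treebanks = {
    "author_a": [("S", ("NP", ("PRP", "she")), ("VP", ("VBD", "left")))],
    "author_b": [("S", ("VP", ("VBD", "left")), ("NP", ("PRP", "she")))],
}
grammars = {author: train_pcfg(trees) for author, trees in treebanks.items()}

test_doc = [("S", ("NP", ("PRP", "he")), ("VP", ("VBD", "slept")))]
print(attribute_by_grammar(test_doc, grammars))
```

In this toy setting the two authors differ only in the rule that expands `S`, so the decision rests entirely on syntactic structure; the unseen words `he` and `slept` receive the same smoothed probability under both grammars.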
The general approach to authorship attribution is to extract a number of style markers from the text and use these style markers as features to train a classifier (Burrows, 1987; Binongo and Smith, 1999; Diederich et al., 2000; Holmes and Forsyth, 1995; Joachims, 1998; Mosteller and Wallace, 1984). These style markers could include the frequencies of certain characters, function words, phrases, or sentences. Peng et al. (2003) build a character-level n-gram model for each author. Stamatatos et al. (1999) and Luyckx and Daelemans (2008) use a combination of word-level statistics and part-of-speech counts or n-grams.
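As an illustration of the per-author character-level n-gram idea mentioned above, the following Python sketch trains one trigram model per author and assigns a text to the author whose model gives it the highest likelihood. It is a simplified sketch and not the formulation of Peng et al. (2003): the add-one smoothing, the space padding at the start of each text, the toy corpus, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def train(texts, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    grams, contexts, alphabet = Counter(), Counter(), set()
    for text in texts:
        padded = " " * (n - 1) + text  # pad so the first characters have a context
        alphabet.update(padded)
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            grams[gram] += 1
            contexts[gram[:-1]] += 1
    return {"grams": grams, "contexts": contexts, "alphabet": alphabet}


def log_likelihood(text, model, alphabet_size, n=3):
    """Add-one smoothed log P(text | author model)."""
    padded = " " * (n - 1) + text
    total = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        numerator = model["grams"][gram] + 1
        denominator = model["contexts"][gram[:-1]] + alphabet_size
        total += math.log(numerator / denominator)
    return total


def attribute(text, models, n=3):
    """Return the author whose model assigns the text the highest likelihood."""
    # Use one shared alphabet so every model is smoothed over the same events.
    alphabet = set(text)
    for model in models.values():
        alphabet |= model["alphabet"]
    size = len(alphabet)
    return max(models, key=lambda a: log_likelihood(text, models[a], size, n))


# Toy corpus, invented for the example.
corpus = {
    "author_a": ["the cat sat on the mat. the cat ate the rat."],
    "author_b": ["zzz quux zzz quux bzzt."],
}
models = {author: train(texts) for author, texts in corpus.items()}
print(attribute("the cat sat", models))
```

Because every candidate model scores the same text with the same number of n-grams, the raw log-likelihoods are directly comparable and no length normalization is needed for the comparison.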
