Native Language Detection with Tree Substitution Grammars

Ben Swanson, Brown University, chonger@cs.brown.edu
Eugene Charniak, Brown University, ec@cs.brown.edu

Abstract

We investigate the potential of Tree Substitution Grammars as a source of features for native language detection, the task of inferring an author's native language from text in a different language. We compare two state-of-the-art methods for Tree Substitution Grammar induction and show that features from both methods outperform previous state-of-the-art results at native language detection. Furthermore, we contrast these two induction algorithms and show that the Bayesian approach produces superior classification results with a smaller feature set.

1 Introduction

The correlation between a person's native language (L1) and aspects of their writing in a second language (L2) can be exploited to predict an L1 label given L2 text. The International Corpus of Learner English (Granger et al., 2002), or ICLE, is a large set of English student essays annotated with L1 labels that allows us to bring the power of supervised machine learning techniques to bear on this task. In this work we explore the possibility of automatically induced Tree Substitution Grammar (TSG) rules as features for a logistic regression model (a.k.a. a Maximum Entropy model) trained to predict these L1 labels.

Automatic TSG induction is made difficult by the exponential number of possible TSG rules given a corpus. This is an active area of research with two distinct effective solutions. The first uses a nonparametric Bayesian model to handle the large number of rules (Cohn and Blunsom, 2010), while the second is inspired by tree kernel methods and extracts common subtrees from pairs of parse trees (Sangati and Zuidema, 2011). While both are effective, we show that the Bayesian method of TSG induction produces superior features and achieves a new best result at the task of native language detection.

2 Related Work

2.1 Native Language Detection

Work in .
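To make the feature representation concrete, the following is a minimal sketch (not the authors' implementation, and independent of either induction algorithm) of what TSG-rule features look like: a TSG rule is a tree fragment whose frontier may contain nonterminals, and a document can be represented by counts of the fragments that occur in its parse trees. The sketch parses a Penn-style bracketed tree and counts depth-1 fragments (ordinary CFG rules) and depth-2 fragments; all function names here are illustrative.

```python
# Sketch: counting small tree fragments from a bracketed parse tree,
# as a stand-in for TSG-rule features fed to a logistic regression model.
# This is illustrative only; real TSG induction selects fragments of
# arbitrary depth via Bayesian sampling or tree-kernel extraction.

from collections import Counter

def parse(s):
    """Parse a Penn-style bracketed string into (label, children) pairs."""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()
    def helper(i):
        label = tokens[i + 1]          # tokens[i] is '('
        i += 2
        children = []
        while tokens[i] != ')':
            if tokens[i] == '(':
                child, i = helper(i)
                children.append(child)
            else:
                children.append((tokens[i], []))  # terminal leaf
                i += 1
        return (label, children), i + 1
    tree, _ = helper(0)
    return tree

def fragments(tree, counts):
    """Count depth-1 and depth-2 fragments rooted at each internal node."""
    label, children = tree
    if not children:
        return
    kid_labels = tuple(c[0] for c in children)
    counts[(label, kid_labels)] += 1   # depth-1 fragment (a CFG rule)
    # depth-2 fragment: additionally expand each internal child one level
    expansion = tuple(
        (c[0], tuple(g[0] for g in c[1])) if c[1] else (c[0],)
        for c in children
    )
    if expansion != tuple((l,) for l in kid_labels):
        counts[(label, expansion)] += 1
    for c in children:
        fragments(c, counts)

counts = Counter()
fragments(parse("(S (NP (DT the) (NN cat)) (VP (VB sat)))"), counts)
print(counts[('S', ('NP', 'VP'))])  # → 1
```

In a classifier, each distinct fragment becomes one feature dimension, and a document's feature vector holds the (possibly normalized) counts of the fragments found in its parses.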