Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Yoshimasa Tsuruoka, Jun'ichi Tsujii, Sophia Ananiadou
School of Computer Science, University of Manchester, UK
National Centre for Text Mining (NaCTeM), UK
Department of Computer Science, University of Tokyo, Japan

Abstract

Stochastic gradient descent (SGD) uses approximate gradients estimated from subsets of the training data and updates the parameters in an online fashion. This learning framework is attractive because it often requires much less training time in practice than batch training algorithms. However, L1-regularization, which is becoming popular in natural language processing because of its ability to produce compact models, cannot be efficiently applied in SGD training, due to the large dimensions of feature vectors and the fluctuations of approximate gradients. We present a simple method to solve these problems by penalizing the weights according to cumulative values for L1 penalty. We evaluate the effectiveness of our method in three applications: text chunking, named entity recognition, and part-of-speech tagging. Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularized log-linear models.

1 Introduction

Log-linear models (maximum entropy models) are one of the most widely-used probabilistic models in the field of natural language processing (NLP). The applications range from simple classification tasks such as text classification and history-based tagging (Ratnaparkhi, 1996) to more complex structured prediction tasks such as part-of-speech (POS) tagging (Lafferty et al., 2001), syntactic parsing (Clark and Curran, 2004), and semantic role labeling (Toutanova et al., 2005). Log-linear models have a major advantage over other discriminative machine learning models such as support vector machines: their .
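To make the cumulative-penalty idea in the abstract concrete, the following Python sketch shows one way such an update might look for a simple binary logistic-regression objective. It is an illustrative assumption, not the paper's exact algorithm: the function name, learning-rate schedule, and variable names are hypothetical, and only the general idea (lazily applying the accumulated L1 penalty to the weights of active features, clipped at zero) is taken from the text.

```python
import numpy as np

def sgd_l1_cumulative(xs, ys, n_features, C=1.0, eta0=0.1, n_epochs=10):
    """SGD for binary logistic regression with an L1 penalty applied cumulatively.

    Instead of subtracting the full L1 penalty at every step, each weight is
    penalized lazily by the gap between the total penalty it could have
    received so far (u) and the penalty it has actually received (q[i]),
    and is clipped at zero so the penalty alone never flips its sign.
    """
    w = np.zeros(n_features)        # model weights
    q = np.zeros(n_features)        # cumulative penalty actually applied to each weight
    u = 0.0                         # total cumulative penalty any weight could have received
    N = len(xs)
    t = 0
    for _ in range(n_epochs):
        for x, y in zip(xs, ys):    # x: sparse features {index: value}, y in {0, 1}
            eta = eta0 / (1.0 + t / float(N))   # assumed decaying learning-rate schedule
            u += eta * C / N                    # accumulate the maximum possible penalty
            # stochastic gradient step on the log-likelihood of one example
            z = sum(w[i] * v for i, v in x.items())
            p = 1.0 / (1.0 + np.exp(-z))
            for i, v in x.items():
                w[i] += eta * (y - p) * v
                w_half = w[i]
                # apply the outstanding cumulative penalty, clipping at zero
                if w_half > 0:
                    w[i] = max(0.0, w_half - (u + q[i]))
                elif w_half < 0:
                    w[i] = min(0.0, w_half + (u - q[i]))
                q[i] += w[i] - w_half
            t += 1
    return w

# Toy usage: two sparse examples over three features.
xs = [{0: 1.0, 2: 1.0}, {1: 1.0}]
ys = [1, 0]
weights = sgd_l1_cumulative(xs, ys, n_features=3)
```

Penalizing only the weights of features active in the current example keeps each update cheap even when the feature space is very large, which is the efficiency concern the abstract raises for L1-regularized SGD.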