Iterative Scaling and Coordinate Descent Methods for Maximum Entropy

Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan
{d93011, b92085, b92084, cjlin}@

Abstract

Maximum entropy (Maxent) is useful in many areas. Iterative scaling (IS) methods are among the most popular approaches for solving Maxent problems. With many variants of IS methods, it is difficult to understand them and see their differences. In this paper, we create a general and unified framework for IS methods. This framework also connects IS and coordinate descent (CD) methods. In addition, we develop a CD method for Maxent. Results show that it is faster than existing iterative scaling methods.¹

1 Introduction

Maximum entropy (Maxent) is widely used in many areas, such as natural language processing (NLP) and document classification. Maxent models the conditional probability as

$$P_w(y|x) \equiv \frac{S_w(x,y)}{T_w(x)}, \qquad (1)$$

where

$$S_w(x,y) \equiv e^{\sum_t w_t f_t(x,y)}, \qquad T_w(x) \equiv \sum_y S_w(x,y),$$

$x$ indicates a context, $y$ is the label of the context, and $w \in R^n$ is the weight vector. A function $f_t(x,y)$ denotes the $t$-th feature extracted from the context $x$ and the label $y$. Given an empirical probability distribution $\tilde{P}(x,y)$ obtained from training samples, Maxent minimizes the following negative log-likelihood:

$$\min_w \; -\sum_{x,y} \tilde{P}(x,y) \log P_w(y|x) = \sum_x \tilde{P}(x) \log T_w(x) - \sum_t w_t \tilde{P}(f_t), \qquad (2)$$

where $\tilde{P}(x) = \sum_y \tilde{P}(x,y)$ is the marginal probability of $x$, and $\tilde{P}(f_t) = \sum_{x,y} \tilde{P}(x,y) f_t(x,y)$ is the expected value of $f_t(x,y)$.
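The model (1) and the equality of the two forms of the negative log-likelihood in (2) can be checked numerically. The following is a minimal sketch on a toy problem with two contexts, two labels, and three binary features; all arrays and values are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Toy setup (illustrative): F[x, y] is the feature vector f(x, y);
# P_tilde[x, y] is the empirical distribution over (context, label) pairs.
F = np.array([[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
              [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0]]])
P_tilde = np.array([[0.4, 0.1],
                    [0.2, 0.3]])
w = np.array([0.5, -0.2, 0.1])

# S_w(x, y) = exp(sum_t w_t f_t(x, y)); T_w(x) = sum_y S_w(x, y)
S = np.exp(F @ w)                  # shape (2, 2)
T = S.sum(axis=1, keepdims=True)   # shape (2, 1)
P_model = S / T                    # P_w(y|x); each row sums to 1

# Left-hand side of (2): -sum_{x,y} P_tilde(x,y) log P_w(y|x)
nll = -(P_tilde * np.log(P_model)).sum()

# Right-hand side of (2): sum_x P_tilde(x) log T_w(x) - sum_t w_t P_tilde(f_t)
Px = P_tilde.sum(axis=1)                  # marginal P_tilde(x)
Pf = np.einsum('xy,xyt->t', P_tilde, F)   # expected features P_tilde(f_t)
nll2 = (Px * np.log(T.ravel())).sum() - w @ Pf

assert abs(nll - nll2) < 1e-12
```

The identity holds because $\log P_w(y|x) = \sum_t w_t f_t(x,y) - \log T_w(x)$, so summing against $\tilde{P}(x,y)$ splits the loss into the two terms on the right-hand side of (2).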
To avoid overfitting the training samples, some add a regularization term to (2) and solve

$$\min_w \; L(w) \equiv \sum_x \tilde{P}(x) \log T_w(x) - \sum_t w_t \tilde{P}(f_t) + \frac{1}{2\sigma^2} \sum_t w_t^2, \qquad (3)$$

where $\sigma$ is a regularization parameter. We focus on (3) instead of (2) because (3) is strictly convex. Iterative scaling (IS) methods are popular for training Maxent models. They all share the same property of solving a one-variable sub-problem at a time. Existing IS methods include generalized iterative scaling (GIS) by Darroch and Ratcliff (1972).

¹A complete version of this work is at http cjlin papers.
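Because (3) is smooth and strictly convex, updating one variable at a time reduces to a one-dimensional convex sub-problem, which a Newton step solves approximately. The sketch below illustrates such a coordinate update on the same toy data as above; the data, the choice $\sigma = 1$, and the helper names are illustrative assumptions, not the paper's actual CD algorithm.

```python
import numpy as np

# Toy data (illustrative): F[x, y] holds the feature vector f(x, y).
F = np.array([[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
              [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0]]])
P_tilde = np.array([[0.4, 0.1], [0.2, 0.3]])
Px = P_tilde.sum(axis=1)                  # marginal P_tilde(x)
Pf = np.einsum('xy,xyt->t', P_tilde, F)   # expected features P_tilde(f_t)
sigma = 1.0                               # regularization parameter in (3)

def L(w):
    """Regularized objective (3)."""
    T = np.exp(F @ w).sum(axis=1)
    return (Px * np.log(T)).sum() - w @ Pf + (w @ w) / (2 * sigma**2)

def cd_step(w, t):
    """One coordinate update of w_t: a Newton step on the one-variable
    sub-problem g(z) = L(w + z * e_t), evaluated at z = 0."""
    S = np.exp(F @ w)
    P_model = S / S.sum(axis=1, keepdims=True)
    # Model expectations of f_t and f_t^2 under P_w(y|x), weighted by P_tilde(x).
    Ef = np.einsum('x,xy,xy->', Px, P_model, F[:, :, t])
    Ef2 = np.einsum('x,xy,xy->', Px, P_model, F[:, :, t]**2)
    g1 = Ef - Pf[t] + w[t] / sigma**2              # dL/dw_t
    var = Ef2 - Px @ (P_model * F[:, :, t]).sum(axis=1)**2
    g2 = var + 1 / sigma**2                        # d^2L/dw_t^2 > 0
    w = w.copy()
    w[t] -= g1 / g2
    return w

w = np.zeros(3)
for _ in range(20):          # a few sweeps over the coordinates
    for t in range(3):
        w = cd_step(w, t)
```

The second derivative is a per-context variance of $f_t$ plus $1/\sigma^2$, so it is strictly positive and the Newton step is always well defined; that strict convexity is exactly why the text prefers (3) over (2).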