Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

Patrick Simianer and Stefan Riezler
Department of Computational Linguistics, Heidelberg University, 69120 Heidelberg, Germany
simianer, riezler @

Chris Dyer
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
cdyer@

Abstract

With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data. Evidence from machine learning indicates that increasing the training sample size results in better prediction. The goal of this paper is to show that this common wisdom can also be brought to bear upon SMT. We deploy local features for SCFG-based SMT that can be read off from rules at runtime, and present a learning algorithm that applies ℓ1/ℓ2 regularization for joint feature selection over distributed stochastic learning processes. We present experiments on learning on millions of training sentences and show significant improvements over tuning discriminative models on small development sets.

1 Introduction

The standard SMT training pipeline combines scores from large count-based translation models and language models with a few other features, and tunes these using the well-understood line-search technique for error minimization of Och (2003). If only a handful of dense features need to be tuned, minimum error rate training can be done on small tuning sets and is hard to beat in terms of accuracy and efficiency.
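The abstract's central idea, ℓ1/ℓ2 regularization for joint feature selection over distributed learners, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: each shard trains its own weight vector, each feature's column of weights across shards forms a group scored by its ℓ2 norm, and only the k highest-scoring groups survive the parameter-mixing step. The function name and the hard top-k selection rule are assumptions made for illustration.

```python
import numpy as np

def l1_l2_select(shard_weights, k):
    """Sketch of l1/l2 (group-lasso style) joint feature selection.

    shard_weights: (num_shards, num_features) array, one weight vector
    per distributed stochastic learner. Each feature's column is a group;
    its l2 norm over shards measures how consistently the shards use it.
    Keep the k features with the largest norms, zero out the rest, and
    average the surviving weights (a simple parameter-mixing step).
    """
    norms = np.linalg.norm(shard_weights, axis=0)  # l2 norm per feature group
    keep = np.argsort(norms)[-k:]                  # indices of the top-k groups
    mixed = shard_weights.mean(axis=0)             # average weights over shards
    selected = np.zeros_like(mixed)
    selected[keep] = mixed[keep]                   # sparse, jointly selected model
    return selected
```

In this sketch, sparsity is enforced at the group level (a feature is kept or dropped for all shards at once), which is what makes the selection "joint" across the distributed learning processes.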
In contrast, the promise of large-scale discriminative training for SMT is to scale to arbitrary types and numbers of features, and to provide sufficient statistical support by parameter estimation on large sample sizes. Features may be lexicalized and sparse, non-local and overlapping, or be designed to generalize beyond surface statistics by incorporating part-of-speech or syntactic labels. The modeler's …
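Sparse, local features that can be read off from SCFG rules at runtime might look like the following sketch. The feature templates and naming scheme here are hypothetical illustrations, not the paper's actual feature set: a rule-identifier feature, a rule-shape feature abstracting terminals to 'w' and nonterminals to 'N', and lexicalized source/target word-pair features.

```python
def rule_features(lhs, src_rhs, tgt_rhs):
    """Read sparse local features off a single SCFG rule (hypothetical templates).

    lhs: nonterminal label, e.g. "X".
    src_rhs, tgt_rhs: lists of symbols; uppercase symbols are treated as
    nonterminals, everything else as terminal words (an assumption made
    for this sketch). Returns a dict from feature name to count.
    """
    feats = {}
    # Rule-identifier feature: fires once per distinct rule.
    rid = "rid:%s->%s|%s" % (lhs, " ".join(src_rhs), " ".join(tgt_rhs))
    feats[rid] = feats.get(rid, 0) + 1
    # Rule-shape feature: generalizes beyond surface words.
    shape = "".join("N" if sym.isupper() else "w" for sym in src_rhs)
    feats["shape:" + shape] = 1
    # Lexicalized word-pair features over terminals only.
    for s in src_rhs:
        for t in tgt_rhs:
            if not s.isupper() and not t.isupper():
                key = "lex:%s|%s" % (s, t)
                feats[key] = feats.get(key, 0) + 1
    return feats
```

Because such features are computed from the rule itself at decoding time, no extra feature extraction pass over the training data is needed; the price is a very large, sparse feature space, which is exactly what motivates the joint feature selection described above.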
