Minimum Risk Annealing for Training Log-Linear Models

David A. Smith and Jason Eisner
Department of Computer Science
Center for Language and Speech Processing
Johns Hopkins University, Baltimore, MD 21218 USA
dasmith eisner @

Abstract

When training the parameters for a natural language system, one would prefer to minimize 1-best loss (error) on an evaluation set. Since the error surface for many natural language problems is piecewise constant and riddled with local minima, many systems instead optimize log-likelihood, which is conveniently differentiable and convex. We propose training instead to minimize the expected loss, or risk. We define this expectation using a probability distribution over hypotheses that we gradually sharpen (anneal) to focus on the 1-best hypothesis. Besides the linear loss functions used in previous work, we also describe techniques for optimizing nonlinear functions such as precision or the Bleu metric. We present experiments training log-linear combinations of models for dependency parsing and for machine translation. In machine translation, annealed minimum risk training achieves significant improvements in Bleu over standard minimum error training. We also show improvements in labeled dependency parsing.
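To make the annealing idea concrete, here is a minimal sketch (not from the paper; the function name, hypothesis scores, and losses are hypothetical) of the expected loss under a distribution p(h) ∝ exp(γ · score(h)). At γ = 0 the distribution is uniform; as γ grows, it sharpens toward the 1-best hypothesis, so the risk approaches 1-best loss:

```python
import math

def annealed_risk(scores, losses, gamma):
    """Expected loss (risk) over a hypothesis list under the annealed
    distribution p(h) proportional to exp(gamma * score(h))."""
    m = max(gamma * s for s in scores)            # stabilize the exponentials
    weights = [math.exp(gamma * s - m) for s in scores]
    z = sum(weights)                              # normalizer
    return sum(w * l for w, l in zip(weights, losses)) / z

scores = [2.0, 1.0, 0.5]   # model scores of three hypotheses (made up)
losses = [0.4, 0.1, 0.9]   # task loss of each hypothesis (made up)
for gamma in (0.0, 1.0, 10.0, 100.0):
    # Risk moves from the average loss toward the 1-best hypothesis's loss.
    print(gamma, annealed_risk(scores, losses, gamma))
```

Unlike 1-best error, this objective is a smooth function of the model scores at any finite γ, which is what makes gradient-based training possible before the final sharpening step.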
1 Direct Minimization of Error

Researchers in empirical natural language processing have expended substantial ink and effort in developing metrics to evaluate systems automatically against gold-standard corpora. The ongoing evaluation literature is perhaps most obvious in the machine translation community's efforts to better Bleu (Papineni et al., 2002). Despite this research, parsing or machine translation systems are often trained using the much simpler and harsher metric of maximum likelihood. One reason is that in supervised training, the log-likelihood objective function is generally convex, meaning that it has a single global maximum that can be easily found; indeed, for supervised generative models, the parameters at this maximum may even .