Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
Bing Xiang and Abraham Ittycheriah
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598

Abstract

In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination of multiple mixture components. Each component contains a large set of features trained in a maximum-entropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between maximum-likelihood training and discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.

1 Introduction

Significant progress has been made in statistical machine translation (SMT) in recent years. Among all the proposed approaches, the phrase-based method (Koehn et al., 2003) has become the most widely adopted in SMT, due to its capability of capturing local context information from adjacent words. A significant amount of work has focused on improving translation performance with better features. The feature set can be either small (on the order of ten) or large (up to millions of features). For example, the system described in (Koehn et al., 2003) is a widely known one that uses a small number of features in a maximum-entropy log-linear model (Och and Ney, 2002). The features include phrase translation probabilities, lexical probabilities, the number of phrases, and language model scores, among others. The feature weights are usually optimized with minimum error rate training (MERT), as in (Och, 2003). Besides the MERT-based feature ...
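To make the feature-tied mixture concrete, the following is a minimal Python sketch of the scoring side of such a model, assuming the mixture takes the form S(e|f) = sum_m lambda_m * log p_m(e|f), with each p_m a maximum-entropy (log-linear) distribution over its own feature set. The names (maxent_log_prob, mixture_score, FeatVec) and the toy data are illustrative assumptions, not from the paper; the paper specifies only that the components are combined log-linearly and that features within a component share one discriminatively trained mixture weight.

import math
from typing import Dict, List

# Hypothetical sparse feature vector: feature name -> value.
FeatVec = Dict[str, float]

def dot(w: FeatVec, phi: FeatVec) -> float:
    return sum(w.get(k, 0.0) * v for k, v in phi.items())

def maxent_log_prob(w: FeatVec, cand: FeatVec,
                    rivals: List[FeatVec]) -> float:
    # One maximum-entropy component:
    # log p(e|f) = w.phi(e,f) - log sum_{e'} exp(w.phi(e',f)),
    # normalized over the candidate and its rival hypotheses.
    scores = [dot(w, phi) for phi in [cand] + rivals]
    m = max(scores)  # log-sum-exp trick for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[0] - log_z

def mixture_score(comp_weights: List[FeatVec], lambdas: List[float],
                  cand: FeatVec, rivals: List[FeatVec]) -> float:
    # Feature-tied mixture: S(e|f) = sum_m lambda_m * log p_m(e|f).
    # Every feature inside component m shares the single mixture
    # weight lambda_m; that sharing is what "ties" the features.
    return sum(lam * maxent_log_prob(w, cand, rivals)
               for lam, w in zip(lambdas, comp_weights))

# Toy usage: two components partitioned by feature type.
comp_weights = [{"phr:p1": 0.8, "phr:p2": -0.3},  # phrase features
                {"lex:l1": 1.2}]                  # lexical features
lambdas = [0.6, 0.4]  # discriminatively trained mixture weights
cand = {"phr:p1": 1.0, "phr:p2": 1.0, "lex:l1": 1.0}
rival = {"phr:p1": 0.5, "lex:l1": 2.0}
print(mixture_score(comp_weights, lambdas, cand, [rival]))

Under this reading, partitioning the feature space by word alignment or by domain, as mentioned in the abstract, would only change how features are assigned to components; the scoring machinery stays the same.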