tailieunhanh - Scientific report: "An Unsupervised Model for Joint Phrase Alignment and Extraction"

Graham Neubig (1,2), Taro Watanabe (2), Eiichiro Sumita (2), Shinsuke Mori (1), Tatsuya Kawahara (1)
(1) Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Sakyo-ku, Kyoto, Japan
(2) National Institute of Information and Communication Technology, 3-5 Hikari-dai, Seika-cho, Soraku-gun, Kyoto, Japan

Abstract

We present an unsupervised model for joint phrase alignment and extraction using nonparametric Bayesian methods and inversion transduction grammars (ITGs). The key contribution is that phrases of many granularities are included directly in the model through the use of a novel formulation that memorizes phrases generated not only by terminal, but also non-terminal symbols. This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs. Experiments on several language pairs demonstrate that the proposed model matches the accuracy of the traditional two-step word alignment/phrase extraction approach while reducing the phrase table to a fraction of the original size.

1 Introduction

The training of translation models for phrase-based statistical machine translation (SMT) systems (Koehn et al., 2003) takes unaligned bilingual training data as input and outputs a scored table of phrase pairs.
This phrase table is traditionally generated by going through a pipeline of two steps: first generating word or minimal phrase alignments, then extracting a phrase table that is consistent with these alignments. However, as DeNero and Klein (2010) note, this two-step approach results in word alignments that are not optimal for the final task of generating phrase tables that are used in translation. As a solution to this, they proposed a supervised discriminative model that performs joint word alignment and phrase extraction, and found that joint estimation of word alignments and extraction sets improves both word .
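
To make the traditional two-step pipeline described in the introduction concrete, the following is a minimal Python sketch of its second step only: extracting every phrase pair that is consistent with a given word alignment. It is not the model proposed in this paper. The function name `extract_phrase_pairs`, the `max_len` limit, and the toy French-English sentence pair with its alignment links are all assumptions made for illustration. A phrase pair is treated as consistent when at least one alignment link falls inside it and no link crosses its boundary.

```python
from typing import List, Set, Tuple


def extract_phrase_pairs(
    src: List[str],
    tgt: List[str],
    links: Set[Tuple[int, int]],
    max_len: int = 3,
) -> Set[Tuple[str, str]]:
    """Brute-force extraction of phrase pairs consistent with word alignment links.

    `links` holds (source_index, target_index) pairs. This is a simplified
    illustration of the conventional heuristic, not an efficient implementation.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            for j1 in range(len(tgt)):
                for j2 in range(j1, min(j1 + max_len, len(tgt))):
                    inside = 0
                    consistent = True
                    for i, j in links:
                        src_in = i1 <= i <= i2
                        tgt_in = j1 <= j <= j2
                        if src_in and tgt_in:
                            inside += 1  # link fully inside the candidate pair
                        elif src_in or tgt_in:
                            consistent = False  # link crosses the boundary
                            break
                    if consistent and inside > 0:
                        pairs.add(
                            (" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1]))
                        )
    return pairs


# Toy example (hypothetical data): "le chat noir" / "the black cat"
src = ["le", "chat", "noir"]
tgt = ["the", "black", "cat"]
links = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrase_pairs(src, tgt, links)
for s, t in sorted(pairs):
    print(f"{s} ||| {t}")
```

In a full pipeline, the extracted pairs would then be counted over the whole corpus and scored, which is where the size of the resulting phrase table becomes a concern.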
