Toward Statistical Machine Translation without Parallel Corpora

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, David Yarowsky
Center for Language and Speech Processing
Johns Hopkins University

Abstract

We estimate the parameters of a phrase-based statistical machine translation system from monolingual corpora instead of a bilingual parallel corpus. We extend existing research on bilingual lexicon induction to estimate both lexical and phrasal translation probabilities for MT-scale phrase tables. We propose a novel algorithm to estimate reordering probabilities from monolingual data. We report translation results for an end-to-end translation system using these monolingual features alone. Our method requires only monolingual corpora in the source and target languages, a small bilingual dictionary, and a small bitext for tuning feature weights. In this paper, we examine an idealization where a phrase table is given. We examine the degradation in translation performance when bilingually estimated translation probabilities are removed, and show that 80% of the loss can be recovered with monolingually estimated features alone. We further show that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated phrase table features.
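To make the idea of scoring translation candidates from monolingual text concrete, here is a minimal sketch of one classic lexicon-induction cue, contextual similarity in the style of Rapp (1995). It is an illustration under stated assumptions, not the authors' implementation: the function names, the window size, the toy sentences, and the seed dictionary are all hypothetical. A source word's context counts are projected through the small bilingual dictionary and compared with a target word's context counts by cosine similarity.

```python
import math
from collections import Counter


def context_vector(tokens, word, window=2):
    """Count the words that occur within +/- `window` tokens of `word`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def project(vector, seed_dict):
    """Map source context words to target words; drop words not in the dictionary."""
    projected = Counter()
    for w, c in vector.items():
        if w in seed_dict:
            projected[seed_dict[w]] += c
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def contextual_similarity(src_tokens, tgt_tokens, src_word, tgt_word, seed_dict, window=2):
    """Score a (source word, target word) pair using monolingual contexts only."""
    src_vec = project(context_vector(src_tokens, src_word, window), seed_dict)
    tgt_vec = context_vector(tgt_tokens, tgt_word, window)
    return cosine(src_vec, tgt_vec)


# Toy, unaligned monolingual "corpora" and a tiny seed dictionary (all hypothetical).
SRC = "el gato negro duerme en la casa".split()
TGT = "the black cat sleeps in the house".split()
SEED = {"el": "the", "la": "the", "negro": "black", "duerme": "sleeps", "en": "in"}

score_cat = contextual_similarity(SRC, TGT, "gato", "cat", SEED)
score_house = contextual_similarity(SRC, TGT, "gato", "house", SEED)
```

On this toy data the pair ("gato", "cat") scores higher than ("gato", "house") because their projected contexts overlap more. A score like this is only one possible cue; turning such scores into phrase table features would need real corpora and further modeling that this sketch does not attempt.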
1 Introduction

The parameters of statistical models of translation are typically estimated from large bilingual parallel corpora (Brown et al., 1993). However, these resources are not available for most language pairs, and they are expensive to produce in quantities sufficient for building a good translation system (Germann, 2001). We attempt an entirely different approach: we use cheap and plentiful monolingual resources to induce an end-to-end statistical machine translation system. In particular, we extend the long line of work on inducing translation lexicons (beginning with Rapp, 1995) and propose to use multiple independent cues present in monolingual texts to estimate lexical and phrasal translation probabilities for .