tailieunhanh - Báo cáo khoa học: "Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora"

The parameters of statistical translation models are typically estimated from sentence-aligned parallel corpora. We show that significant improvements in the alignment and translation quality of such models can be achieved by additionally including wordaligned data during training. Incorporating wordlevel alignments into the parameter estimation of the IBM models reduces alignment error rate and increases the Bleu score when compared to training the same models only on sentence-aligned data. . | Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora Chris Callison-Burch David Talbot Miles Osborne School on Informatics University of Edinburgh 2 Buccleuch Place Edinburgh EH8 9LW callison-burch@ Abstract The parameters of statistical translation models are typically estimated from sentence-aligned parallel corpora. We show that significant improvements in the alignment and translation quality of such models can be achieved by additionally including word-aligned data during training. Incorporating wordlevel alignments into the parameter estimation of the IBM models reduces alignment error rate and increases the Bleu score when compared to training the same models only on sentence-aligned data. On the Verbmobil data set we attain a 38 reduction in the alignment error rate and a higher Bleu score with half as many training examples. We discuss how varying the ratio of word-aligned to sentence-aligned data affects the expected performance gain. 1 Introduction Machine translation systems based on probabilistic translation models Brown et al. 1993 are generally trained using sentence-aligned parallel corpora. For many language pairs these exist in abundant quantities. However for new domains or uncommon language pairs extensive parallel corpora are often hard to come by. Two factors could increase the performance of statistical machine translation for new language pairs and domains a reduction in the cost of creating new training data and the development of more efficient methods for exploiting existing training data. Approaches such as harvesting parallel corpora from the web Resnik and Smith 2003 address the creation of data. We take the second complementary approach. We address the problem of efficiently exploiting existing parallel corpora by adding explicit word-level alignments between a number of the sentence pairs in the training corpus. We modify the standard parameter estimation procedure for IBM Models and HMM variants .

TỪ KHÓA LIÊN QUAN
crossorigin="anonymous">
Đã phát hiện trình chặn quảng cáo AdBlock
Trang web này phụ thuộc vào doanh thu từ số lần hiển thị quảng cáo để tồn tại. Vui lòng tắt trình chặn quảng cáo của bạn hoặc tạm dừng tính năng chặn quảng cáo cho trang web này.