Scientific report: "Recent Improvements in the CMU Large Scale Chinese-English SMT System"

Almut Silja Hildebrand, Kay Rottmann, Mohamed Noamany, Qin Gao, Sanjika Hewavitharana, Nguyen Bach and Stephan Vogel
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
silja, kayrm, mfn, qing, sanjika, nbach, vogel @

Abstract

In this paper we describe recent improvements to components and methods used in our statistical machine translation system for Chinese-English, as used in the January 2008 GALE evaluation. The main improvements result from consistent data processing, larger statistical models and a POS-based word reordering approach.

1 Introduction

Building a full-scale Statistical Machine Translation (SMT) system involves many preparation and training steps, and the system consists of several components, each of which contributes to the overall system performance. Between 2007 and 2008 our system improved by 5 points in BLEU from to for the unseen MT06 test set, which can be mainly attributed to two major points.

The fast growth of computing resources over the years makes it possible to use larger and larger amounts of data in training. In Section 3 we show how parallelizing model training can reduce training time by an order of magnitude, and how using larger training data as well as more extensive models improves translation quality.

Word reordering is still a difficult problem in SMT. In Section 4 we successfully apply a Part-Of-Speech (POS) based syntactic reordering model to our large Chinese system.

Decoder

Our translation system is based on the CMU SMT decoder as described in (Hewavitharana et al., 2005). Our decoder is a phrase-based beam search decoder which combines multiple models (e.g. phrase tables, several language models, a distortion model, etc.) in a log-linear fashion. In order to find an optimal set of weights we use minimum error rate (MER) training as described in (Venugopal et al., 2005), which uses rescoring of the top n hypotheses to maximize an evaluation metric like BLEU or
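
To make the log-linear combination and the n-best rescoring idea concrete, here is a minimal sketch in Python. It is not the CMU decoder or the MER procedure of Venugopal et al.: the feature names (`tm`, `lm`, `dist`), the toy 2-best list, the candidate weight vectors and the exact-match metric standing in for BLEU are all assumptions made for illustration. The sketch only shows that each hypothesis receives a weighted sum of log-domain model scores, and that changing the weights can change which hypothesis ranks first, which is what weight tuning exploits.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]          # log-domain model scores for one hypothesis
NBest = List[Tuple[str, Features]]   # (hypothesis text, its feature scores)


def loglinear_score(features: Features, weights: Dict[str, float]) -> float:
    """Weighted sum of log-domain feature scores (the log-linear model score)."""
    return sum(weights[name] * value for name, value in features.items())


def rescore(nbest: NBest, weights: Dict[str, float]) -> str:
    """Return the hypothesis text with the highest log-linear score."""
    return max(nbest, key=lambda hyp: loglinear_score(hyp[1], weights))[0]


def tune(dev: List[Tuple[NBest, str]],
         candidates: List[Dict[str, float]],
         metric: Callable[[str, str], float]) -> Dict[str, float]:
    """Pick the candidate weight vector whose 1-best outputs score highest.

    A brute-force stand-in for weight optimization: it simply tries each
    candidate weight vector and keeps the best one on the dev set.
    """
    def total(weights: Dict[str, float]) -> float:
        return sum(metric(rescore(nbest, weights), ref) for nbest, ref in dev)
    return max(candidates, key=total)


# Hypothetical 2-best list with made-up scores for three models.
nbest = [
    ("the house is small", {"tm": -2.0, "lm": -4.0, "dist": -1.0}),
    ("small is the house", {"tm": -1.5, "lm": -6.0, "dist": -3.0}),
]

uniform = {"tm": 1.0, "lm": 1.0, "dist": 1.0}
tm_heavy = {"tm": 5.0, "lm": 0.1, "dist": 0.1}


def exact_match(hyp: str, ref: str) -> float:
    """Toy metric standing in for BLEU: 1.0 on exact match, else 0.0."""
    return 1.0 if hyp == ref else 0.0


dev = [(nbest, "the house is small")]
best_weights = tune(dev, [tm_heavy, uniform], exact_match)
```

With uniform weights the first hypothesis wins; with the translation-model-heavy weights the second one does, so `tune` selects the uniform weights for this toy dev set. A real tuner searches a continuous weight space and re-decodes to refresh the n-best lists, which this sketch does not attempt.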
