Scientific report: "On the use of Comparable Corpora to improve SMT performance"

Sadaf Abdul-Rauf and Holger Schwenk
LIUM, University of Le Mans, FRANCE

Abstract

We present a simple and effective method for extracting parallel sentences from comparable corpora. We employ a statistical machine translation (SMT) system built from small amounts of parallel texts to translate the source side of the nonparallel corpus. The target side texts are used, along with other corpora, in the language model of this SMT system. We then use information retrieval techniques and simple filters to create French/English parallel data from comparable news corpora. We evaluate the quality of the extracted data by showing that it significantly improves the performance of an SMT system.

1 Introduction

Parallel corpora have proved to be an indispensable resource in Statistical Machine Translation (SMT). A parallel corpus, also called bitext, consists of bilingual texts aligned at the sentence level. They have also proved to be useful in a range of natural language processing applications like automatic lexical acquisition, cross-language information retrieval and annotation projection. Unfortunately, parallel corpora are a limited resource, with insufficient coverage of many language pairs and application domains of interest. The performance of an SMT system heavily depends on the parallel corpus used for training.
Generally, more bitexts lead to better performance. Current resources of parallel corpora cover few language pairs and mostly come from one domain (proceedings of the Canadian or European Parliament, or of the United Nations). This becomes particularly problematic when SMT systems trained on such corpora are used for general translations, as the jargon heavily used in these corpora is not appropriate for everyday translations or for translations in some other domain. One option to increase this scarce resource could be to produce more human translations, but this is a ...
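
The extraction method summarized in the abstract (translate the source side with an SMT system, retrieve candidate target sentences, then filter) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the TF-IDF cosine retrieval, the function names, and the `min_sim` and `max_len_ratio` thresholds are hypothetical, since the text above names neither the retrieval toolkit nor the exact filters. The SMT translations are assumed to be already available as input.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption: input is already tokenized).
    return text.lower().split()


def build_idf(token_lists):
    # Smoothed inverse document frequency over the target-side sentences.
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    # Terms unseen on the target side get weight 0.
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def extract_pairs(source_sents, translations, target_sents,
                  min_sim=0.5, max_len_ratio=1.5):
    """Pair each source sentence with its best-matching target sentence.

    translations[i] is the (assumed given) SMT output for source_sents[i].
    min_sim and max_len_ratio are hypothetical filter thresholds.
    """
    target_tokens = [tokenize(s) for s in target_sents]
    idf = build_idf(target_tokens)
    target_vecs = [tfidf(t, idf) for t in target_tokens]
    pairs = []
    if not target_vecs:
        return pairs
    for src, hyp in zip(source_sents, translations):
        hyp_tokens = tokenize(hyp)
        query = tfidf(hyp_tokens, idf)
        # Retrieval step: the translation is used as a query.
        scores = [cosine(query, vec) for vec in target_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        longer = max(len(hyp_tokens), len(target_tokens[best]))
        shorter = max(1, min(len(hyp_tokens), len(target_tokens[best])))
        # Filter step: similarity threshold plus length-ratio check.
        if scores[best] >= min_sim and longer / shorter <= max_len_ratio:
            pairs.append((src, target_sents[best], scores[best]))
    return pairs
```

Calling `extract_pairs` with the French sentences, their machine translations, and the English side of the comparable corpus returns `(source, target, score)` triples for the pairs that pass both filters. The retained pairs would then be added to the training bitext.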
