tailieunhanh - Scientific report: "Combination of Arabic Preprocessing Schemes for Statistical Machine Translation"

Fatiha Sadat
Institute for Information Technology, National Research Council of Canada

Nizar Habash
Center for Computational Learning Systems, Columbia University
habash@

Abstract

Statistical machine translation is quite robust when it comes to the choice of input representation. It only requires consistency between training and testing. As a result, there is a wide range of possible preprocessing choices for data used in statistical machine translation. This is even more so for morphologically rich languages such as Arabic. In this paper, we study the effect of different word-level preprocessing schemes for Arabic on the quality of phrase-based statistical machine translation. We also present and evaluate different methods for combining preprocessing schemes, resulting in improved translation quality.

1 Introduction

Statistical machine translation (SMT) is quite robust when it comes to the choice of input representation. It only requires consistency between training and testing. As a result, there is a wide range of possible preprocessing choices for data used in SMT. This is even more so for morphologically rich languages such as Arabic. We use the term "preprocessing" to describe various input modifications applied to raw training and testing texts for SMT. Preprocessing includes different kinds of tokenization, stemming, part-of-speech (POS) tagging, and lemmatization.
The ultimate goal of preprocessing is to improve the quality of the SMT output by addressing issues such as sparsity in training data. We refer to a specific kind of preprocessing as a "scheme" and differentiate it from the "technique" used to obtain it. In a previous publication, we presented results describing six preprocessing schemes for Arabic (Habash and Sadat, 2006). These schemes were evaluated against three different techniques that vary in linguistic complexity, and across a learning ...
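
As a rough illustration of how a preprocessing scheme can reduce sparsity, the sketch below applies two toy schemes consistently to a tiny training and test set and lists the test tokens never seen in training. Everything in it is an assumption made for illustration: the Buckwalter-style example words, the lookup table `TOY_SEGMENTS`, and the function names are hypothetical and are not the schemes or techniques evaluated in the paper. A real system would obtain segmentations from a morphological analyzer rather than a hand-written table.

```python
from collections import Counter

# Hypothetical segmentation table (Buckwalter-style transliteration).
# It only splits off the conjunction prefix "w+" for two example words.
TOY_SEGMENTS = {
    "wktb": ["w+", "ktb"],
    "wqAl": ["w+", "qAl"],
}


def simple_scheme(sentence):
    """Toy scheme 1: whitespace tokenization only."""
    return sentence.split()


def split_scheme(sentence):
    """Toy scheme 2: whitespace tokenization, then split known prefixes."""
    tokens = []
    for word in sentence.split():
        tokens.extend(TOY_SEGMENTS.get(word, [word]))
    return tokens


def vocabulary(corpus, scheme):
    """Count token types after applying one scheme to every sentence."""
    counts = Counter()
    for sentence in corpus:
        counts.update(scheme(sentence))
    return counts


def oov_tokens(train, test, scheme):
    """Test tokens unseen in training, with the same scheme on both sides."""
    vocab = vocabulary(train, scheme)
    return [tok for sentence in test for tok in scheme(sentence) if tok not in vocab]


train = ["ktb Alwld", "wktb Alwld", "qAl Alrjl"]
test = ["wqAl Alwld"]

print(oov_tokens(train, test, simple_scheme))
print(oov_tokens(train, test, split_scheme))
```

In this toy data the unsplit form `wqAl` never occurs in training, while its parts `w+` and `qAl` both do, so the splitting scheme leaves no unseen test tokens. The same scheme is applied to training and test text in both cases, which is the consistency requirement stated in the introduction.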
