tailieunhanh - Scientific report: "A Beam-Search Extraction Algorithm for Comparable Data"

Christoph Tillmann
IBM T.J. Watson Research Center
Yorktown Heights, N.Y. 10598
ctill@

Abstract

This paper extends previous work on extracting parallel sentence pairs from comparable data (Munteanu and Marcu, 2005). For a given source sentence S, a maximum entropy (ME) classifier is applied to a large set of candidate target translations. A beam-search algorithm is used to abandon target sentences as non-parallel early on during classification if they fall outside the beam. This way, our novel algorithm avoids any document-level prefiltering step. The algorithm increases the number of extracted parallel sentence pairs significantly, which leads to a BLEU improvement of about 1 on our Spanish-English data.

1 Introduction

The paper presents a novel algorithm for extracting parallel sentence pairs from comparable monolingual news data. We select source-target sentence pairs (S, T) based on an ME classifier (Munteanu and Marcu, 2005). Because the set of target sentences T considered can be huge, previous work (Fung and Cheung, 2004; Resnik and Smith, 2003; Snover et al., 2008; Munteanu and Marcu, 2005) pre-selects target sentences T at the document level. We have re-implemented a particular filtering scheme based on BM25 (Quirk et al., 2007; Utiyama and Isahara, 2003; Robertson et al., 1995). In this paper, we demonstrate a different strategy. We compute the ME score incrementally at the word level and apply a beam-search algorithm to a large number of sentences.
We abandon target sentences early on during classification if they fall outside the beam. For comparison purposes, we run our novel extraction algorithm with and without the document-level prefiltering step. The results in Section 4 show that the number of extracted sentence pairs is more than doubled, which also leads to an increase in BLEU by about 1 on the Spanish-English data.

The classification probability is defined as follows:

$$ p(c \mid S, T) = \frac{\exp\left( w^{\top} \cdot f(c, S, T) \right)}{Z(S, T)} \tag{1} $$
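The idea of scoring incrementally and pruning with a beam can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that the classifier score decomposes additively over source words, and that the feature vector for c = 0 is zero, so that the normalizer reduces to 1 + exp(score). The names `beam_extract`, `word_score`, `beam_width`, and the toy lexicon are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def me_probability(score: float) -> float:
    """p(c=1 | S, T) for a binary log-linear model.

    Assumption: f(0, S, T) = 0, so Z(S, T) = 1 + exp(score).
    Both branches are algebraically equal; the split avoids overflow.
    """
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def beam_extract(
    source: Sequence[str],
    candidates: List[Sequence[str]],
    word_score: Callable[[str, Sequence[str]], float],
    beam_width: float,
) -> Dict[int, float]:
    """Score all candidates word by word, pruning those outside the beam.

    Returns a map from surviving candidate index to p(c=1 | S, T).
    """
    alive = {i: 0.0 for i in range(len(candidates))}
    for word in source:
        if not alive:
            break
        # Extend every surviving partial score by one source word.
        for i in alive:
            alive[i] += word_score(word, candidates[i])
        # Abandon candidates whose partial score falls outside the beam.
        best = max(alive.values())
        alive = {i: s for i, s in alive.items() if s >= best - beam_width}
    return {i: me_probability(s) for i, s in alive.items()}


# Hypothetical toy lexicon: weight for a (source word, target word) pair.
LEXICON: Dict[Tuple[str, str], float] = {
    ("casa", "house"): 2.0,
    ("roja", "red"): 2.0,
}
MISS = -1.0  # hypothetical penalty when no target word matches


def toy_word_score(word: str, target: Sequence[str]) -> float:
    """Best lexicon weight of `word` against any word in `target`."""
    return max((LEXICON.get((word, t), MISS) for t in target), default=MISS)


survivors = beam_extract(
    ["casa", "roja"],
    [["red", "house"], ["blue", "car"], ["house"]],
    toy_word_score,
    beam_width=2.5,
)
print(survivors)  # only candidate 0 remains inside the beam
```

In this toy run, the second candidate is dropped after the first source word and the third after the second, so the full score is computed only for the first candidate. A wider beam keeps more candidates alive at the cost of more scoring work.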
