Scientific report: "Data Cleaning for Word Alignment"

Data Cleaning for Word Alignment

Tsuyoshi Okita
CNGL School of Computing
Dublin City University, Glasnevin, Dublin 9
tokita@

Abstract

Parallel corpora are made by human beings. However, as an MT system is an aggregation of state-of-the-art NLP technologies without any human intervention, it is unavoidable that quite a few sentence pairs are beyond its analysis and will therefore not contribute to the system. Furthermore, they may in turn work against our objectives and make the overall performance worse. Possible unfavorable items are n:m mapping objects, such as paraphrases, non-literal translations, and multiword expressions. This paper presents a pre-processing method which detects such unfavorable items before supplying them to the word aligner, under the assumption that their frequency is low, such as below 5 percent. We show an improvement of BLEU score from … to … in English-Spanish and from … to … in German-English.

1 Introduction

Phrase alignment (Marcu and Wong, 02) has recently attracted researchers to its theory, although its practice remains in its infancy. However, a phrase extraction heuristic such as grow-diag-final (Koehn et al., 05; Och and Ney, 03) is nowadays considered to be a key process which leads to an overall improvement of MT systems. It is the single difference between word-based SMT (Brown et al., 93) and phrase-based SMT (Koehn et al., 03), where we construct word-based SMT by bidirectional word alignment.

However, technically, this phrase extraction process after word alignment is known to have at least two limitations: 1) the objective of uni-directional word alignment is limited to 1:n mappings, and 2) an atomic unit of a phrase pair used by phrase extraction is thus basically restricted to 1:n or n:1, with small exceptions.

Firstly, the posterior-based approach (Liang, 06) looks at the posterior probability and partially delays the alignment decision. However, this approach does not have any extension in its 1:n .
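The 1:n restriction of uni-directional word alignment discussed above can be illustrated with a small sketch. It assumes that an alignment in one direction is stored as a function from each word position on one side to at most one position on the other side. The function names and the toy sentence pair (3 source words, 4 target words) are hypothetical, and the combination step shows only the intersection and union of the two directions, not the full grow-diag-final heuristic.

```python
from collections import Counter

def links_src_given_tgt(a):
    """a[j] = i: each target position j picks at most one source position i.
    None marks an unaligned target word. Returns a set of (src, tgt) links."""
    return {(i, j) for j, i in enumerate(a) if i is not None}

def links_tgt_given_src(a):
    """a[i] = j: each source position i picks at most one target position j.
    None marks an unaligned source word. Returns a set of (src, tgt) links."""
    return {(i, j) for i, j in enumerate(a) if j is not None}

def max_fanout(links):
    """Return (max links on any source word, max links on any target word)."""
    per_src = Counter(i for i, _ in links)
    per_tgt = Counter(j for _, j in links)
    return (max(per_src.values(), default=0), max(per_tgt.values(), default=0))

# Toy alignments (illustrative data only).
forward = links_src_given_tgt([0, 1, 1, 2])   # one entry per target word
backward = links_tgt_given_src([0, 2, 2])     # one entry per source word

# Each direction alone is a function, so one side never has more than one link.
print(max_fanout(forward))    # source words may fan out, target words may not
print(max_fanout(backward))   # target words may fan out, source words may not

# Combining both directions is what makes n:m groupings expressible.
intersection = forward & backward
union = forward | backward
print(sorted(intersection))
print(sorted(union), max_fanout(union))
```

In this toy case neither direction can link one target word to two source words and one source word to two target words at the same time, while the union of the two directions can. Symmetrization heuristics operate between the high-precision intersection and the high-recall union.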
