Scientific report: "Domain Adaptation for Machine Translation by Mining Unseen Words"
Hal Daumé III
University of Maryland, College Park, USA
hal@

Jagadeesh Jagarlamudi
University of Maryland, College Park, USA
jags@

Abstract

We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrase-based translation system, yielding consistent improvements in translation quality (between and Bleu points) on four domains and two language pairs.

1 Introduction

Large amounts of data are currently available to train statistical machine translation systems. Unfortunately, these training data are often qualitatively different from the target task of the translation system. In this paper, we consider one specific aspect of domain divergence (Jiang, 2008; Blitzer and Daumé III, 2010): the out-of-vocabulary problem. By considering four different target domains (news, medical, movie subtitles, technical documentation) in two source languages (German, French), we:

1. Ascertain the degree to which domain divergence causes increases in unseen words, and the degree to which this degrades translation performance. For instance, if all unknown words are names, then copying them verbatim may be sufficient.
2. Extend known methods for mining dictionaries from comparable corpora to the domain adaptation setting, by bootstrapping them based on known translations from the source domain.
3. Develop methods for integrating these mined dictionaries into a phrase-based translation system (Koehn et al., 2007).

As we shall see, for most target domains, out-of-vocabulary terms are the source of approximately half of the additional errors made. The only exception is the news domain, which is sufficiently similar to parliament proceedings (Europarl).
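
The first goal above, measuring how much a change of domain increases the number of unseen words, can be illustrated with a small sketch. This is not the authors' measurement code: the function name `oov_rates`, the lowercasing, and the whitespace tokenization are assumptions made only for illustration, and a real experiment would use the tokenizer of the translation system itself.

```python
from collections import Counter


def oov_rates(train_sentences, test_sentences):
    """Return (token_rate, type_rate) of test words missing from the training vocabulary.

    Hypothetical helper: lowercases and splits on whitespace, which is a
    simplifying assumption rather than the paper's preprocessing.
    """
    # Vocabulary seen in the source-domain training data.
    train_vocab = {tok for sent in train_sentences for tok in sent.lower().split()}

    # Word frequencies in the new-domain text.
    test_counts = Counter(
        tok for sent in test_sentences for tok in sent.lower().split()
    )
    total_tokens = sum(test_counts.values())
    if total_tokens == 0:
        return 0.0, 0.0

    # Running words (tokens) and distinct words (types) that were never seen.
    oov_tokens = sum(c for w, c in test_counts.items() if w not in train_vocab)
    oov_types = sum(1 for w in test_counts if w not in train_vocab)
    return oov_tokens / total_tokens, oov_types / len(test_counts)


# Toy data standing in for a parliamentary source domain and a medical target domain.
train = ["the parliament votes today", "the commission agrees"]
test = ["the patient receives the dose"]
token_rate, type_rate = oov_rates(train, test)
print(f"OOV token rate: {token_rate:.2f}, OOV type rate: {type_rate:.2f}")
```

Reporting both rates is a deliberate choice in this sketch: the token rate reflects how often a decoder would meet an unknown word in running text, while the type rate reflects how many distinct dictionary entries would have to be mined.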