Scientific report: "Enhancing Statistical Machine Translation with Character Alignment"

Ning Xi, Guangchao Tang, Xinyu Dai, Shujian Huang, Jiajun Chen
State Key Laboratory for Novel Software Technology
Department of Computer Science and Technology
Nanjing University, Nanjing 210046, China
xin tanggc dxy huangsj chenjj @

Abstract

The dominant practice of statistical machine translation (SMT) uses the same Chinese word segmentation specification in both the alignment and the translation rule induction steps when building a Chinese-English SMT system. This may be suboptimal, because the word segmentation that is better for alignment is not necessarily better for translation. To tackle this, we propose a framework that uses two different segmentation specifications for alignment and translation respectively: we use the Chinese character as the basic unit for alignment, and then convert this alignment to a conventional word alignment for translation rule induction. Experimentally, our approach outperformed two baselines (a fully word-based system, which uses words for both alignment and translation, and a fully character-based system) in terms of alignment quality and translation performance.

1 Introduction

Chinese word segmentation is a necessary step in Chinese-English statistical machine translation (SMT) because Chinese sentences do not delimit words by spaces. The key characteristic of a Chinese word segmenter is the segmentation specification. As depicted in Figure 1(a), the dominant practice of SMT uses the same word segmentation for both word alignment and translation rule induction.
For brevity, we refer to the word segmentation of the bilingual corpus as word segmentation for alignment (WSA for short), because it determines the basic tokens for alignment, and to the word segmentation of the aligned corpus as word segmentation for rules (WSR for short), because it determines the basic tokens of translation rules.

[Figure 1: Bilingual Corpus (WSA) -> Word alignment -> Aligned Corpus (WSA) -> Rule induction -> Translation Rules (WSR)]
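
The abstract describes aligning at the character level and then converting that alignment into a conventional word alignment for rule induction. The following is a minimal sketch of one way such a conversion could work. It assumes a simple projection rule: a Chinese word is linked to an English token whenever any of its characters is linked to that token. This rule, the function name char_to_word_alignment, and the toy data are illustrative assumptions, not necessarily the conversion the authors use.

```python
def char_to_word_alignment(words, char_links):
    """Project character-level alignment links onto segmented words.

    words:      Chinese sentence segmented into words (the WSR segmentation).
    char_links: iterable of (char_index, english_index) pairs produced by
                aligning on single Chinese characters (the WSA segmentation).

    Assumption for illustration: a word inherits every link of every
    character it contains.
    """
    # Map each character position to the index of the word that contains it.
    char_to_word = []
    for word_index, word in enumerate(words):
        char_to_word.extend([word_index] * len(word))

    # Replace each character index by its word index; the set removes
    # duplicate links that arise when several characters of one word
    # point to the same English token.
    word_links = {(char_to_word[c], e) for c, e in char_links}
    return sorted(word_links)


# Toy example (hypothetical data): "机器 翻译" vs. "machine translation".
words = ["机器", "翻译"]
char_links = {(0, 0), (1, 0), (2, 1), (3, 1)}
word_links = char_to_word_alignment(words, char_links)
```

In this toy case the four character links collapse into two word links, one per Chinese word. The resulting links are indexed by the WSR segmentation, so in this sketch WSA (characters) and WSR (words) can differ while rule induction still receives an ordinary word alignment.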
