Scientific report: "Reducing SMT Rule Table with Monolingual Key Phrase"

Reducing SMT Rule Table with Monolingual Key Phrase

Zhongjun He, Yao Meng, Yajuan Lü, Hao Yu, Qun Liu

Fujitsu R&D Center CO., LTD, Beijing, China (hezhongjun, mengyao, yu @)
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China (lvyajuan, liuqun @)

Abstract

This paper presents an effective approach to discarding most entries of the rule table for statistical machine translation. The rule table is filtered by monolingual key phrases, which are extracted from the source text using a technique based on term extraction. Experiments show that the rule table can be reduced by 78% without worsening translation performance. In most cases, our approach results in measurable improvements in BLEU score.

1 Introduction

In the statistical machine translation (SMT) community, the state-of-the-art method is to model translation with rules that contain hierarchical structures, as in the hierarchical phrase-based model (Chiang, 2005). Rules are more powerful than conventional phrase pairs because they contain structural information for capturing long-distance reorderings. However, hierarchical translation systems often suffer from a large rule table (the collection of rules), which makes decoding slow and memory-consuming.

In the training procedure of SMT systems, numerous rules are extracted from the bilingual corpus. During decoding, however, many of them are rarely used. One of the reasons is that these rules are of low quality. Rule quality is usually evaluated by the conditional translation probabilities, which focus on the correspondence between the source and target phrases while ignoring the quality of the phrases in a monolingual corpus.

In this paper, we address the problem of reducing the rule table using information from a monolingual corpus. We use C-value, a measure from automatic term recognition, to score source phrases. A source phrase is regarded as a key phrase if its score is greater than a threshold. Note that a source phrase is either a flat phrase consisting of words or a hierarchical phrase consisting of both words and variables.
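
The key-phrase filtering idea can be illustrated with a small sketch. It assumes the commonly cited C-value definition from automatic term recognition: log2 of the phrase length times the phrase frequency, where a phrase nested in longer candidates has the mean frequency of those candidates subtracted first. The excerpt does not give the paper's exact formula, candidate extraction, or threshold, so the function names (`ngram_counts`, `c_values`, `filter_rules`), the toy corpus, the placeholder targets, and the threshold value are all hypothetical. Only flat source phrases are handled here; how hierarchical phrases with variables are scored is not shown in this excerpt.

```python
import math
from collections import Counter


def ngram_counts(sentences, max_len=3):
    """Count every contiguous word n-gram (lengths 1..max_len) in the source text."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of the strictly longer `long`."""
    n = len(short)
    return len(long) > n and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_values(counts):
    """Score each candidate phrase with the assumed textbook C-value."""
    scores = {}
    for phrase, freq in counts.items():
        # Frequencies of all longer candidates that contain this phrase.
        containers = [f for other, f in counts.items() if is_nested(phrase, other)]
        if containers:
            freq = freq - sum(containers) / len(containers)
        scores[phrase] = math.log2(len(phrase)) * freq
    return scores


def filter_rules(rules, scores, threshold):
    """Keep only rules whose flat source phrase scores above the threshold."""
    return [
        (src, tgt)
        for src, tgt in rules
        if scores.get(tuple(src.split()), 0.0) > threshold
    ]


# Toy monolingual source text and a toy rule table (targets are placeholders).
corpus = [
    "machine translation system",
    "machine translation model",
    "statistical machine translation",
]
scores = c_values(ngram_counts(corpus))
rules = [("machine translation", "TGT_A"), ("translation system", "TGT_B")]
kept = filter_rules(rules, scores, threshold=1.0)
```

With this variant, single-word phrases always score zero because log2(1) = 0, so a real system would need a different length weighting or a separate policy for them. The quadratic scan over candidates in `c_values` is only reasonable for toy data.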