Cutting the Long Tail: Hybrid Language Models for Translation Style Adaptation
Arianna Bisazza and Marcello Federico
Fondazione Bruno Kessler, Trento, Italy

Abstract

In this paper, we address statistical machine translation of public conference talks. Modeling the style of this genre can be very challenging given the shortage of available in-domain training data. We investigate the use of a hybrid LM, where infrequent words are mapped into classes. Hybrid LMs are used to complement word-based LMs with statistics about the language style of the talks. Extensive experiments comparing different settings of the hybrid LM are reported on publicly available benchmarks based on TED talks, from Arabic to English and from English to French. The proposed models prove to better exploit in-domain data than conventional word-based LMs for the target language modeling component of a phrase-based statistical machine translation system.

1 Introduction

The translation of TED conference talks1 is an emerging task in the statistical machine translation (SMT) community (Federico et al., 2011). The variety of topics covered by the speeches, as well as their specific language style, make this a very challenging problem.
Fixed expressions, colloquial terms, figures of speech, and other phenomena recurrent in the talks should be properly modeled to produce translations that are not only fluent but that also employ the right register. In this paper, we propose a language modeling technique that leverages in-domain training data for style adaptation.

1 http talks

Hybrid class-based LMs are trained on text where only infrequent words are mapped to part-of-speech (POS) classes. In this way, topic-specific words are discarded and the model focuses on generic words, which we assume to be more useful for characterizing the language style. The factorization of similar expressions made possible by this mixed text representation yields a better n-gram coverage, but with a much higher .
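The text transformation underlying the hybrid LM can be illustrated with a minimal sketch. The function below is an illustrative reconstruction, not the authors' code: it assumes the corpus is already POS-tagged (as pairs of word and tag) and uses a hypothetical frequency threshold `min_count` to decide which words count as infrequent. Infrequent words are replaced by their POS class, while frequent (style-bearing) words are kept as surface forms; a standard n-gram LM would then be trained on the resulting mixed text.

```python
from collections import Counter

def hybrid_corpus(tagged_sents, min_count=2):
    """Map infrequent words to their POS class; keep frequent words.

    tagged_sents: list of sentences, each a list of (word, pos) pairs.
    min_count: hypothetical frequency threshold (not the paper's setting).
    """
    # Count word frequencies over the whole corpus, case-insensitively.
    freq = Counter(w.lower() for sent in tagged_sents for w, _ in sent)
    # Keep a word if it is frequent enough, otherwise emit its POS tag.
    return [
        [w if freq[w.lower()] >= min_count else pos for w, pos in sent]
        for sent in tagged_sents
    ]

# Toy example with hand-assigned tags: the rare content words "talk" and
# "metaphor" collapse to their shared POS class NN, so the two sentences
# become identical n-gram sequences, while frequent function words survive.
sents = [
    [("this", "DT"), ("is", "VB"), ("a", "DT"), ("talk", "NN")],
    [("this", "DT"), ("is", "VB"), ("a", "DT"), ("metaphor", "NN")],
]
print(hybrid_corpus(sents, min_count=2))
# → [['this', 'is', 'a', 'NN'], ['this', 'is', 'a', 'NN']]
```

This is the factorization the text refers to: distinct topic-specific expressions are merged into one class-level n-gram, improving coverage of the in-domain style data.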