Scientific report: "An Alignment Method for Noisy Parallel Corpora based on Image Processing Techniques"
Jason S. Chang and Mathis H. Chen
Department of Computer Science, National Tsing Hua University, Taiwan
jschang@ mathis @
Phone 886-3-5731069 Fax 886-3-5723694

Abstract

This paper presents a new approach to the bitext correspondence problem (BCP) of noisy bilingual corpora based on image processing (IP) techniques. By using one of several ways of estimating the lexical translation probability (LTP) between pairs of source and target words, we can turn a bitext into a discrete gray-level image. We contend that the BCP, when seen in this light, bears a striking resemblance to the line detection problem in IP. Therefore, BCPs, including sentence and word alignment, can benefit from a wealth of effective, well-established IP techniques, including convolution-based filters, texture analysis and the Hough transform. This paper describes a new program, PlotAlign, that produces a word-level bitext map for noisy or non-literal bitext based on these techniques.

Keywords: alignment, bilingual corpus, image processing

1. Introduction

Aligned corpora have proved very useful in many tasks, including statistical machine translation, bilingual lexicography (Daille, Gaussier and Lange 1993), and word sense disambiguation (Gale, Church and Yarowsky 1992; Chen, Ker, Sheng and Chang 1997).
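As a rough illustration of the idea stated in the abstract, the sketch below turns a toy bitext into a discrete gray-level image and then looks for the best-supported straight line with a basic Hough transform. It is not PlotAlign and does not reproduce the authors' filters or parameters. The function names `bitext_to_image` and `hough_strongest_line`, the `ltp` dictionary, the four gray levels, and the toy probabilities are all assumptions made for this example.

```python
import math
from collections import Counter


def bitext_to_image(source_words, target_words, ltp, levels=4):
    """Map a bitext to a grid of gray levels in 0..levels-1.

    image[i][j] is the quantized LTP between source word i and target word j.
    `ltp` is a hypothetical dict {(source_word, target_word): probability}.
    """
    image = []
    for s in source_words:
        row = []
        for t in target_words:
            p = ltp.get((s, t), 0.0)  # unseen pairs count as probability 0
            row.append(min(levels - 1, int(p * levels)))
        image.append(row)
    return image


def hough_strongest_line(image, n_theta=180):
    """Return (theta_degrees, rho, score) of the best-supported line, or None.

    Each non-zero pixel (x, y) votes, weighted by its gray level, for every
    line x*cos(theta) + y*sin(theta) = rho that passes through it.
    """
    votes = Counter()
    for x, row in enumerate(image):
        for y, gray in enumerate(row):
            if gray == 0:
                continue
            for k in range(n_theta):
                theta = math.pi * k / n_theta
                rho = round(x * math.cos(theta) + y * math.sin(theta))
                votes[(k, rho)] += gray
    if not votes:
        return None
    (k, rho), score = max(votes.items(), key=lambda item: item[1])
    return 180.0 * k / n_theta, rho, score


# Toy bitext whose words correspond one-to-one, in order.
# The probabilities are invented for illustration only.
source = ["the", "house", "is", "small", "today"]
target = ["la", "maison", "est", "petite", "aujourd'hui"]
ltp = {pair: 0.9 for pair in zip(source, target)}

image = bitext_to_image(source, target, ltp)
for row in image:
    print(" ".join(str(g) for g in row))
print(hough_strongest_line(image))
```

With source position as x and target position as y, a monotone word-for-word translation shows up as a bright main diagonal, which in this parameterization is a line with rho 0 and theta near 135 degrees. On an image this small, several neighboring angles collect the same votes, so the reported angle is only approximate. A noisy bitext would add scattered bright pixels away from the line, which is where filtering before the Hough step becomes relevant.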
Several methods have recently been proposed for sentence alignment of the Hansards, an English-French corpus of Canadian parliamentary debates (Brown, Lai and Mercer 1991; Gale and Church 1991a; Simard, Foster and Isabelle 1992; Chen 1993), and for other language pairs such as English-German, English-Chinese and English-Japanese (Church, Dagan, Gale, Fung, Helfman and Satish 1993; Kay and Röscheisen 1993; Wu 1994). The statistical approach to machine translation (SMT) can be understood as a word-by-word model consisting of two sub-models: a language model for generating a source text segment s and a translation model for mapping s to its .