Scientific report: "Reliable Measures for Aligning Japanese-English News Articles and Sentences"

Reliable Measures for Aligning Japanese-English News Articles and Sentences

Masao Utiyama and Hitoshi Isahara
Communications Research Laboratory
3-5 Hikari-dai, Seika-cho, Souraku-gun, Kyoto 619-0289, Japan
mutiyama@ and isahara@

Abstract

We have aligned Japanese and English news articles and sentences to make a large parallel corpus. We first used a method based on cross-language information retrieval (CLIR) to align the Japanese and English articles and then used a method based on dynamic programming (DP) matching to align the Japanese and English sentences in these articles. However, the results included many incorrect alignments. To remove these, we propose two measures (scores) that evaluate the validity of alignments. The measure for article alignment uses similarities in sentences aligned by DP matching, and that for sentence alignment uses similarities in articles aligned by CLIR. They enhance each other to improve the accuracy of alignment. Using these measures, we have successfully constructed a large-scale article and sentence alignment corpus available to the public.

1 Introduction

A large-scale Japanese-English parallel corpus is an invaluable resource in the study of natural language processing (NLP), such as machine translation and cross-language information retrieval (CLIR). It is also valuable for language education. However, no such corpus has been available to the public. We have recently obtained a noisy parallel corpus of Japanese and English newspapers consisting of issues published over more than a decade and have tried to align their articles and sentences.
We first aligned the articles using a method based on CLIR (Collier et al., 1998; Matsumoto and Tanaka, 2002) and then aligned the sentences in these articles by using a method based on dynamic programming (DP) matching (Gale and Church, 1993; Utsuro et al., 1994). However, the results included many incorrect alignments due to noise in the corpus. To remove these, we propose two measures (scores) that ...
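As a rough illustration of what DP matching for sentence alignment involves, the following Python sketch aligns two tokenized articles monotonically. It is not the method used in this work: the Dice overlap `dice`, the `skip_penalty` value, and the restriction to 1-to-1 matches and skips are all assumptions made for the example, and it presumes both sides have already been mapped to a shared vocabulary (for instance by dictionary translation of the Japanese side).

```python
from typing import Callable, List, Optional, Tuple

Tokens = List[str]
Pair = Tuple[Optional[int], Optional[int]]


def dice(a: Tokens, b: Tokens) -> float:
    """Dice overlap between two token lists (a stand-in similarity, assumed)."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return 2.0 * len(sa & sb) / (len(sa) + len(sb))


def align_sentences(
    src: List[Tokens],
    tgt: List[Tokens],
    sim: Callable[[Tokens, Tokens], float] = dice,
    skip_penalty: float = 0.1,  # hypothetical cost of leaving a sentence unaligned
) -> Tuple[List[Pair], float]:
    """Monotone DP alignment allowing 1-to-1 matches and skips on either side."""
    n, m = len(src), len(tgt)
    neg_inf = float("-inf")
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    score[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:  # align src[i-1] with tgt[j-1]
                s = score[i - 1][j - 1] + sim(src[i - 1], tgt[j - 1])
                candidates.append((s, (i - 1, j - 1)))
            if i > 0:  # leave src[i-1] unaligned
                candidates.append((score[i - 1][j] - skip_penalty, (i - 1, j)))
            if j > 0:  # leave tgt[j-1] unaligned
                candidates.append((score[i][j - 1] - skip_penalty, (i, j - 1)))
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the final cell to recover the alignment path.
    pairs: List[Pair] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((i - 1, j - 1))
        elif pi == i - 1:
            pairs.append((i - 1, None))
        else:
            pairs.append((None, j - 1))
        i, j = pi, pj
    pairs.reverse()
    return pairs, score[n][m]


if __name__ == "__main__":
    source = [["stocks", "fell", "tokyo"], ["rain", "expected", "kyoto"]]
    target = [
        ["tokyo", "stocks", "fell", "sharply"],
        ["weather", "update"],
        ["rain", "kyoto", "expected"],
    ]
    alignment, total = align_sentences(source, target)
    print(alignment, round(total, 3))
```

In this toy input the second target sentence has no counterpart, so the best path skips it. A pair with a low similarity can still end up on the best path when the surrounding text is noisy, which is the kind of incorrect alignment that a separate validity score would be needed to filter out.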