tailieunhanh - Scientific report: "Feature-based Method for Document Alignment in Comparable News Corpora"

Thuy Vu, Ai Ti Aw, Min Zhang
Department of Human Language Technology, Institute for Infocomm Research
1 Fusionopolis Way, 21-01 Connexis South Tower, Singapore 138632
tvu aaiti mzhang @

Abstract

In this paper, we present a feature-based method to align documents with similar content across two sets of bilingual comparable corpora from daily news texts. We evaluate the contribution of each individual feature and investigate the incorporation of these diverse statistical and heuristic features for the task of bilingual document alignment. Experimental results on the English-Chinese and English-Malay comparable news corpora show that our proposed Discrete Fourier Transform-based term frequency distribution feature is very effective. It contributes and 8 to performance improvement over Pearson's correlation method on the two comparable corpora. In addition, when more heuristic and statistical features as well as a bilingual dictionary are utilized, our method shows an absolute performance improvement of and on the two sets of bilingual corpora when compared with a prior information retrieval-based method.

1 Introduction

The problem of document alignment is described as the task of aligning documents (news articles, for instance) across two corpora based on content similarity. The groups of corpora can be in the same language or in different languages, depending on the purpose of one's task. In our study, we attempt to align similar documents across comparable corpora that are bilingual: each set is written in a different language but has similar content and domain coverage for different communication needs.

Previous works on monolingual document alignment focus on automatic alignment between documents and their presentation slides, or between documents and their abstracts. Kan (2007) uses two similarity measures, Cosine and Jaccard, to calculate the candidate alignment score in his SlideSeer.
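
For readers unfamiliar with the two similarity measures named above, the following is a minimal sketch of Cosine and Jaccard similarity over tokenized text. It is not Kan's implementation: the function names, the whitespace tokenization, and the use of raw term counts (rather than any particular weighting scheme) are assumptions made purely for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as having no similarity
    return dot / (norm_a * norm_b)


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical example texts, tokenized on whitespace (an assumption).
document = "the market rose sharply today".split()
slide = "market rose today".split()

print(cosine_similarity(document, slide))
print(jaccard_similarity(document, slide))
```

Cosine similarity takes term frequencies into account, while Jaccard similarity considers only which terms occur, so the two scores can rank the same candidate pairs differently.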
