An Efficient Method for Determining Bilingual Word Classes

Proceedings of EACL '99

Franz Josef Och
Lehrstuhl für Informatik VI
RWTH Aachen - University of Technology
Ahornstraße 55, 52056 Aachen, Germany
och@

Abstract

In statistical natural language processing we always face the problem of sparse data. One way to reduce this problem is to group words into equivalence classes, a standard method in statistical language modeling. In this paper we describe a method to determine bilingual word classes suitable for statistical machine translation. We develop an optimization criterion based on a maximum-likelihood approach and describe a clustering algorithm. We show that using the resulting bilingual word classes can improve statistical machine translation.

1 Introduction

Word classes are often used in language modeling to address the problem of sparse data. Various clustering techniques have been proposed (Brown et al., 1992; Jardino and Adda, 1993; Martin et al., 1998) that perform automatic word clustering by optimizing a maximum-likelihood criterion with iterative clustering algorithms. In the field of statistical machine translation we also face the problem of sparse data. Our aim is to use word classes in statistical machine translation to allow for more robust statistical translation models.
A naive approach would be to use monolingually optimized word classes in the source and target languages. Unfortunately, we cannot expect these independently optimized classes to correspond to each other. Monolingually optimized word classes therefore do not seem useful for machine translation (see also Fung and Wu, 1995). We define bilingual word clustering as the process of forming corresponding word classes suitable for machine translation purposes for a pair of languages using a parallel training corpus. The described method to determine bilingual word classes is an extension and improvement of the method mentioned in Och and Weber
