ALIGNING A PARALLEL ENGLISH-CHINESE CORPUS STATISTICALLY WITH LEXICAL CRITERIA

Dekai Wu
HKUST
Department of Computer Science
University of Science & Technology
Clear Water Bay, Hong Kong
Internet: dekai@cs.ust.hk

Abstract

We describe our experience with automatic alignment of sentences in parallel English-Chinese texts. Our report concerns three related topics: (1) progress on the HKUST English-Chinese Parallel Bilingual Corpus; (2) experiments addressing the applicability of Gale & Church's (1991) length-based statistical method to the task of alignment involving a non-Indo-European language; and (3) an improved statistical method that also incorporates domain-specific lexical cues.

INTRODUCTION

Recently, a number of automatic techniques for aligning sentences in parallel bilingual corpora have been proposed (Kay & Röscheisen 1988; Catizone et al. 1989; Gale & Church 1991; Brown et al. 1991; Chen 1993), and coarser approaches for cases where sentences are difficult to identify have also been advanced (Church 1993; Dagan et al. 1993). Such corpora contain the same material that has been translated by human experts into two languages. The goal of alignment is to identify matching sentences between the languages. Alignment is the first stage in extracting structural information and statistical parameters from bilingual corpora.
The problem is made more difficult because a sentence in one language may correspond to multiple sentences in the other; worse yet, sometimes the content of several sentences is distributed across multiple translated sentences.

Approaches to alignment fall into two main classes: lexical and statistical. Lexically-based techniques use extensive online bilingual lexicons to match sentences. In contrast, statistical techniques require almost no prior knowledge and are based solely on the lengths of sentences. The empirical results to date suggest that statistical methods yield performance superior to that of currently available lexical techniques. However, as far as we know, the literature on .
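
To make the length-based statistical approach concrete, the following is a minimal Python sketch of a dynamic-programming aligner in the spirit of Gale & Church (1991). It is not the method developed in this paper. The names `PRIORS`, `C`, `S2`, `match_cost`, and `align` are hypothetical, the numeric parameters are illustrative assumptions that would need re-estimation for English-Chinese text, and the unit of sentence length (characters, bytes, or words) is left to the caller.

```python
import math

# Illustrative priors for each alignment type (source count, target count).
# These values are assumptions for the sketch, not figures from this paper.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0   # assumed expected target length per unit of source length
S2 = 6.8  # assumed variance of that length ratio


def match_cost(len1, len2, prior):
    """Negative log-probability of pairing segments of the given lengths."""
    if len1 == 0 and len2 == 0:
        delta = 0.0
    else:
        # Common variant: normalize by the mean of the two lengths.
        mean = (len1 + len2 / C) / 2.0
        delta = (len2 - len1 * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal at |delta|.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300)) - math.log(prior)


def align(src_lens, tgt_lens):
    """Return the cheapest list of (source indices, target indices) pairs."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            # Try extending the partial alignment with each alignment type.
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_lens[i:ni]),
                                  sum(tgt_lens[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the alignment.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

A call such as `align([30, 32, 40], [61, 41])` takes the sentence lengths of the two texts and returns groups of sentence indices. Because the search allows 2-1, 1-2, and 2-2 groupings as well as insertions and deletions, it can represent the case noted above in which one sentence corresponds to several sentences in the other language. No lexical knowledge is used, which is both the appeal and the limitation of the purely length-based approach.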