Scientific report: "ALIGNING SENTENCES IN BILINGUAL CORPORA USING LEXICAL INFORMATION"

Stanley F. Chen
Aiken Computation Laboratory
Division of Applied Sciences
Harvard University
Cambridge, MA 02138
Internet: sfc@

The author wishes to thank Peter Brown, Stephen DellaPietra, Vincent DellaPietra, and Robert Mercer for their suggestions, support, and relentless taunting. The author also wishes to thank Jan Hajic and Meredith Goldsmith, as well as the aforementioned, for checking the alignments produced by the implementation.

Abstract

In this paper, we describe a fast algorithm for aligning sentences with their translations in a bilingual corpus. Existing efficient algorithms ignore word identities and only consider sentence length (Brown et al., 1991b; Gale and Church, 1991). Our algorithm constructs a simple statistical word-to-word translation model on the fly during alignment. We find the alignment that maximizes the probability of generating the corpus with this translation model. We have achieved an error rate of approximately [...] on Canadian Hansard data, which is a significant improvement over previous results. The algorithm is language independent.

1 Introduction

In this paper, we describe an algorithm for aligning sentences with their translations in a bilingual corpus. Aligned bilingual corpora have proved useful in many tasks, including machine translation (Brown et al., 1990; Sadler, 1989), sense disambiguation (Brown et al., 1991a; Dagan et al., 1991; Gale et al., 1992), and bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and Russell, 1990).

The task is difficult because sentences frequently do not align one-to-one. Sometimes sentences align many-to-one, and often there are deletions in one of the supposedly parallel corpora of a bilingual corpus. These deletions can be substantial; in the Canadian Hansard corpus, there are many deletions of several thousand sentences and one deletion of over 90,000 sentences.

Previous work includes (Brown et al., 1991b) and (Gale and Church, 1991)
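
To make the idea of aligning with lexical information concrete, the following is a minimal sketch, not the algorithm of this paper. It assumes a small hand-written word-to-word translation table (the paper instead builds its model on the fly during alignment) and uses dynamic programming to pick the lowest-cost sequence of alignment units, here called beads. All names (`TOY_MODEL`, `UNSEEN`, `SKIP_PER_WORD`, `bead_cost`, `align`), the set of bead types, and the cost values are hypothetical choices made for illustration.

```python
import math

# Hypothetical toy translation table p(english_word, french_word).
# Fixed by hand here; it stands in for a model learned during alignment.
TOY_MODEL = {
    ("the", "la"): 0.8,
    ("house", "maison"): 0.9,
    ("is", "est"): 0.9,
    ("big", "grande"): 0.9,
    ("thanks", "merci"): 0.9,
}
UNSEEN = 1e-3        # assumed floor probability for word pairs not in the table
SKIP_PER_WORD = 4.0  # assumed cost per word of leaving a sentence unaligned

# Allowed bead shapes: (number of source sentences, number of target sentences).
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]


def bead_cost(src_words, tgt_words, model=TOY_MODEL):
    """Negative log score of pairing two bags of words (lower is better)."""
    if not src_words or not tgt_words:
        # Deletion bead: one side is empty.
        return SKIP_PER_WORD * (len(src_words) + len(tgt_words))
    cost = 0.0
    # Each word on either side is explained by its best match on the other side.
    for t in tgt_words:
        cost -= math.log(max(model.get((s, t), UNSEEN) for s in src_words))
    for s in src_words:
        cost -= math.log(max(model.get((s, t), UNSEEN) for t in tgt_words))
    return cost


def align(src, tgt):
    """Align two lists of tokenized sentences.

    Returns a list of beads ((i0, i1), (j0, j1)), meaning that source
    sentences src[i0:i1] are aligned with target sentences tgt[j0:j1].
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s = [w for sent in src[i:ni] for w in sent]
                t = [w for sent in tgt[j:nj] for w in sent]
                c = best[i][j] + bead_cost(s, t)
                if c < best[ni][nj]:
                    best[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Trace back from the end of both texts to recover the bead sequence.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    beads.reverse()
    return beads


english = [["the", "house", "is", "big"], ["hello", "everyone"], ["thanks"]]
french = [["la", "maison", "est", "grande"], ["merci"]]
print(align(english, french))
```

In this toy input, the second English sentence has no French counterpart, so the sketch leaves it unaligned (a 1:0 bead) and pairs the remaining sentences one-to-one. A purely length-based cost could not use the word matches that drive this choice.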