tailieunhanh - Scientific report: "ALIGNING SENTENCES IN PARALLEL CORPORA"
ALIGNING SENTENCES IN PARALLEL CORPORA
Peter F. Brown, Jennifer C. Lai, and Robert L. Mercer
IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598

ABSTRACT
In this paper we describe a statistical technique for aligning sentences with their translations in two parallel corpora. In addition to certain anchor points that are available in our data, the only information about the sentences that we use for calculating alignments is the number of tokens that they contain. Because we make no use of the lexical details of the sentence, the alignment computation is fast and therefore practical for application to very large collections of text. We have used this technique to align several million sentences in the English-French Hansard corpora and have achieved an accuracy in excess of 99% in a randomly selected set of 1000 sentence pairs that we checked by hand. We show that even without the benefit of anchor points, the correlation between the lengths of aligned sentences is strong enough that we should expect to achieve an accuracy of between 96% and 97%. Thus, the technique may be applicable to a wider variety of texts than we have yet tried.

INTRODUCTION
Recent work by Brown et al. (Brown et al., 1988; Brown et al., 1990) has quickened anew the long dormant idea of using statistical techniques to carry out machine translation from one natural language to another. The lynchpin of their approach is a large collection of pairs of sentences that are mutual translations.
Beyond providing grist to the statistical mill, such pairs of sentences are valuable to researchers in bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and Russell, 1990) and may be useful in other approaches to machine translation (Sadler, 1989). In this paper we consider the problem of extracting from parallel French and English corpora pairs of sentences that are translations of one another. The task is not trivial because at times a single sentence in one language is translated as two or more
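
To make the idea of aligning on sentence length alone more concrete, here is a minimal sketch in Python. It is not the authors' model, which this excerpt does not give. It is a generic dynamic-programming aligner that sees only token counts. The function names, the set of allowed groupings ("beads"), the penalties, and the value of SIGMA are all assumptions chosen for illustration.

```python
import math

# Bead shapes: (English sentences consumed, French sentences consumed) -> penalty.
# These penalties are illustrative guesses, not values from the paper.
BEAD_PENALTY = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}
SIGMA = 0.4  # assumed spread of log(French length / English length)


def bead_cost(en_tokens, fr_tokens, shape):
    """Cost of one bead: a shape penalty plus a length-mismatch term."""
    cost = BEAD_PENALTY[shape]
    if en_tokens and fr_tokens:
        ratio = math.log(fr_tokens / en_tokens)
        cost += ratio * ratio / (2 * SIGMA * SIGMA)
    return cost


def align_by_length(en_lens, fr_lens):
    """Align two lists of sentence token counts.

    Returns beads as ((i0, i1), (j0, j1)): half-open index ranges into
    the English and French sentence lists.
    """
    n, m = len(en_lens), len(fr_lens)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == math.inf:
                continue
            for (di, dj) in BEAD_PENALTY:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = bead_cost(sum(en_lens[i:ni]), sum(fr_lens[j:nj]), (di, dj))
                if best[i][j] + step < best[ni][nj]:
                    best[ni][nj] = best[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the cheapest path.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return beads[::-1]
```

For example, with English token counts [10, 25, 8] and French token counts [11, 13, 14, 9], the sketch pairs the second English sentence with the second and third French sentences, because 25 tokens match 13 + 14 far better than either French sentence alone. No word of either text is inspected, which is why an approach of this kind stays cheap on very large corpora.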