Scientific report: "A PROGRAM FOR ALIGNING SENTENCES IN BILINGUAL CORPORA"

William A. Gale
Kenneth W. Church
AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974

ABSTRACT

Researchers in both machine translation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying parallel texts, texts such as the Canadian Hansards (parliamentary proceedings) which are available in multiple languages (French and English). This paper describes a method for aligning sentences in these parallel texts, based on a simple statistical model of character lengths. The method was developed and tested on a small trilingual sample of Swiss economic reports. A much larger sample of 90 million words of Canadian Hansards has been aligned and donated to the ACL/DCI.

1. Introduction

Researchers in both machine translation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying bilingual corpora, bodies of text such as the Canadian Hansards (parliamentary debates) which are available in multiple languages (such as French and English). The sentence alignment task is to identify correspondences between sentences in one language and sentences in the other language.
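To make the idea of aligning by character length concrete, here is a minimal sketch in Python. It is not the statistical model the paper describes, whose details are not given in this passage. It assumes a dynamic-programming search, a plain absolute difference in character counts as the cost, and a fixed penalty for unmatched sentences. The function name `align_by_length`, the parameter `skip_penalty`, and the set of allowed groupings are all hypothetical choices made for illustration.

```python
def align_by_length(src, tgt, skip_penalty=50):
    """Toy length-based sentence aligner (illustrative cost, not the paper's model).

    src, tgt: lists of sentences (strings) in the two languages.
    Returns a list of (source_sentences, target_sentences) groups in order.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i source and first j target sentences
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    # Assumed group shapes: (source sentences consumed, target sentences consumed)
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                # Assumed cost: difference in character counts, plus a
                # penalty when a sentence is left without a counterpart
                step = abs(len_src - len_tgt)
                if di == 0 or dj == 0:
                    step += skip_penalty
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the groups
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    groups.reverse()
    return groups
```

Because every step moves forward in both texts, the groups this sketch returns never cross. A raw length difference is used here only to keep the sketch short; a probabilistic score over lengths would take its place in a statistical model.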
Sentence alignment is a first step toward the more ambitious task of finding correspondences among words (see footnote 1). The input is a pair of texts such as Table 1.

1. In statistics, string matching problems are divided into two classes: alignment problems and correspondence problems. Crossing dependencies are possible in the latter, but not in the former.

Table 1: Input to Alignment Program

English

According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above-average growth rates. The higher turnover was largely due to an increase in the sales volume. Employment and .