Scientific report: "AUTOMATIC ALIGNMENT IN PARALLEL CORPORA"

AUTOMATIC ALIGNMENT IN PARALLEL CORPORA

Harris Papageorgiou, Lambros Cranias, Stelios Piperidis
Institute for Language and Speech Processing
22, Margari Street, 115 25 Athens, Greece
Stelios.Piperidis@eurokom.ie

ABSTRACT

This paper addresses the alignment issue in the framework of exploitation of large bi-/multilingual corpora for translation purposes. A generic alignment scheme is proposed that can meet the varying requirements of different applications. Depending on the level at which alignment is sought, appropriate surface linguistic information is invoked, coupled with information about possible unit delimiters. Each text unit (sentence, clause or phrase) is represented by the sum of its content tags. The results are then fed into a dynamic programming framework that computes the optimum alignment of units. The proposed scheme has been tested at sentence level on parallel corpora of the CELEX database. The success rate exceeded 99%. The next steps of the work concern the testing of the scheme's efficiency at lower levels, endowed with the necessary bilingual information about potential delimiters.

INTRODUCTION

Parallel linguistically meaningful text units are indispensable in a number of NLP and lexicographic applications and, recently, in so-called Example-Based Machine Translation (EBMT).
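The scheme summarized in the abstract (units represented by the sum of their content tags, then aligned by dynamic programming) can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not give their distance function or the admissible alignment patterns, so the `BEADS` table, its penalties, and the `bead_cost` function below are assumptions chosen only for illustration. Each text unit is assumed to be already reduced to an integer count of content tags.

```python
from typing import List, Tuple

# Hypothetical alignment patterns (source units, target units) and penalties.
# The paper's actual patterns and weights are not stated in this excerpt.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}

Span = Tuple[int, int]  # half-open index range [start, end)


def bead_cost(src_sum: int, tgt_sum: int, penalty: float) -> float:
    """Assumed cost: difference in content-tag counts plus a pattern penalty."""
    return abs(src_sum - tgt_sum) + penalty


def align(src: List[int], tgt: List[int]) -> List[Tuple[Span, Span]]:
    """Return the minimum-cost alignment as (source span, target span) pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by one bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + bead_cost(sum(src[i:ni]), sum(tgt[j:nj]), penalty)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Backward pass: follow the stored predecessors from the end to the start.
    beads: List[Tuple[Span, Span]] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    beads.reverse()
    return beads
```

For example, with source counts `[5, 3, 4, 6]` and target counts `[5, 7, 6]`, the sketch pairs the second and third source units with the second target unit, because their combined count (7) matches it exactly. A real system would replace `bead_cost` with a distance estimated from the corpus.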
As regards EBMT, a large number of bi-/multilingual translation examples are stored in a database, and input expressions are rendered in the target language by retrieving from the database the example that is most similar to the input. A task of crucial importance in this framework is the establishment of correspondences between units of multilingual texts at sentence, phrase or even word level.

The adopted criteria for ascertaining the adequacy of alignment methods are stated as follows: an alignment scheme must cope with the embedded extra-linguistic data (tables, anchor points, SGML markers, etc.) and their possible ...

1 This research was supported by the LRE I TRANSLEARN project of the European Union.