Scientific report: "Bitext Correspondences through Rich Mark-up"

Raquel Martinez
Departamento de Sis. Informáticos y Programación
Facultad de Matemáticas
Universidad Complutense de Madrid

Joseba Abaitua
Facultad de Filosofía y Letras
Universidad de Deusto, Bilbao
e-mail abaitua@

Arantza Casillas
Departamento de Automática
Universidad de Alcalá de Henares

Abstract

Rich mark-up can considerably benefit the process of establishing bitext correspondences, that is, the task of providing correct identification and alignment methods for text segments that are translation equivalences of each other in a parallel corpus. We present a sentence alignment algorithm that, by taking advantage of previously annotated texts, obtains accuracy rates close to 100%. The algorithm evaluates the similarity of the linguistic and extra-linguistic mark-up in both sides of a bitext. Given that annotations are neutral with respect to typological, grammatical and orthographical differences between languages, rich mark-up becomes an optimal foundation to support bitext correspondences. The main originality of this approach is that it makes maximal use of annotations, which is a very sensible and efficient method for the exploitation of parallel corpora when annotations exist.

1 Introduction

Adequate encoding schemes applied to large bodies of text in electronic form have been a main achievement in the field of humanities computing.
Research in computational linguistics, which since the late 1980s has resorted to methodologies involving statistics and probabilities in large corpora, has however largely neglected the existence and provision of extra information from such encoding schemes. In this paper we present an approach to sentence alignment that crucially relies on previously introduced annotations in a parallel corpus. Following (Harris 88), corpora containing bilingual texts have been called bitexts (Melamed 97; Martinez et al. 97). The utility of ...
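
The excerpt states only that the algorithm compares the mark-up on both sides of a bitext; it gives no details. To make the idea concrete, here is a minimal sketch under our own assumptions, not the authors' method. Each sentence is reduced to the list of tags it carries. The tag names (`title`, `num`, `date`, `rs`), the Dice coefficient used as the similarity measure, and the `skip_penalty` value are all hypothetical choices made for illustration.

```python
from collections import Counter


def tag_similarity(tags_a, tags_b):
    """Dice coefficient over two multisets of mark-up tags (0.0 to 1.0)."""
    if not tags_a and not tags_b:
        return 1.0  # assumption: two unannotated sentences count as compatible
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    shared = sum((count_a & count_b).values())  # multiset intersection size
    return 2.0 * shared / (len(tags_a) + len(tags_b))


def align(source, target, skip_penalty=0.3):
    """Monotonic 1:1 alignment by dynamic programming.

    source, target: lists of tag lists, one per sentence.
    Returns (source_index, target_index) pairs for the matched sentences.
    """
    n, m = len(source), len(target)
    # best[i][j] is the best score for aligning source[:i] with target[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                sim = tag_similarity(source[i - 1], target[j - 1])
                options.append((best[i - 1][j - 1] + sim, "match"))
            if i > 0:
                options.append((best[i - 1][j] - skip_penalty, "skip_source"))
            if j > 0:
                options.append((best[i][j - 1] - skip_penalty, "skip_target"))
            # On ties the first option wins, so a match is preferred to a skip
            best[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Walk the back-pointers from the end to recover the matched pairs
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_source":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Hypothetical bitext: the target side has one extra sentence (index 2)
source = [["title"], ["num", "date"], ["rs", "rs", "num"]]
target = [["title"], ["date", "num"], ["extra"], ["num", "rs", "rs"]]
print(align(source, target))  # prints [(0, 0), (1, 1), (2, 3)]
```

Because only tag inventories are compared, the sketch never looks at the words themselves, which mirrors the abstract's point that annotations are neutral with respect to differences between languages. A realistic system would also need to handle 1:2 and 2:1 alignments, which this sketch omits.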
