Scientific report: "A Portable Algorithm for Mapping Bitext Correspondence"

I. Dan Melamed
Dept. of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104

Abstract

The first step in most empirical work in multilingual NLP is to construct maps of the correspondence between texts and their translations (bitext maps). The Smooth Injective Map Recognizer (SIMR) algorithm presented here is a generic pattern recognition algorithm that is particularly well-suited to mapping bitext correspondence. SIMR is faster and significantly more accurate than other algorithms in the literature. The algorithm is robust enough to use on noisy texts, such as those resulting from OCR input, and on translations that are not very literal. SIMR encapsulates its language-specific heuristics, so that it can be ported to any language pair with a minimal effort.

1 Introduction

Texts that are available in two languages (bitexts) are immensely valuable for many natural language processing applications¹. Bitexts are the raw material from which translation models are built. In addition to their use in machine translation (Sato & Nagao, 1990; Brown et al., 1993; Melamed, 1997), translation models can be applied to machine-assisted translation (Sato, 1992; Foster et al., 1996), cross-lingual information retrieval (SIGIR, 1996), and gisting of World Wide Web pages (Resnik, 1997).
Bitexts also play a role in less automated applications such as concordancing for bilingual lexicography (Catizone et al., 1993; Gale & Church, 1991b), computer-assisted language learning, and tools for translators (Macklovitch, 1995; Melamed, 1996b). However, bitexts are of little use without an automatic method for constructing bitext maps. Bitext maps identify corresponding text units between the two halves of a bitext. The ideal bitext mapping algorithm should be fast and accurate, use little memory, and degrade gracefully when ...

¹ Multitexts in more than two languages are even more valuable, but they are much more rare.
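
To make the notion of a bitext map concrete, the following is a minimal sketch that assumes a map can be stored as a list of position pairs, one position in each half of the bitext. The `Point` type, the function `is_injective`, and the toy offsets are hypothetical names and values chosen for illustration; they are not taken from SIMR itself. The check reflects only what the word "injective" in the algorithm's name suggests: no position in either half takes part in more than one correspondence.

```python
from typing import List, Tuple

# Hypothetical representation (an assumption, not SIMR's own data structure):
# each point pairs a position in one half of the bitext with the
# corresponding position in the other half.
Point = Tuple[int, int]


def is_injective(points: List[Point]) -> bool:
    """Return True if no position in either half appears in two points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return len(set(xs)) == len(xs) and len(set(ys)) == len(ys)


# Toy example with made-up offsets.
toy_map = [(0, 0), (12, 15), (30, 33)]
print(is_injective(toy_map))               # three distinct pairs
print(is_injective(toy_map + [(12, 40)]))  # position 12 is reused
```

Storing the map as plain pairs keeps the sketch independent of the text unit chosen (characters, words, or sentences); a real mapping algorithm would also have to decide which candidate points to keep.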
