Scientific report: "Bootstrapping a Stochastic Transducer for Arabic-English Transliteration Extraction"

Tarek Sherif and Grzegorz Kondrak
Department of Computing Science, University of Alberta
Edmonton, Alberta, Canada T6G 2E8
tarek kondrak @

Abstract

We propose a bootstrapping approach to training a memoryless stochastic transducer for the task of extracting transliterations from an English-Arabic bitext. The transducer learns its similarity metric from the data in the bitext, and thus can function directly on strings written in different writing scripts without any additional language knowledge. We show that this bootstrapped transducer performs as well as or better than a model designed specifically to detect Arabic-English transliterations.

1 Introduction

Transliterations are words that are converted from one writing script to another on the basis of their pronunciation, rather than being translated on the basis of their meaning. Transliterations include named entities (e.g., Jane Austen) and lexical loans (e.g., television). An algorithm that automatically detects transliterations in a bitext can be an effective tool for many tasks. Models of machine transliteration, such as those presented in Al-Onaizan and Knight (2002) or AbdulJaleel and Larkey (2003), require a large set of sample transliterations for training. If such a training set is unavailable for a particular language pair, a detection algorithm would save significant time compared with building the set manually.
Algorithms for cross-language information retrieval often encounter the problem of out-of-vocabulary words, or words not present in the algorithm's lexicon. Often, a significant proportion of these words are named entities and thus are candidates for transliteration. A transliteration detection algorithm could be used to map named entities in a query to potential transliterations in the target language text. The main challenge in transliteration detection lies in the fact that .