Scientific report: "Corpus Effects on the Evaluation of Automated Transliteration Systems"

Sarvnaz Karimi, Andrew Turpin, Falk Scholer
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Melbourne 3001, Australia
{sarvnaz,aht,fscholer}@

Abstract

Most current machine transliteration systems employ a corpus of known source-target word pairs to train their system, and typically evaluate their systems on a similar corpus. In this paper we explore the performance of transliteration systems on corpora that are varied in a controlled way. In particular, we control the number and prior language knowledge of the human transliterators used to construct the corpora, and the origin of the source words that make up the corpora. We find that the word accuracy of automated transliteration systems can vary by up to 30% (in absolute terms) depending on the corpus on which they are run. We conclude that at least four human transliterators should be used to construct corpora for evaluating automated transliteration systems, and that although absolute word accuracy metrics may not translate across corpora, the relative rankings of system performance remain stable across differing corpora.

1 Introduction

Machine transliteration is the process of transforming a word written in a source language into a word in a target language without the aid of a bilingual dictionary. Word pronunciation is preserved as far as possible, but the script used to render the target word is different from that of the source language.
Transliteration is applied to proper nouns and out-of-vocabulary terms as part of machine translation and cross-lingual information retrieval (CLIR) (AbdulJaleel and Larkey, 2003; Pirkola et al., 2006).

Several transliteration methods are reported in the literature for a variety of languages, with their performance being evaluated on multilingual corpora. Source-target pairs are either extracted from bilingual documents or dictionaries (AbdulJaleel and Larkey, 2003; Bilac and
