Scientific report: "Active Learning-Based Elicitation for Semi-Supervised Word Alignment"
Vamshi Ambati, Stephan Vogel and Jaime Carbonell
vamshi vogel jgc @
Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA

Abstract

Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. Motivated by standard active learning query sampling frameworks such as uncertainty-, margin-, and query-by-committee sampling, we propose multiple query strategies for the alignment link selection task. Our experiments show that by actively selecting uncertain and informative links, we reduce the overall manual effort involved in eliciting alignment link data for training a semi-supervised word aligner.

1 Introduction

Corpus-based approaches to machine translation have become predominant, with phrase-based statistical machine translation (PB-SMT) (Koehn et al., 2003) being the most actively progressing area. The success of statistical approaches to MT can be attributed to the IBM models (Brown et al., 1993), which characterize word-level alignments in parallel corpora. Parameters of these alignment models are learnt in an unsupervised manner using the EM algorithm over sentence-level aligned parallel corpora.
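The unsupervised EM estimation described above can be illustrated with IBM Model 1, the simplest of the IBM models. The following is a minimal sketch only, not the authors' setup or the GIZA++ implementation: the function names `train_ibm_model1` and `align`, the `NULL` source token convention, and the three-sentence toy corpus are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm_model1(bitext, iterations=10):
    """Estimate lexical translation probabilities t(f | e) with EM.

    bitext: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict-like table keyed by (target_word, source_word).
    """
    tgt_vocab = {f for _, tgt in bitext for f in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # uniform start (hypothetical choice)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e generating anything

        # E-step: distribute each target word over the source words
        # of its sentence (plus NULL) in proportion to the current t.
        for src, tgt in bitext:
            src_n = ["NULL"] + list(src)
            for f in tgt:
                z = sum(t[(f, e)] for e in src_n)
                for e in src_n:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p

        # M-step: renormalise expected counts into probabilities.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def align(t, src, tgt):
    """Most probable source position for each target word (Model 1 Viterbi).

    Returns (source_index, target_index) links; NULL links are dropped.
    """
    src_n = ["NULL"] + list(src)
    links = []
    for j, f in enumerate(tgt):
        best = max(range(len(src_n)), key=lambda i: t[(f, src_n[i])])
        if best > 0:
            links.append((best - 1, j))
    return links


# Toy parallel corpus (invented for illustration).
toy_bitext = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
table = train_ibm_model1(toy_bitext, iterations=20)
links = align(table, ["the", "house"], ["das", "Haus"])
```

No manual alignment is used at any point: the co-occurrence of "the" with "das" in two sentences is enough for EM to favour that link. Model 1 ignores word order and fertility, which is one of the independence assumptions that higher IBM models, and tools built on them, relax only partially.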
While the ease of automatically aligning sentences at the word level with tools like GIZA++ (Och and Ney, 2003) has enabled fast development of SMT systems for various language pairs, the quality of alignment is typically quite low for language pairs such as Chinese-English and Arabic-English, which diverge from the independence assumptions made by the generative models. Increased parallel data enables better estimation of the model parameters, but a large number of language pairs still lack such resources. Two directions of research have been pursued for improving generative word alignment. The first is to relax or update the independence assumptions based on more information, usually syntactic.