Scientific report: "Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation"

We explore how to improve machine translation systems by adding more translation data in situations where we already have substantial resources. The main challenge is how to buck the trend of diminishing returns that is commonly encountered. We present an active learning-style data solicitation algorithm to meet this challenge. We test it, gathering annotations via Amazon Mechanical Turk, and find that we get an order of magnitude increase in performance rates of improvement.

Michael Bloodgood, Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21211, bloodgood@
Chris Callison-Burch, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21211, ccb@

1 Introduction

Figure 1 shows the learning curves for two state-of-the-art statistical machine translation (SMT) systems for Urdu-English translation. Observe how the learning curves rise rapidly at first, but then a trend of diminishing returns occurs: put simply, the curves flatten. This paper investigates whether we can buck the trend of diminishing returns and, if so, how we can do it effectively. Active learning (AL) has been applied to SMT recently (Haffari et al., 2009; Haffari and Sarkar, 2009), but those studies were interested in starting with a tiny seed set of data, and they stopped their investigations after adding only a relatively tiny amount of data, as depicted in Figure 1.
In contrast, we are interested in applying AL when a large amount of data already exists, as is the case for many important language pairs. We develop an AL algorithm that focuses on keeping annotation costs (measured by time in seconds) low. It succeeds in doing this by only soliciting translations for parts of sentences. We show that this gets a savings in human annotation time above and beyond what the reduction in words annotated would have indicated by a factor of about three and .
