Mining Parenthetical Translations from the Web by Word Alignment

Dekang Lin, Google Inc., Mountain View, CA 94043, lindek@
Shaojun Zhao, University of Rochester, Rochester, NY 14627, zhao@
Benjamin Van Durme†, University of Rochester, Rochester, NY 14627, vandurme@
Marius Pasca, Google Inc., Mountain View, CA 94043, mars@

Abstract

Documents in languages such as Chinese, Japanese and Korean sometimes annotate terms with their translations in English inside a pair of parentheses. We present a method to extract such translations from a large collection of web documents by building a partially parallel corpus and using a word alignment algorithm to identify the terms being translated. The method is able to generalize across the translations for different terms and can reliably extract translations that occurred only once in the entire web. Our experiment on Chinese web pages produced more than 26 million pairs of translations, which is over two orders of magnitude more than previous results. We show that the addition of the extracted translation pairs as training data provides a significant increase in the BLEU score for a statistical machine translation system.

1 Introduction

In natural language documents, a term (word or phrase) is sometimes followed by its translation in another language in a pair of parentheses. We call these parenthetical translations.
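To make the notion of a parenthetical translation concrete, the following is a minimal sketch of how candidate pairs might be spotted in Chinese text. It is not the extraction pipeline of this paper: the regular expression, the function name candidate_pairs and the sample sentence are all assumptions made for illustration. The sketch only finds English text in parentheses together with the Chinese text that precedes it; it does not decide which part of that preceding text is actually being translated, which is the problem the word alignment step is meant to address.

```python
import re

# Assumption: a candidate is a run of Chinese characters immediately followed
# by English text inside half-width or full-width parentheses.
PAREN = re.compile(
    r"([\u4e00-\u9fff]+)"            # Chinese text right before the parenthesis
    r"\s*[(（]"                       # opening parenthesis (either width)
    r"([A-Za-z][A-Za-z0-9 .,'&-]*)"  # English text inside the parentheses
    r"[)）]"                          # closing parenthesis (either width)
)


def candidate_pairs(text):
    """Return (preceding Chinese text, English parenthetical) candidates."""
    return [(m.group(1), m.group(2).strip()) for m in PAREN.finditer(text)]


# Hypothetical sentence written for this sketch, not taken from the paper.
sample = "我们使用线性规划(linear programming)求解这个问题。"

for zh, en in candidate_pairs(sample):
    # The Chinese side is the whole preceding run, which is usually longer
    # than the translated term itself; the term boundary is still unknown.
    print(zh, "->", en)
```

In the sample, the Chinese side of the candidate contains more characters than the term that "linear programming" translates, so a further step is needed to locate the start of the term.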
The following examples are from Chinese web pages (we added underlines to indicate what is being translated):

(1) [Chinese text not recoverable] (Brookings Institution) [...] (Jeremy Shapiro) [...]
(2) [Chinese text not recoverable] (indigestion) [...] (gastritis) [...]
(3) [Chinese text not recoverable] (not going to fly) [...]
(4) [Chinese text not recoverable] (linear programming) [...]

† Contributions made during an internship at Google.

The parenthetically translated terms are typically new words, technical terminologies, idioms, products, titles of movies, books, songs, and names of persons, organizations, locations, etc. Commonly an author might use such a parenthetical when a given