Scientific report: "Empirical Methods for Compound Splitting"

Philipp Koehn
Information Sciences Institute, Department of Computer Science, University of Southern California
koehn@

Kevin Knight
Information Sciences Institute, Department of Computer Science, University of Southern California
knight@

Abstract

Compounded words are a challenge for NLP applications such as machine translation (MT). We introduce methods to learn splitting rules from monolingual and parallel corpora. We evaluate them against a gold standard and measure their impact on the performance of statistical MT systems. We report splitting accuracy and MT performance gains in BLEU on a German-English noun phrase translation task.

Figure 1: Splitting options for the German word Aktionsplan

1 Introduction

Compounding of words is common in a number of languages (German, Dutch, Finnish, Greek, etc.). Since words may be joined freely, this vastly increases the vocabulary size, leading to sparse data problems. This poses challenges for a number of NLP applications such as machine translation, speech recognition, text classification, information extraction, or information retrieval.

For machine translation, splitting an unknown compound into its parts enables the compound to be translated by translating its parts. Take the German word Aktionsplan (see Figure 1), which was created by joining the words Aktion and Plan. Breaking up this compound would assist the translation into English as action plan.

Compound splitting is a well-defined computational linguistics task.
One way to define the goal of compound splitting is to break up foreign words so that a one-to-one correspondence to English can be established. Note that we are looking for a one-to-one correspondence to English content words: say the preferred translation of Aktionsplan is plan for action. The lack of a correspondence for the English word "for" does not detract from the definition of the task: we would still like to break up the German compound.
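
To make the idea of splitting options concrete, the following is a minimal sketch that enumerates the ways a word can be broken into known parts. It is an illustration under stated assumptions, not the method evaluated in the paper: the function name `splitting_options`, the toy vocabulary, the minimum part length of three letters, and the optional linking elements "s" and "es" between parts are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Set


def splitting_options(
    word: str,
    vocab: Set[str],
    fillers: Sequence[str] = ("", "s", "es"),  # assumed linking elements
    min_len: int = 3,                          # assumed minimum part length
) -> List[List[str]]:
    """Enumerate splits of `word` into parts that all occur in `vocab`.

    A linking element from `fillers` may be dropped between two parts.
    The unsplit word is returned as an option if it is itself in `vocab`.
    """
    results: List[List[str]] = []

    def recurse(rest: str, parts: List[str]) -> None:
        # The whole remainder is a known word: one complete option.
        if rest in vocab:
            results.append(parts + [rest])
        # Try every split point that leaves at least min_len letters on each side.
        for i in range(min_len, len(rest) - min_len + 1):
            head, tail = rest[:i], rest[i:]
            if head not in vocab:
                continue
            for filler in fillers:
                # Skip the filler (possibly empty) and split the remainder further.
                if tail.startswith(filler) and len(tail) - len(filler) >= min_len:
                    recurse(tail[len(filler):], parts + [head])

    recurse(word.lower(), [])
    return results


# Toy vocabulary, hypothetical and chosen only for this illustration.
toy_vocab = {"aktion", "plan", "akt", "ion"}
options = splitting_options("Aktionsplan", toy_vocab)
for option in options:
    print(" + ".join(option))
```

With this toy vocabulary the sketch finds the split into Aktion and Plan as well as a finer split into Akt, Ion, and Plan. Enumeration alone does not say which option is preferable; choosing among the candidates is the part that requires evidence from corpora.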
