tailieunhanh - Scientific report: "Modeling Inflection and Word-Formation in SMT"

Alexander Fraser, Marion Weller, Aoife Cahill, Fabienne Cap
Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, D-70174 Stuttgart, Germany
Educational Testing Service, Princeton, NJ 08541, USA
{fraser, wellermn, cap}@ acahill@

Abstract

The current state-of-the-art in statistical machine translation (SMT) suffers from issues of sparsity and inadequate modeling power when translating into morphologically rich languages. We model both inflection and word-formation for the task of translating into German. We translate from English words to an underspecified German representation and then use linear-chain CRFs to predict the fully specified German representation. We show that improved modeling of inflection and word-formation leads to improved SMT.

1 Introduction

Phrase-based statistical machine translation (SMT) suffers from problems of data sparsity with respect to inflection and word-formation, which are particularly strong when translating to a morphologically rich target language such as German. We address the problem of inflection by first translating to a stem-based representation and then using a second process to inflect these stems. We study several models for doing this, including strongly lexicalized models, unlexicalized models using linguistic features, and models combining the strengths of both of these approaches.
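
The two-step treatment of inflection described above can be illustrated with a toy sketch. This is not the authors' system: a hand-written agreement rule stands in for the learned feature-prediction model (the abstract names linear-chain CRFs for that step), the grammatical case is supplied by the caller rather than predicted, number is fixed to singular, and the tables `NOUN_GENDER` and `SURFACE` as well as all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Features = Tuple[str, str, str]  # (case, gender, number)

# Hypothetical lexicon: noun stem -> grammatical gender.
NOUN_GENDER: Dict[str, str] = {"Haus": "neut"}

# Hypothetical generation table: (stem, features) -> inflected surface form.
SURFACE: Dict[Tuple[str, Features], str] = {
    ("d", ("nom", "neut", "sg")): "das",
    ("blau", ("nom", "neut", "sg")): "blaue",
    ("Haus", ("nom", "neut", "sg")): "Haus",
    ("d", ("dat", "neut", "sg")): "dem",
    ("blau", ("dat", "neut", "sg")): "blauen",
    ("Haus", ("dat", "neut", "sg")): "Haus",
}


def predict_features(stems: List[str], case: str) -> List[Features]:
    """Step 1 stand-in: copy the head noun's gender to every word in the phrase.

    Assumption: a real system would predict these features with a trained
    sequence model; here the case is given and the number is always singular.
    """
    gender = next(NOUN_GENDER[s] for s in stems if s in NOUN_GENDER)
    return [(case, gender, "sg") for _ in stems]


def generate(stems: List[str], feats: List[Features]) -> List[str]:
    """Step 2: look up the surface form for each (stem, features) pair."""
    return [SURFACE[(stem, feat)] for stem, feat in zip(stems, feats)]


def inflect(stems: List[str], case: str) -> str:
    """Run both steps on a stem sequence and join the resulting words."""
    return " ".join(generate(stems, predict_features(stems, case)))


if __name__ == "__main__":
    print(inflect(["d", "blau", "Haus"], "nom"))
    print(inflect(["d", "blau", "Haus"], "dat"))
```

The point of the split is that the translation step only has to choose among stems, while agreement in case, gender and number is resolved afterwards over the whole phrase, which is where the sparsity reduction is expected to come from.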
We address the problem of word-formation for compounds in German by translating from English into German word parts and then determining whether to merge these parts to form compounds. We make the following new contributions: (i) we introduce the first SMT system combining inflection prediction with synthesis of portmanteaus and compounds. (ii) For inflection, we compare the mostly unlexicalized prediction of linguistic features (with a subsequent surface form generation step) versus the direct prediction of surface forms, and show that both approaches have complementary strengths. (iii)
