Scientific report: "Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models"

Jason Naradowsky
Department of Computer Science, University of Massachusetts Amherst, Amherst, MA 01003
narad@

Kristina Toutanova
Microsoft Research, Redmond, WA 98502
kristout@

Abstract

This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting, it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.

1 Introduction

An enduring problem in statistical machine translation is sparsity. The word alignment models of modern MT systems attempt to capture p(e_i | f_j), the probability that token e_i is a translation of f_j. Underlying these models is the assumption that the word-based tokenization of each sentence is, if not optimal, at least appropriate for specifying a conceptual mapping between the two languages.
However, when translating between unrelated languages (a common task), disparate morphological systems can place an asymmetric conceptual burden on words, making the lexicon of one language much more coarse. This intensifies the problem of sparsity, as the large number of word forms created through morphologically productive processes hinders attempts to find concise mappings between concepts. For instance, Bulgarian adjectives may contain markings for gender ...

Footnote: This research was conducted during the author's internship at Microsoft Research.
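
To make the word-level translation probability from the introduction concrete, here is a minimal sketch of how a word-based aligner can estimate p(e | f) from sentence pairs. It uses IBM Model 1 trained with EM, a standard word-alignment baseline, and is not the semi-Markov model this paper proposes. The function name `train_ibm1`, the toy English/German corpus, and the omission of a NULL source word are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t[(e, f)] ~ p(e | f) with IBM Model 1 EM.

    `pairs` is a list of (e_tokens, f_tokens) sentence pairs.
    Simplifying assumption: no NULL word on the f side.
    """
    e_vocab = {e for es, _ in pairs for e in es}
    # Uniform initialisation: every e is equally likely given any f.
    t = defaultdict(lambda: 1.0 / len(e_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (e, f) links
        total = defaultdict(float)  # expected number of links into f
        for es, fs in pairs:
            for e in es:
                # E-step: split one unit of mass for e across the f tokens
                # of the same sentence, in proportion to the current t.
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: renormalise so that t(. | f) sums to 1 for each f.
        t = defaultdict(
            float, {(e, f): c / total[f] for (e, f), c in count.items()}
        )
    return t


if __name__ == "__main__":
    # Hypothetical toy corpus, word-tokenised on whitespace.
    toy_pairs = [
        ("the house".split(), "das Haus".split()),
        ("the book".split(), "das Buch".split()),
        ("a book".split(), "ein Buch".split()),
    ]
    table = train_ibm1(toy_pairs)
    for (e, f), p in sorted(table.items()):
        print(f"p({e} | {f}) = {p:.3f}")
```

Because each distinct word form gets its own row in this table, a morphologically rich language multiplies the number of parameters while dividing the evidence for each one, which is the sparsity problem the introduction describes.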
