Multi-Class Composite N-gram Language Model for Spoken Language Processing Using Multiple Word Clusters

Hirofumi Yamamoto
ATR SLT
2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto-fu, Japan
yama@

Yoshinori Sagisaka
GITI / ATR SLT
1-3-10 Nishi-Waseda, Shinjuku-ku, Tokyo-to, Japan
sagisaka@

Shuntaro Isogai
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo-to, Japan
isogai@

Abstract

In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid the data sparseness problem of spoken language, for which it is difficult to collect large amounts of training data. The Multi-Class Composite N-gram maintains accurate word prediction and reliability for sparse data with a compact model size, based on multiple word clusters called Multi-Classes. In a Multi-Class, the statistical connectivity at each position of the N-gram is regarded as a word attribute, and a separate word cluster is created to represent each positional attribute. Furthermore, by introducing higher-order word N-grams through the grouping of frequent word successions, Multi-Class N-grams are extended to Multi-Class Composite N-grams. In experiments, the Multi-Class Composite N-grams result in lower perplexity and a 16% lower word error rate in speech recognition, with a 40% smaller parameter size than conventional word 3-grams.

1 Introduction

Word N-grams have been widely used as statistical language models for language processing. A word N-gram gives the transition probability of the next word given the previous N - 1 words, based on a statistical analysis of a huge text corpus. Though word N-grams are more effective and flexible than rule-based grammatical constraints in many cases, their performance strongly depends on the size of the training data, since they are statistical models. In word N-grams, the accuracy of word prediction increases with the order N, but the number of word transition combinations also grows exponentially, so the estimates become unreliable when training data is sparse.
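To make the word N-gram concrete, the following is a minimal sketch of maximum-likelihood 3-gram estimation over a toy corpus. It is not the authors' implementation: real systems add smoothing and backoff, and the function and corpus here are invented for illustration.

```python
from collections import defaultdict

def train_trigram(sentences):
    """Estimate P(w3 | w1, w2) by maximum likelihood from raw counts.

    A minimal word 3-gram sketch; unseen histories get probability 0,
    which is exactly the sparseness problem the paper addresses.
    """
    tri = defaultdict(int)  # counts of (w1, w2, w3)
    bi = defaultdict(int)   # counts of the history (w1, w2)
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            tri[(w1, w2, w3)] += 1
            bi[(w1, w2)] += 1
    return lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

p = train_trigram(["the model predicts the next word"])
print(p("the", "model", "predicts"))  # 1.0 in this toy corpus
```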
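The Multi-Class idea described in the abstract gives each word two class labels: one summarizing its connectivity as a history (which words tend to follow it) and one summarizing its connectivity as a prediction target (which words tend to precede it). A hedged sketch of how a Multi-Class 2-gram probability would then factor is shown below, assuming the class maps and component distributions have already been estimated; the names c_from, c_to, p_class, and p_word are illustrative, not from the paper.

```python
def multiclass_bigram_prob(w_prev, w, c_from, c_to, p_class, p_word):
    """Sketch of P(w | w_prev) under a Multi-Class 2-gram.

    c_from[w_prev]: cluster of w_prev by its following-word connectivity
                    (its attribute as a history).
    c_to[w]:        cluster of w by its preceding-word connectivity
                    (its attribute as a prediction target).
    p_class[(cf, ct)]: class transition probability P(ct | cf).
    p_word[(w, ct)]:   word probability within its target class, P(w | ct).
    """
    cf, ct = c_from[w_prev], c_to[w]
    return p_class.get((cf, ct), 0.0) * p_word.get((w, ct), 0.0)
```

Because probabilities are shared across all words in a class pair, far fewer parameters must be estimated than for raw word-to-word transitions, which is what gives the model its reliability on sparse data.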
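The "grouping of frequent word successions" can be pictured as repeatedly replacing frequent adjacent word pairs with single composite tokens, so that an ordinary N-gram over the rewritten vocabulary implicitly spans more original words. The sketch below illustrates that mechanism only; the threshold, number of rounds, and token format are assumptions, not details taken from the paper.

```python
from collections import Counter

def group_successions(corpus, min_count=3, rounds=2):
    """Merge frequent adjacent word pairs into composite tokens.

    corpus is a list of token lists.  After `rounds` merge passes a
    single token can cover up to 2**rounds original words, so a 2-gram
    over the rewritten corpus acts like a higher-order word N-gram.
    """
    for _ in range(rounds):
        pairs = Counter(p for s in corpus for p in zip(s, s[1:]))
        frequent = {p for p, c in pairs.items() if c >= min_count}
        new_corpus = []
        for s in corpus:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and (s[i], s[i + 1]) in frequent:
                    out.append(s[i] + "_" + s[i + 1])  # composite token
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return corpus
```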
