A Succinct N-gram Language Model

Taro Watanabe    Hajime Tsukada    Hideki Isozaki
NTT Communication Science Laboratories
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan
{taro, tsukada, isozaki}@

Abstract

Efficient processing of tera-scale text data is an important research topic. This paper proposes lossless compression of N-gram language models based on LOUDS, a succinct data structure. LOUDS succinctly represents a trie with M nodes as a (2M + 1)-bit string. We compress it further for the N-gram language model structure. We also use "variable-length coding" and "block-wise compression" to compress the values associated with nodes. Experiments on three large-scale N-gram compression tasks achieved a significant compression rate without any loss.

1 Introduction

There has been an increase in available N-gram data, and a large amount of web-scale N-gram data has been successfully deployed in statistical machine translation. However, handling it requires either a machine with hundreds of gigabytes of memory or a large computer cluster. Either pruning (Stolcke, 1998; Church et al., 2007) or lossy randomizing approaches (Talbot and Brants, 2008) may result in a compact representation for the application run-time. However, the lossy approaches may reduce accuracy, and tuning is necessary. A lossless approach is obviously better than a lossy one if other conditions are the same. In addition, a lossless approach can easily be combined with pruning. Therefore, lossless representation of N-grams is a key issue even for lossy approaches.

Raj and Whittaker (2003) showed a general N-gram language model structure and introduced a lossless algorithm that compresses a sorted integer vector by recursively shifting off a certain number of bits and emitting index-value inverted vectors. However, we need a more compact representation. In this work, we propose a succinct way to represent the N-gram language model structure based on LOUDS (Jacobson, 1989; Delpratt et al., 2006). It was first …
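To make the 2M + 1 bit figure concrete, the following Python sketch builds a LOUDS bit string from a nested-dict trie and navigates it with naive rank/select operations. It illustrates the generic LOUDS encoding, not the authors' implementation; a production version would replace the linear-time rank1/select0 scans with succinct rank/select indexes.

    from collections import deque

    def build_louds(root):
        """Encode a trie (nested dicts: label -> subtrie) as a LOUDS bit string.

        Each node is visited in level order and contributes its out-degree
        in unary ('1' * degree followed by '0'); a leading "10" encodes a
        virtual super-root whose only child is the real root.  A trie with
        M nodes thus yields exactly 2M + 1 bits: M ones (one per node,
        counting the root's '1' in the "10" prefix) and M + 1 zeros.
        """
        bits, labels = [1, 0], []
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for label, child in sorted(node.items()):
                bits.append(1)          # one '1' per child edge
                labels.append(label)    # edge labels stored in level order
                queue.append(child)
            bits.append(0)              # terminate this node's degree group
        return bits, labels

    def rank1(bits, pos):
        """Number of 1s in bits[0:pos] (naive O(M) scan for the sketch)."""
        return sum(bits[:pos])

    def select0(bits, k):
        """Position of the k-th 0 (1-based), again a naive linear scan."""
        seen = 0
        for i, b in enumerate(bits):
            if b == 0:
                seen += 1
                if seen == k:
                    return i
        raise IndexError(k)

    def first_child(bits, pos):
        """Given the bit position of a node's '1', return the position of
        its first child's '1', or None if the node is a leaf.  The m-th
        '1' is node m in level order, and node m's children are listed in
        the group that starts just after the m-th '0'."""
        m = rank1(bits, pos + 1)
        start = select0(bits, m) + 1
        return start if start < len(bits) and bits[start] == 1 else None

For the toy trie {"a": {"b": {}}, "b": {}} (M = 4 nodes counting the root), build_louds returns the 9-bit string 101101000, matching 2M + 1 = 9, and first_child(bits, 0) returns position 2, the '1' of the root's first child.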
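The excerpt does not specify which variable-length code is applied to the values associated with nodes (word IDs, probabilities, and back-off weights are the usual candidates). Purely as an illustration of the general idea, and not the paper's method, here is standard variable-byte coding: 7 payload bits per byte, with the high bit flagging the final byte of each integer. Block-wise compression can then be layered on top, for example by compressing fixed-size runs of such codes and keeping per-block offsets so individual entries stay addressable; that layering is likewise our assumption.

    def vbyte_encode(n):
        """Variable-byte code: emit 7 bits per byte, least-significant
        first; set the high bit only on the final byte of the integer."""
        assert n >= 0
        out = bytearray()
        while n >= 0x80:
            out.append(n & 0x7F)   # continuation byte: high bit clear
            n >>= 7
        out.append(n | 0x80)       # final byte: high bit set
        return bytes(out)

    def vbyte_decode(data):
        """Decode a concatenation of variable-byte codes into integers."""
        values, cur, shift = [], 0, 0
        for byte in data:
            if byte & 0x80:        # final byte of the current value
                values.append(cur | ((byte & 0x7F) << shift))
                cur, shift = 0, 0
            else:
                cur |= byte << shift
                shift += 7
        return values

For example, vbyte_decode(vbyte_encode(300) + vbyte_encode(5)) returns [300, 5]; small values cost one byte while rare large values grow gracefully.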
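The one-sentence summary of Raj and Whittaker (2003) above is terse, so a sketch of our reading may help; this is a hypothetical reconstruction, not the cited paper's exact algorithm. The idea as we read it: keep the low k bits of each entry of the sorted vector verbatim; the remaining high parts form a non-decreasing sequence, so it suffices to store "inverted" (index, value) pairs at the positions where the high part changes, and the index vector, being itself sorted, can be split again recursively.

    import bisect

    def shift_split(sorted_vals, k):
        """Split each value into its low k bits (stored verbatim) and a
        non-decreasing high part, represented by (index, value) pairs at
        the positions where the high part changes."""
        low = [v & ((1 << k) - 1) for v in sorted_vals]
        idx, high = [], []
        prev = None
        for i, v in enumerate(sorted_vals):
            h = v >> k
            if h != prev:
                idx.append(i)      # first index where this high value appears
                high.append(h)
                prev = h
        return low, idx, high      # idx is sorted, so it can be split again

    def shift_lookup(low, idx, high, k, i):
        """Recover sorted_vals[i]: binary-search the change indices for the
        active high value, then splice the low bits back on."""
        j = bisect.bisect_right(idx, i) - 1
        return (high[j] << k) | low[i]

For [3, 290, 300, 1000] with k = 8, shift_split yields low = [3, 34, 44, 232], idx = [0, 1, 3], high = [0, 1, 3], and shift_lookup reconstructs every original entry, e.g. shift_lookup(low, idx, high, 8, 2) == 300.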
