Scientific report: "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling"
Daichi Mochihashi, Takeshi Yamada, Naonori Ueda
NTT Communication Science Laboratories
Hikaridai 2-4, Keihanna Science City, Kyoto, Japan
daichi yamada ueda @

Abstract

In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference. Our model is a nested hierarchical Pitman-Yor language model, in which a Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previously reported results on both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation. Our model can also be regarded as a way to construct an accurate word n-gram language model directly from the characters of an arbitrary language, without any word indications.

1 Introduction

"Word" is no trivial concept in many languages. Asian languages such as Chinese and Japanese have no explicit word boundaries, so word segmentation is a crucial first step when processing them. Even in Western languages, valid words are often not identical to space-separated tokens. For example, proper nouns such as "United Kingdom" or idiomatic phrases such as "with respect to" actually function as single words, and we often condense them into the virtual words "UK" and "w.r.t.".
Unsupervised word segmentation, which extracts words from text streams, is an important research area because the criteria for creating supervised training data can be arbitrary and will be suboptimal for applications that rely on segmentations. It is particularly difficult to create correct training data for speech transcripts, colloquial texts, and classics, where segmentations are often ambiguous, and it is impossible for unknown languages whose properties computational linguists might seek to uncover. From a scientific point of view it is also interesting because it can shed light on how .