Fully Unsupervised Word Segmentation with BVE and MDL
Daniel Hewlett and Paul Cohen
Department of Computer Science, University of Arizona, Tucson, AZ 85721
{dhewlett,cohen}@

Abstract

Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.

1 Introduction

The goal of unsupervised word segmentation is to discover correct word boundaries in natural language corpora where explicit boundaries are absent. Often, unsupervised word segmentation algorithms rely heavily on parameterization to produce the correct segmentation for a given language.
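To make the selection step from the abstract concrete, the following is a minimal sketch that enumerates every segmentation of a very short string and keeps the one with the smallest description length. The two-part cost used here (bits to spell out the lexicon plus bits to encode the corpus as a sequence of lexicon entries) is an assumption chosen for illustration and is not necessarily the coding scheme used in the paper; the function names `all_segmentations`, `description_length`, and `select_by_mdl` are likewise hypothetical.

```python
import math
from collections import Counter
from itertools import product


def all_segmentations(text):
    """Yield every segmentation of `text` as a list of words.

    A string of n characters has n - 1 possible boundary positions,
    so there are 2 ** (n - 1) segmentations in total.
    """
    n = len(text)
    if n == 0:
        return
    for cuts in product([False, True], repeat=n - 1):
        words, start = [], 0
        for position, cut in enumerate(cuts, start=1):
            if cut:
                words.append(text[start:position])
                start = position
        words.append(text[start:])
        yield words


def description_length(words):
    """Illustrative two-part description length in bits (an assumption).

    corpus cost : each token costs -log2 of its relative frequency
    lexicon cost: each distinct word is spelled out character by
                  character, plus one end-of-word marker
    """
    counts = Counter(words)
    total = len(words)
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    alphabet = set("".join(counts))
    bits_per_char = math.log2(len(alphabet) + 1)  # +1 for the end marker
    lexicon_bits = sum((len(w) + 1) * bits_per_char for w in counts)
    return corpus_bits + lexicon_bits


def select_by_mdl(candidates):
    """Return the candidate segmentation with the smallest description length."""
    return min(candidates, key=description_length)


if __name__ == "__main__":
    best = select_by_mdl(all_segmentations("abab"))
    print(best, round(description_length(best), 2))
```

Exhaustive enumeration is only feasible for toy inputs, since the number of segmentations doubles with every added character. In practice `select_by_mdl` would be given a small set of candidates produced by a segmentation algorithm rather than the output of `all_segmentations`.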
The goal of fully unsupervised word segmentation, then, is to recover the correct boundaries for arbitrary natural language corpora without explicit human parameterization. This means that a fully unsupervised algorithm would have to set its own parameters based only on the corpus provided to it. In principle, this goal can be achieved by creating a function that measures the quality of a segmentation in a language-independent way and applying this function to all possible segmentations of the corpus to select the best one. Evidence from the word segmentation literature suggests that description length provides a good approximation to this segmentation quality function. We discuss the Minimum Description Length (MDL) principle in more detail in the next section. Unfortunately