
Contextual Dependencies in Unsupervised Word Segmentation∗

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson
Department of Cognitive and Linguistic Sciences
Brown University, Providence, RI 02912
{Sharon_Goldwater, Tom_Griffiths, Mark_Johnson}@brown.edu

Abstract

Developing better methods for segmenting continuous text into words is important for improving the processing of Asian languages, and may shed light on how humans learn to segment speech. We propose two new Bayesian word segmentation methods that assume unigram and bigram models of word dependencies, respectively. The bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation. We also show that previous probabilistic models rely crucially on suboptimal search procedures.

1 Introduction

Word segmentation, i.e., discovering word boundaries in continuous text or speech, is of interest for both practical and theoretical reasons. It is the first step in processing orthographies without explicit word boundaries, such as Chinese. It is also one of the key problems that human language learners must solve as they are learning language.
Many previous methods for unsupervised word segmentation are based on the observation that transitions between units (characters, phonemes, or syllables) within words are generally more predictable than transitions across word boundaries. Statistics that have been proposed for measuring these differences include successor frequency (Harris, 1954), transitional probabilities (Saffran et al., 1996), mutual information (Sun et al., 1998), accessor variety (Feng et al., 2004), and boundary entropy (Cohen and Adams, 2001). While methods based on local statistics are quite successful, here we focus on approaches based on explicit probabilistic models. Formulating an explicit ...

∗This work was partially supported by the following grants: NIH 1R01-MH60922, NIH RO1-DC000314, NSF IGERT-DGE-9870676, and the DARPA CALO project.
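To make the local-statistics idea concrete, the following is a minimal sketch (not the paper's Bayesian model, and not any cited author's exact procedure) of boundary detection via transitional probabilities in the spirit of Saffran et al. (1996): estimate P(next unit | current unit) from unsegmented text, then posit a word boundary wherever that probability falls to a local minimum. All function names here are illustrative.

```python
# Illustrative sketch: segmentation by character transitional probabilities.
# A boundary is hypothesized at each local minimum of P(c[i+1] | c[i]),
# reflecting the observation that within-word transitions are more
# predictable than cross-boundary ones.
from collections import Counter

def transition_probs(corpus):
    """Estimate P(b | a) for adjacent characters from unsegmented text."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    first_counts = Counter(corpus[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def segment(text, probs):
    """Split text at local minima of the transition probability sequence."""
    tp = [probs.get(pair, 0.0) for pair in zip(text, text[1:])]
    words, start = [], 0
    for i in range(1, len(tp) - 1):
        if tp[i] < tp[i - 1] and tp[i] < tp[i + 1]:  # local minimum: boundary
            words.append(text[start:i + 1])
            start = i + 1
    words.append(text[start:])
    return words
```

Such purely local heuristics decide each boundary independently of the rest of the segmentation, which is exactly the limitation that motivates the explicit probabilistic models discussed next.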