Scientific report: "Inducing Probabilistic Syllable Classes Using Multivariate Clustering"

An approach to automatic detection of syllable structure is presented. We demonstrate a novel application of EM-based clustering to multivariate data, exemplified by the induction of 3- and 5-dimensional probabilistic syllable classes. The qualitative evaluation shows that the method yields phonologically meaningful syllable classes. We then propose a novel approach to grapheme-to-phoneme conversion and show that syllable structure represents valuable information for pronunciation systems.

Karin Müller, Bernd Möbius, and Detlef Prescher
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Germany
{karin.mueller|bernd.moebius|detlef.prescher}@ims.uni-stuttgart.de

1 Introduction

In this paper we present an approach to unsupervised learning and automatic detection of syllable structure. The primary goal of the paper is to demonstrate the application of EM-based clustering to multivariate data. The suitability of this approach is exemplified by the induction of 3- and 5-dimensional probabilistic syllable classes. A secondary goal is to outline a novel approach to the conversion of graphemes to phonemes (g2p), which uses a context-free grammar (cfg) to generate all sequences of phonemes corresponding to a given orthographic input word and then ranks the hypotheses according to the probabilistic information coded in the syllable classes. Our approach builds on two resources.
The first resource is a cfg for g2p conversion that was constructed manually by a linguistic expert (Müller, 2000). The grammar describes how words are composed of syllables and how syllables consist of parts that are conventionally called onset, nucleus, and coda, which in turn are composed of phonemes and corresponding graphemes. The second resource consists of a multivariate clustering algorithm that is used to reveal syllable structure hidden in unannotated training data. In a first step we collect .
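
To make the idea of EM-based clustering of multivariate data more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a latent class model in which each syllable is a tuple of categorical values (here assumed to be onset, nucleus, and coda for the 3-dimensional case) and the dimensions are conditionally independent given the class, so that p(y) = sum over c of p(c) * p(y_1|c) * ... * p(y_d|c). The function names, the smoothing constant EPS, and the toy syllables are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

EPS = 1e-6  # assumed smoothing constant; keeps every probability strictly positive


def posterior(row, prior, emit):
    """Return p(c | row) for every class c (row must use values seen in training)."""
    joint = []
    for c, p_c in enumerate(prior):
        p = p_c
        for d, value in enumerate(row):
            p *= emit[c][d][value]  # p(row[d] | c), dimensions assumed independent
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]


def em_cluster(data, n_classes, n_iters=30, seed=0):
    """Fit class priors p(c) and per-dimension distributions p(y_d | c) with EM."""
    rng = random.Random(seed)
    dims = len(data[0])
    vocab = [sorted({row[d] for row in data}) for d in range(dims)]

    # Random (but normalised) starting point for the parameters.
    prior = [1.0 / n_classes] * n_classes
    emit = []
    for _ in range(n_classes):
        per_dim = []
        for d in range(dims):
            weights = [rng.random() + 0.1 for _ in vocab[d]]
            norm = sum(weights)
            per_dim.append({v: w / norm for v, w in zip(vocab[d], weights)})
        emit.append(per_dim)

    for _ in range(n_iters):
        # E-step: distribute each observed tuple over the classes.
        class_count = [0.0] * n_classes
        value_count = [[defaultdict(float) for _ in range(dims)]
                       for _ in range(n_classes)]
        for row in data:
            for c, post in enumerate(posterior(row, prior, emit)):
                class_count[c] += post
                for d, value in enumerate(row):
                    value_count[c][d][value] += post

        # M-step: re-estimate parameters from the expected counts (lightly smoothed).
        prior = [(n + EPS) / (len(data) + EPS * n_classes) for n in class_count]
        for c in range(n_classes):
            for d in range(dims):
                denom = class_count[c] + EPS * len(vocab[d])
                emit[c][d] = {v: (value_count[c][d][v] + EPS) / denom
                              for v in vocab[d]}
    return prior, emit


# Invented toy input: (onset, nucleus, coda) tuples, not taken from the paper.
TOY_SYLLABLES = [
    ("t", "a", "n"), ("k", "a", "n"), ("t", "a", "m"), ("k", "a", "m"),
    ("", "i", ""), ("", "u", ""), ("", "i", ""), ("", "u", ""),
]

prior, emit = em_cluster(TOY_SYLLABLES, n_classes=2)
for syllable in [("t", "a", "n"), ("", "i", "")]:
    print(syllable, [round(p, 3) for p in posterior(syllable, prior, emit)])
```

Under these assumptions, a candidate syllable can be scored by summing the class-weighted products computed in posterior, which is one plausible way to rank competing phoneme sequences. The result of EM depends on the random starting point, so different seeds may yield different classes.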