Using Conditional Random Fields to Predict Pitch Accents in Conversational Speech

Michelle L. Gregory
Linguistics Department, University at Buffalo, Buffalo, NY 14260
mgregory@

Yasemin Altun
Department of Computer Science, Brown University, Providence, RI 02912
altun@

Abstract

The detection of prosodic characteristics is an important aspect of both speech synthesis and speech recognition. Correct placement of pitch accents contributes to more natural-sounding speech, while automatic detection of accents can contribute to better word-level recognition and better textual understanding. In this paper we investigate probabilistic, contextual, and phonological factors that influence pitch accent placement in natural, conversational speech in a sequence labeling setting. We introduce Conditional Random Fields (CRFs) to the pitch accent prediction task in order to incorporate these factors efficiently in a sequence model. We demonstrate the usefulness and the incremental effect of these factors in a sequence model by performing experiments on hand-labeled data from the Switchboard Corpus. Our model outperforms the baseline and previous models of pitch accent prediction on the Switchboard Corpus.

1 Introduction

The suprasegmental features of speech relay critical information in conversation.
Yet one of the major roadblocks to natural-sounding speech synthesis has been the identification and implementation of prosodic characteristics. The difficulty with this task lies in the fact that prosodic cues are never absolute; they are relative to individual speakers, gender, dialect, discourse context, local context, phonological environment, and many other factors. This is especially true of pitch accent, the acoustic cues that make one word more prominent than others in an utterance. For example, a word with a fundamental frequency (f0) of 120 Hz would likely be quite prominent in a male speaker, but not for a typical female speaker. Likewise, the accent on the utterance "Jon's leaving." is .