tailieunhanh - Scientific report: "N-Best Rescoring Based on Pitch-accent Patterns"
N-Best Rescoring Based on Pitch-accent Patterns

Je Hun Jeon1, Wen Wang2, Yang Liu1
1Department of Computer Science, The University of Texas at Dallas, USA
2Speech Technology and Research Laboratory, SRI International, USA
jhjeon yangl @ wwang@

Abstract

In this paper, we adopt an n-best rescoring scheme using pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, thus allowing us to develop it independently. N-best hypotheses from recognizers are rescored by additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. To test the robustness of our algorithm, we use two different data sets and recognition setups: the first one is English radio news data that has pitch-accent labels, but the recognizer is trained from a small amount of data and has a high error rate; the second one is English broadcast news data using a state-of-the-art SRI recognizer. Our experimental results demonstrate that our approach is able to reduce word error rate by about 3% relative. This gain is consistent across the two different tests, showing promising future directions of incorporating prosodic information to improve speech recognition.
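The rescoring idea in the abstract can be illustrated with a minimal sketch. This excerpt does not give the paper's actual score combination, so everything here is an assumption made for illustration: the `Hypothesis` fields, the linear interpolation with a single `weight`, and the helper `relative_wer_reduction` are hypothetical names, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    words: List[str]
    asr_score: float     # assumed: recognizer log score (acoustic + language model)
    accent_score: float  # assumed: agreement of acoustic and lexical pitch-accent patterns


def rescore(nbest: List[Hypothesis], weight: float) -> Hypothesis:
    """Return the hypothesis with the highest combined score.

    The combination is a plain weighted sum, chosen only for illustration.
    """
    if not nbest:
        raise ValueError("n-best list is empty")
    return max(nbest, key=lambda h: h.asr_score + weight * h.accent_score)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in word error rate, as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Toy n-best list with made-up scores.
nbest = [
    Hypothesis(["the", "record"], asr_score=-10.0, accent_score=0.2),
    Hypothesis(["the", "record", "shows"], asr_score=-10.5, accent_score=0.9),
]
best_without_prosody = rescore(nbest, weight=0.0)
best_with_prosody = rescore(nbest, weight=1.0)
```

With `weight=0.0` the recognizer's own ranking is kept; a positive weight lets the pitch-accent agreement promote a lower-ranked hypothesis. Because the extra score is computed outside the recognizer, a scheme of this shape keeps the pitch-accent model decoupled from the main ASR system, as the abstract describes. In practice the weight would have to be tuned on held-out data.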
1 Introduction

Prosody refers to the suprasegmental features of natural speech, such as rhythm and intonation, since it normally extends over more than one phoneme segment. Speakers use prosody to convey paralinguistic information such as emphasis, intention, attitude, and emotion. Humans listening to speech with natural prosody are able to understand the content with low cognitive load and high accuracy. However, most modern ASR systems only use an acoustic model and a language model. Acoustic information in ASR is represented by spectral features that are usually extracted over a window length of a few tens of milliseconds. They miss useful information contained in the prosody of the speech