Speech Recognition Using Neural Networks - Chapter 2

2. Review of Speech Recognition

In this chapter we present a brief review of the field of speech recognition. After reviewing some fundamental concepts, we explain the standard Dynamic Time Warping algorithm, and then discuss Hidden Markov Models in some detail, summarizing the algorithms, variations, and limitations associated with this dominant technology.

Fundamentals of Speech Recognition

Speech recognition is a multileveled pattern recognition task in which acoustical signals are examined and structured into a hierarchy of subword units (e.g., phonemes), words, phrases, and sentences. Each level may provide additional temporal constraints (e.g., known word pronunciations or legal word sequences), which can compensate for errors or uncertainties at lower levels. This hierarchy of constraints is best exploited by combining decisions probabilistically at all lower levels and making discrete decisions only at the highest level.

The structure of a standard speech recognition system is illustrated in the accompanying figure. The elements are as follows:

Raw speech. Speech is typically sampled at a high frequency, e.g., 16 kHz over a microphone or 8 kHz over a telephone. This yields a sequence of amplitude values over time.

Signal analysis. Raw speech should first be transformed and compressed in order to simplify subsequent processing. Many signal analysis techniques are available that can extract useful features and compress the data by a factor of ten without losing any important information.
Among the most popular are:

Fourier analysis (FFT) yields discrete frequencies over time, which can be interpreted visually. Frequencies are often distributed using a Mel scale, which is linear in the low range but logarithmic in the high range, corresponding to physiological characteristics of the human ear.

Perceptual Linear Prediction (PLP) is also physiologically motivated, but yields coefficients that cannot be interpreted visually.
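
As a rough illustration of the Fourier analysis step described above, the following sketch splits a synthetic 16 kHz signal into short overlapping frames, computes a magnitude spectrum for each frame, and converts frequencies to the Mel scale. The chapter does not specify any of the details, so the frame length (10 ms), hop (5 ms), Hann window, test tone, and the Hz-to-mel formula `2595 * log10(1 + f / 700)` are all assumptions chosen for illustration. A naive DFT stands in for a real FFT to keep the example dependency-free.

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz; the microphone sampling rate mentioned in the text


def hz_to_mel(freq_hz):
    """Convert Hz to mel using one common formula (assumed, not from the text)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def dft_magnitudes(frame):
    """Magnitude spectrum of one frame via a naive O(n^2) DFT, bins 0..n/2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def spectrogram(samples, frame_len, hop):
    """Cut samples into overlapping Hann-windowed frames and analyze each one."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        frames.append(dft_magnitudes(frame))
    return frames


# Synthetic stand-in for raw speech: a 1 kHz tone, 40 ms long, at 16 kHz.
samples = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(640)]

# 160 samples = 10 ms frames, 80 samples = 5 ms hop (illustrative values).
spec = spectrogram(samples, frame_len=160, hop=80)

# Locate the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 160
```

Each row of `spec` is one time step and each column one discrete frequency, which is the "frequencies over time" picture that can be inspected visually. With the assumed formula, `hz_to_mel` grows roughly linearly below about 1 kHz and flattens toward a logarithmic curve above it, so equal steps in Hz cover fewer mels at high frequencies.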
