Scientific report: "Correcting errors in speech recognition with articulatory dynamics"

Frank Rudzicz
University of Toronto, Department of Computer Science
Toronto, Ontario, Canada
frank@

Abstract

We introduce a novel mechanism for incorporating articulatory dynamics into speech recognition with the theory of task dynamics. This system reranks sentence-level hypotheses by the likelihoods of their hypothetical articulatory realizations, which are derived from relationships learned with aligned acoustic/articulatory data. Experiments compare this with two baseline systems, namely an acoustic hidden Markov model and a dynamic Bayes network augmented with discretized representations of the vocal tract. Our system based on task dynamics reduces word-error rates significantly relative to the best baseline models.

1 Introduction

Although modern automatic speech recognition (ASR) takes several cues from the biological perception of speech, it rarely models its biological production. The result is that speech is treated as a surface acoustic phenomenon with lexical or phonetic hidden dynamics, but without any physical constraints in between. This omission leads to some untenable assumptions. For example, speech is often treated, out of convenience, as a sequence of discrete, non-overlapping packets such as phonemes, despite the fact that some major difficulties in ASR, such as co-articulation, are by definition the result of concurrent physiological phenomena (Hardcastle and Hewlett, 1999).

Many acoustic ambiguities can be resolved with knowledge of the vocal tract's configuration (O'Shaughnessy, 2000). For example, the three nasal sonorants /m/, /n/, and /ng/ are acoustically similar (i.e., they have large concentrations of energy at the same frequencies) but uniquely and reliably involve bilabial closure, tongue-tip elevation, and tongue-dorsum elevation, respectively. Having access to the articulatory goals of the speaker would, in theory, make the identification of linguistic intent almost trivial. Although
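The reranking idea stated in the abstract (rescoring sentence-level hypotheses by the likelihood of their articulatory realizations) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Hypothesis` type, the `rerank` function, the linear interpolation with `weight`, and the toy scores are hypothetical and are not taken from the paper, whose actual scoring rule and articulatory model are not shown in this excerpt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    """One entry of a hypothetical N-best list from a first-pass recognizer."""
    words: str
    acoustic_logp: float  # log-likelihood assigned by the acoustic model


def rerank(
    hyps: List[Hypothesis],
    articulatory_logp: Callable[[str], float],
    weight: float = 0.5,
) -> List[Hypothesis]:
    """Sort hypotheses by an interpolated score, best first.

    The linear interpolation is an assumption of this sketch; it only
    shows how a second knowledge source can reorder an N-best list.
    """
    def score(h: Hypothesis) -> float:
        return (1.0 - weight) * h.acoustic_logp + weight * articulatory_logp(h.words)

    return sorted(hyps, key=score, reverse=True)


# Toy data: the acoustic model slightly prefers the wrong sentence, because
# the nasals /m/ and /n/ are acoustically similar.
nbest = [
    Hypothesis("sun nice", acoustic_logp=-9.5),
    Hypothesis("some mice", acoustic_logp=-10.0),
]

# Stand-in for a learned articulatory model (made-up log-likelihoods).
toy_scores = {"sun nice": -6.0, "some mice": -2.0}

acoustic_only = rerank(nbest, toy_scores.__getitem__, weight=0.0)
combined = rerank(nbest, toy_scores.__getitem__, weight=0.5)

print(acoustic_only[0].words)
print(combined[0].words)
```

With `weight=0.0` the ordering follows the acoustic scores alone; with a nonzero weight, the articulatory score can promote a hypothesis that the acoustic model ranked lower.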
