tailieunhanh - Scientific report: "Web augmentation of language models for continuous speech recognition of SMS text messages"

Mathias Creutz (1), Sami Virpioja (1, 2), and Anna Kovaleva (1)
(1) Nokia Research Center, Helsinki, Finland
(2) Adaptive Informatics Research Centre, Helsinki University of Technology, Espoo, Finland
annakov@

Abstract

In this paper, we present an efficient query selection algorithm for the retrieval of web text data to augment a statistical language model (LM). The number of retrieved relevant documents is optimized with respect to the number of queries submitted. The querying scheme is applied in the domain of SMS text messages. Continuous speech recognition experiments are conducted on three languages: English, Spanish, and French. The web data is utilized for augmenting in-domain LMs in general and for adapting the LMs to a user-specific vocabulary. Word error rate reductions of up to [...] in LM augmentation and [...] in LM adaptation are obtained in setups where the size of the web mixture LM is limited to the size of the baseline in-domain LM.

1 Introduction

An automatic speech recognition (ASR) system consists of acoustic models of speech sounds and of a statistical language model (LM). The LM learns the probabilities of word sequences from text corpora available for training. The performance of the model depends on the amount and style of the text. The more text there is, the better the model is in general.
It is also important that the model be trained on text that matches the style of language used in the ASR application. Well-matching in-domain text may be both difficult and expensive to obtain in the large quantities that are needed. A popular solution is to utilize the World Wide Web as a source of additional text for LM training. A small in-domain set is used as seed data, and more data of the same kind is retrieved from the web. A decade ago, Berger and Miller (1998) proposed a just-in-time LM that updated the current LM by retrieving data from ...
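
To make the idea of a web mixture LM concrete, the following is a minimal sketch and not the authors' implementation. It estimates two toy bigram models, one from in-domain seed text and one from web text, and combines them by linear interpolation, which is one common way to build a mixture LM; the excerpt does not say which method the paper uses. The function names `train_bigram_lm` and `interpolate`, the add-k smoothing, the weight `lam=0.7`, and the example sentences are all assumptions made for illustration.

```python
from collections import Counter


def train_bigram_lm(sentences, vocab, k=1.0):
    """Return p(word | prev) estimated from bigram counts with add-k smoothing.

    `vocab` is the set of words the model may predict (hypothetical helper).
    """
    bigrams = Counter()
    contexts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1

    def prob(word, prev):
        # Add-k smoothing keeps unseen bigrams from getting zero probability.
        return (bigrams[(prev, word)] + k) / (contexts[prev] + k * len(vocab))

    return prob


def interpolate(p_in, p_web, lam):
    """Linear mixture: lam * in-domain model + (1 - lam) * web model."""
    def prob(word, prev):
        return lam * p_in(word, prev) + (1.0 - lam) * p_web(word, prev)

    return prob


# Toy data standing in for the in-domain seed set and the retrieved web text.
in_domain = ["see you at home", "call me later"]
web = ["see you later", "call me at home tonight"]

# Shared prediction vocabulary: every observed word plus the end marker.
vocab = {w for s in in_domain + web for w in s.lower().split()} | {"</s>"}

p_in = train_bigram_lm(in_domain, vocab)
p_web = train_bigram_lm(web, vocab)
p_mix = interpolate(p_in, p_web, lam=0.7)

# "you later" occurs only in the web text, so the mixture gives it more
# probability than the in-domain model alone does.
print(p_in("later", "you"), p_web("later", "you"), p_mix("later", "you"))
```

In practice the interpolation weight would be tuned on held-out in-domain text rather than fixed by hand, and the models would be higher-order n-grams trained with a dedicated toolkit.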
