
Grounded Language Modeling for Automatic Speech Recognition of Sports Video

Michael Fleischman
Massachusetts Institute of Technology
Media Laboratory
mbf@

Deb Roy
Massachusetts Institute of Technology
Media Laboratory
dkroy@

Abstract

Grounded language models represent the relationship between words and the non-linguistic context in which they are said. This paper describes how they are learned from large corpora of unlabeled video and are applied to the task of automatic speech recognition of sports video. Results show that grounded language models improve perplexity and word error rate over text-based language models and, further, support video information retrieval better than human-generated speech transcriptions.

1 Introduction

Recognizing speech in broadcast video is a necessary precursor to many multimodal applications such as video search and summarization (Snoek and Worring, 2005). Although performance is often reasonable in controlled environments such as studio news rooms, automatic speech recognition (ASR) systems have significant difficulty in noisier settings such as those found in live sports broadcasts (Wactlar et al., 1996). While many researchers have examined how to compensate for such noise using acoustic techniques, few have attempted to leverage information in the visual stream to improve speech recognition performance (for an exception, see Mukherjee and Roy, 2003). In many types of video, however, visual context can provide valuable clues as to what has been said.
For example, in video of Major League Baseball games, the likelihood of the phrase "home run" increases dramatically when a home run has actually been hit. This paper describes a method for incorporating such visual information in an ASR system for sports video. The method is based on the use of grounded language models to represent the relationship between words and the non-linguistic context to which they refer (Fleischman and Roy, 2007). Grounded language models are based on