Learning Latent Temporal Structure for Complex Event Detection
Kevin Tang, Li Fei-Fei, Daphne Koller
Computer Science Department, Stanford University
{kdtang, feifeili, koller}@

Abstract

In this paper, we tackle the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. Towards this goal, we utilize a conditional model trained in a max-margin framework that is able to automatically discover discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks. We introduce latent variables over the frames of a video, and allow our algorithm to discover and assign sequences of states that are most discriminative for the event. Our model is based on the variable-duration hidden Markov model, and models durations of states in addition to the transitions between states. The simplicity of our model allows us to perform fast, exact inference using dynamic programming, which is extremely important when we set our sights on being able to process a very large number of videos quickly and efficiently. We show promising results on the Olympic Sports dataset [16] and the 2011 TRECVID Multimedia Event Detection task [18]. We also illustrate and visualize the semantic understanding capabilities of our model.
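The exact inference mentioned in the abstract can be illustrated with a generic dynamic program for a variable-duration (segmental) chain. The sketch below is not the authors' implementation: the function name `segment_viterbi`, the three score tables, and the rule that consecutive segments must use different states are all assumptions made for illustration. It finds the highest-scoring way to split a sequence of frames into segments, where each segment has a state, a per-frame score, a duration score, and a transition score from the previous segment.

```python
import math


def segment_viterbi(frame_scores, duration_scores, transition_scores):
    """Best segmentation under a variable-duration chain (illustrative only).

    frame_scores[t][s]      : score of frame t under state s
    duration_scores[s][d-1] : score of state s lasting d frames (d = 1..max_dur)
    transition_scores[a][b] : score of moving from state a to state b
    Returns (best_score, segments); each segment is (state, start, end),
    with end exclusive.
    """
    T = len(frame_scores)
    S = len(duration_scores)
    max_dur = len(duration_scores[0])

    # prefix[s][t] = sum of frame_scores[0..t-1][s], for O(1) segment sums.
    prefix = [[0.0] * (T + 1) for _ in range(S)]
    for s in range(S):
        for t in range(T):
            prefix[s][t + 1] = prefix[s][t] + frame_scores[t][s]

    neg_inf = -math.inf
    # best[t][s]: best score over frames [0, t) whose last segment has state s.
    best = [[neg_inf] * S for _ in range(T + 1)]
    back = [[None] * S for _ in range(T + 1)]

    for t in range(1, T + 1):
        for s in range(S):
            for d in range(1, min(max_dur, t) + 1):
                start = t - d
                seg = prefix[s][t] - prefix[s][start] + duration_scores[s][d - 1]
                if start == 0:
                    # First segment: no incoming transition.
                    if seg > best[t][s]:
                        best[t][s], back[t][s] = seg, (start, None)
                    continue
                for p in range(S):
                    # Assumption: a segment never follows one with the same state.
                    if p == s or best[start][p] == neg_inf:
                        continue
                    cand = best[start][p] + transition_scores[p][s] + seg
                    if cand > best[t][s]:
                        best[t][s], back[t][s] = cand, (start, p)

    state = max(range(S), key=lambda s: best[T][s])
    score = best[T][state]
    if score == neg_inf:
        raise ValueError("no valid segmentation for these durations")

    # Follow back-pointers from the end of the video to the start.
    segments, t = [], T
    while t > 0:
        start, prev = back[t][state]
        segments.append((state, start, t))
        t, state = start, prev
    segments.reverse()
    return score, segments


if __name__ == "__main__":
    # Toy input: frames 0-1 favor state 0, frames 2-3 favor state 1.
    frames = [[1, 0], [1, 0], [0, 1], [0, 1]]
    durations = [[0, 0, 0], [0, 0, 0]]   # durations 1..3, all equally scored
    transitions = [[0, 0], [0, 0]]
    print(segment_viterbi(frames, durations, transitions))
```

With T frames, S states, and a maximum duration D, this loop runs in O(T · D · S²) time, which is the kind of polynomial cost that makes exact inference practical on large video collections.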
Figure 1. Examples of Internet videos for the event of "Grooming an animal" from the TRECVID MED dataset [18] that illustrate the variance in video length and temporal localization of the event. Video 3 is the only video similar to sequences typically seen in activity recognition tasks, where the event occupies the video in full.

1. Introduction

With the advent of Internet video hosting sites such as YouTube, personal Internet videos are now becoming extremely popular. There are numerous challenges associated with the understanding of these types of videos; we focus on the task of complex event detection. In our problem definition we are ...