Article ID: 6938274
Journal: Journal of Visual Communication and Image Representation
Published Year: 2018
Pages: 19
File Type: PDF
Abstract
In this paper, we propose an efficient and straightforward approach, video you only look once (VideoYOLO), to capture the overall temporal dynamics of an entire video in a single process for action recognition. How to handle the temporal dimension of videos remains an open question in action recognition. Existing methods subdivide a whole video into individual frames or short clips and consequently must process these fragments multiple times; a post-processing step then aggregates the partial dynamic cues to implicitly infer the whole temporal information. In contrast, VideoYOLO first generates a proxy video by selecting a subset of frames that roughly preserves the overall temporal dynamics of the original video. A 3D convolutional neural network (3D-CNN) then learns the overall temporal characteristics from the proxy video and predicts the action category in a single process. Our proposed method is extremely fast: VideoYOLO-32 processes 36 videos per second, 10 times and 7 times faster than prior 2D-CNN (Two-stream (Simonyan and Zisserman, 2014)) and 3D-CNN (C3D (Tran et al., 2015)) based models, respectively, while still achieving superior or comparable classification accuracy on the benchmark datasets UCF101 and HMDB51.
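The pipeline the abstract describes (subsample frames into a proxy video, then classify it with one 3D-CNN forward pass) can be illustrated with the minimal Python/PyTorch sketch below. The uniform sampling strategy, the tiny stand-in network, and all names (make_proxy_video, Tiny3DCNN) are assumptions for illustration, not the authors' implementation; the paper's own 3D-CNN is C3D-style.

# Minimal sketch of the idea described in the abstract: subsample a fixed
# number of frames from a video to form a "proxy video", then classify it
# with a single forward pass of a 3D-CNN. Uniform sampling and the tiny
# network below are illustrative assumptions, not the authors' method.
import torch
import torch.nn as nn

def make_proxy_video(frames: torch.Tensor, num_frames: int = 32) -> torch.Tensor:
    """Uniformly pick num_frames frames from a (T, C, H, W) tensor so the
    proxy roughly preserves the video's overall temporal dynamics
    (assumed sampling strategy)."""
    t = frames.shape[0]
    idx = torch.linspace(0, t - 1, num_frames).long()
    return frames[idx]

class Tiny3DCNN(nn.Module):
    """Stand-in 3D-CNN classifier; the paper uses a C3D-style network."""
    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.classifier(self.features(x).flatten(1))

# One video -> one proxy clip -> one forward pass (single-process inference).
video = torch.rand(300, 3, 112, 112)           # 300 raw frames (T, C, H, W)
proxy = make_proxy_video(video, 32)            # (32, 3, 112, 112)
clip = proxy.permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 32, 112, 112)
logits = Tiny3DCNN(num_classes=101)(clip)      # UCF101 has 101 classes
print(logits.shape)                            # torch.Size([1, 101])

Because the whole video is reduced to one fixed-size clip, inference cost is a single forward pass regardless of video length, which is consistent with the speedups the abstract reports over frame-wise and clip-wise methods.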
Related Topics
Physical Sciences and Engineering Computer Science Computer Vision and Pattern Recognition