Article ID: 4970129
Journal: Pattern Recognition Letters
Published Year: 2017
Pages: 10
File Type: PDF
Abstract
Existing Convolutional Neural Network (CNN) based methods for action recognition are either spatially or temporally local, while actions are inherently 3D signals. In this paper, we propose a global spatial-temporal three-stream CNN architecture that can be used for action feature extraction. Specifically, the three-stream CNN comprises spatial, local temporal and global temporal streams, learned respectively from single frames, optical flow and globally accumulated motion features in the form of a new formulation named Motion Stacked Difference Image (MSDI). Moreover, a novel soft Vector of Locally Aggregated Descriptors (soft-VLAD) is developed to further represent the extracted features, combining the advantages of Gaussian Mixture Models (GMMs) and VLAD by encoding data according to both their overall probability distribution and their differences with respect to clustered centers. To deal with the inadequacy of training samples during learning, we introduce a data augmentation scheme that is highly efficient because it originates from cropping across videos. We conduct our experiments on the UCF101 and HMDB51 datasets, and the results demonstrate the effectiveness of our approach.
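The soft-VLAD idea described in the abstract (GMM posteriors weighting VLAD residuals) can be illustrated with a minimal sketch. This is not the paper's exact formulation: the function name `soft_vlad`, the diagonal-covariance GMM, and the power/L2 normalization step are assumptions chosen for illustration.

```python
import numpy as np

def soft_vlad(descriptors, means, covariances, weights):
    """Illustrative soft-assignment VLAD encoding (assumed formulation).

    descriptors: (N, D) local features
    means, covariances: (K, D) diagonal-Gaussian GMM parameters
    weights: (K,) GMM mixing weights
    Returns a (K * D,) normalized encoding vector.
    """
    N, D = descriptors.shape
    K = means.shape[0]

    # GMM posteriors gamma[n, k] ∝ weights[k] * N(x_n | mu_k, sigma_k),
    # computed in log space for numerical stability.
    log_prob = np.empty((N, K))
    for k in range(K):
        diff = descriptors - means[k]
        log_prob[:, k] = (np.log(weights[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * covariances[k]))
                          - 0.5 * np.sum(diff ** 2 / covariances[k], axis=1))
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Soft-weighted residuals to each cluster center, aggregated per component.
    vlad = np.empty((K, D))
    for k in range(K):
        vlad[k] = (gamma[:, [k]] * (descriptors - means[k])).sum(axis=0)
    vlad = vlad.ravel()

    # Power normalization followed by L2 normalization (common in VLAD pipelines).
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

In contrast to hard-assignment VLAD, every descriptor contributes to all K residual blocks in proportion to its posterior, which captures the overall probability distribution the abstract refers to.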
Related Topics
Physical Sciences and Engineering Computer Science Computer Vision and Pattern Recognition