Article ID: 405804
Journal: Neurocomputing
Published Year: 2016
Pages: 11
File Type: PDF
Abstract

• We exploit super-pixels to obtain semantic motion regions that determine spatial co-occurrence domains.
• To capture co-occurrence statistics at multiple temporal scales and model the relationships among them, a tree-structured model is built in a recursive manner.
• A higher-layer node is generated by fusing the associated lower-layer nodes, which are connected by patch matching.

The spatio-temporal context learnt by traditional methods for action recognition lacks semantic meaning and temporal relationships. To address these drawbacks, we propose a novel semantic context feature-tree model that represents a video clip for efficient human action recognition. The proposed method groups spatio-temporal interest points (STIPs) within an irregular spatio-temporal volume and constructs a semantic tree-structured relationship by nearest-neighbor fusion. Specifically, we first extract STIPs and then use super-pixels to segment the motion image obtained by STIP detection. The points that fall into the same super-pixel are treated as spatially co-occurring semantic features representing one body part. Second, point sets that are temporal nearest neighbors under patch matching are merged into a new node in the next layer, which describes the context of one moving part. After matching and association, the sets of frame indexes are updated and regrouped for the next fusion step, and this process recurses until its stopping condition is no longer satisfied. On the KTH, UCF-YouTube and HOHA action datasets, our representation based on the learnt tree-structured features enhances the discriminative power of the action descriptor and obtains promising results.
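At its core, the method builds the tree bottom-up: super-pixel point groups form leaf nodes, and temporal nearest neighbors found by patch matching are fused into parent nodes, layer by layer, until no further merge occurs. The Python sketch below illustrates only this recursive fusion skeleton; the `Node` structure, the `patch_distance` stand-in (the paper matches patches on appearance, not the frame-gap proxy used here), the `max_gap` threshold, and the stopping condition are all illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A tree node holding a set of STIP features and the frames it spans."""
    features: list                 # STIP descriptors covered by this node
    frames: set                    # frame indexes the node spans
    children: list = field(default_factory=list)

def patch_distance(a: Node, b: Node) -> float:
    """Hypothetical matching cost between two nodes' regions.
    Here: the gap between their frame spans, a stand-in for patch matching."""
    return min(abs(fa - fb) for fa in a.frames for fb in b.frames)

def fuse_layer(nodes: list, max_gap: int = 2) -> list:
    """Merge each node with its temporal nearest neighbor into a parent node."""
    used, parents = set(), []
    for i, node in enumerate(nodes):
        if i in used:
            continue
        # find the closest unused node under the (assumed) matching threshold
        cands = [(patch_distance(node, nodes[j]), j)
                 for j in range(len(nodes)) if j != i and j not in used]
        if cands:
            d, j = min(cands)
            if d <= max_gap:
                used.update((i, j))
                parents.append(Node(node.features + nodes[j].features,
                                    node.frames | nodes[j].frames,
                                    children=[node, nodes[j]]))
                continue
        used.add(i)
        parents.append(node)       # unmatched nodes are carried up unchanged
    return parents

def build_tree(leaves: list) -> list:
    """Recursively fuse layers until no merge happens (assumed stopping rule)."""
    layer = leaves
    while True:
        nxt = fuse_layer(layer)
        if len(nxt) == len(layer): # nothing merged: recursion terminates
            return nxt
        layer = nxt
```

In this reading, the leaf layer encodes spatial co-occurrence (one node per super-pixel region) and each fusion step widens the temporal scale a node covers, so the final layers capture multi-scale temporal context as the highlights describe.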

Related Topics
Physical Sciences and Engineering › Computer Science › Artificial Intelligence