کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
527859 869391 2012 17 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Learning semantic features for action recognition via diffusion maps
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر چشم انداز کامپیوتر و تشخیص الگو
پیش نمایش صفحه اول مقاله
Learning semantic features for action recognition via diffusion maps
چکیده انگلیسی

Efficient modeling of actions is critical for recognizing human actions. Recently, bag of video words (BoVW) representation, in which features computed around spatiotemporal interest points are quantized into video words based on their appearance similarity, has been widely and successfully explored. The performance of this representation however, is highly sensitive to two main factors: the granularity, and therefore, the size of vocabulary, and the space in which features and words are clustered, i.e., the distance measure between data points at different levels of the hierarchy. The goal of this paper is to propose a representation and learning framework that addresses both these limitations.We present a principled approach to learning a semantic vocabulary from a large amount of video words using Diffusion Maps embedding. As opposed to flat vocabularies used in traditional methods, we propose to exploit the hierarchical nature of feature vocabularies representative of human actions. Spatiotemporal features computed around interest points in videos form the lowest level of representation. Video words are then obtained by clustering those spatiotemporal features. Each video word is then represented by a vector of Pointwise Mutual Information (PMI) between that video word and training video clips, and is treated as a mid-level feature. At the highest level of the hierarchy, our goal is to further cluster the mid-level features, while exploiting semantically meaningful distance measures between them. We conjecture that the mid-level features produced by similar video sources (action classes) must lie on a certain manifold. To capture the relationship between these features, and retain it during clustering, we propose to use diffusion distance as a measure of similarity between them. The underlying idea is to embed the mid-level features into a lower-dimensional space, so as to construct a compact yet discriminative, high level vocabulary. Unlike some of the supervised vocabulary construction approaches and the unsupervised methods such as pLSA and LDA, Diffusion Maps can capture local relationship between the mid-level features on the manifold. We have tested our approach on diverse datasets and have obtained very promising results.


► We extract hierarchical features from action videos, namely low-level, mid-level and high-level features.
► Mid-level features are video words represented by a Pointwise Mutual Information vector.
► Mid-level features are embedded into a low-dimensional space using Diffusion Maps.
► K-means are applied to discover high-level features in the low-dimensional space by grouping mid-level features.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Computer Vision and Image Understanding - Volume 116, Issue 3, March 2012, Pages 361–377
نویسندگان
, , , ,