| Article ID | Journal | Published Year | Pages | File Type |
|---|---|---|---|---|
| 535542 | Pattern Recognition Letters | 2013 | 9 Pages | |
Human action recognition in video is important in many computer vision applications, such as automated surveillance. Human actions can be compactly encoded using a sparse set of local spatio-temporal salient features at different scales. Existing bottom-up methods construct a single dictionary of action primitives from the joint features of all scales and, hence, a single action representation. This representation cannot fully exploit the complementary characteristics of motion across different scales. To address this problem, we introduce the concept of learning multiple dictionaries of action primitives at different resolutions and, consequently, multiple scale-specific representations for a given video sample. Using a decoupled fusion of these multiple representations, we improve human action classification accuracy on realistic benchmark databases by about 5% compared with state-of-the-art methods.
► The standard BOW framework cannot fully exploit the characteristics of motion across resolutions.
► Multiple scale-specific dictionaries of action primitives are complementary.
► A set of multiple scale-specific representations is more discriminant than a single global one.
► Decoupled and concatenated representations perform better than a single action representation.
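To make the abstract's idea concrete, the sketch below is a minimal, hypothetical illustration (not the authors' implementation) of multiple scale-specific bag-of-words dictionaries with a decoupled (late) fusion: one k-means codebook is learned per scale, each video gets one histogram per scale, a linear classifier is trained per scale, and the per-scale scores are averaged. All data, parameters, and helper names are placeholders; real spatio-temporal feature extraction is replaced by synthetic features.

```python
# Hypothetical sketch: one dictionary of action primitives per scale,
# one scale-specific histogram per video, and late fusion of per-scale
# classifier scores. Feature extraction is simulated with random data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def bow_histogram(features, codebook):
    """Quantize local features against a codebook; return a normalized histogram."""
    words = codebook.predict(features)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# --- Synthetic stand-in for local spatio-temporal features at 3 scales ------
n_videos, n_scales, feat_dim, words_per_scale = 40, 3, 16, 32
labels = rng.integers(0, 2, size=n_videos)                    # two action classes
videos = [[rng.normal(labels[v], 1.0, size=(100, feat_dim))   # features per scale
           for _ in range(n_scales)] for v in range(n_videos)]

# --- Learn one dictionary (codebook) per scale -------------------------------
codebooks = []
for s in range(n_scales):
    all_feats = np.vstack([videos[v][s] for v in range(n_videos)])
    codebooks.append(KMeans(n_clusters=words_per_scale, n_init=5,
                            random_state=0).fit(all_feats))

# --- Scale-specific representations and per-scale classifiers ----------------
hists = np.array([[bow_histogram(videos[v][s], codebooks[s])
                   for s in range(n_scales)] for v in range(n_videos)])
scale_clfs = [LinearSVC().fit(hists[:, s, :], labels) for s in range(n_scales)]

# --- Decoupled (late) fusion: average per-scale decision scores --------------
scores = np.mean([clf.decision_function(hists[:, s, :])
                  for s, clf in enumerate(scale_clfs)], axis=0)
fused_pred = (scores > 0).astype(int)
print("fused training accuracy:", (fused_pred == labels).mean())
```

The concatenated alternative mentioned in the highlights would instead stack the per-scale histograms into one long vector and train a single classifier on it; the averaging step above is what makes this variant "decoupled".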