Article ID | Journal | Published Year | Pages | File Type
---|---|---|---|---
6957826 | Signal Processing | 2018 | 29 |
Abstract
Action recognition in videos, which contain complex semantic content, remains a challenging task in computer vision research. In this paper, we propose a novel attention mechanism that leverages the gate system of Long Short-Term Memory (LSTM) to compute attention weights for action recognition. The proposed attention mechanism is embedded in a recurrent attention network that explores the spatial-temporal relations between different local regions to concentrate on the important ones. For more accurate attention, we derive a new attention unit from the standard LSTM unit so that the importance of a local region depends only on its input gate. By exploring spatial-temporal relations and using this attention unit, our model attends more accurately and thus achieves better action recognition performance. We evaluate the proposed model on three datasets: UCF101, HMDB51 and Hollywood2, and the results show that it outperforms other attention models with significant improvements.
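To make the gate-based attention idea concrete, below is a minimal sketch, assuming PyTorch; the module name `GateAttentionUnit`, the projection layers `w_xi`/`w_hi`, the mean pooling of the gate, and all dimensions are illustrative assumptions, not the authors' implementation. It shows the core idea from the abstract: a region's attention weight is derived only from its LSTM-style input gate, and the attended feature drives the recurrent update.

```python
# A minimal sketch (not the paper's code) of an attention unit derived from an
# LSTM input gate: each local region is scored by its input-gate activation,
# the scores are normalized into attention weights, and the weighted feature
# feeds a standard LSTM cell. Layer names and pooling choice are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GateAttentionUnit(nn.Module):
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        # standard LSTM cell applied to the attended feature
        self.lstm_cell = nn.LSTMCell(feat_dim, hidden_dim)
        # input-gate projections used only to score each local region
        self.w_xi = nn.Linear(feat_dim, hidden_dim)
        self.w_hi = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, regions, h, c):
        # regions: (batch, K, feat_dim) local region features of one frame
        # h, c:    (batch, hidden_dim) previous recurrent state
        # input gate per region, from its feature and the previous hidden state
        i_gate = torch.sigmoid(self.w_xi(regions) + self.w_hi(h).unsqueeze(1))
        # a region's importance depends only on its input gate
        scores = i_gate.mean(dim=-1)                 # (batch, K)
        alpha = F.softmax(scores, dim=-1)            # attention weights
        # attended feature: attention-weighted sum of region features
        attended = (alpha.unsqueeze(-1) * regions).sum(dim=1)
        # recurrent update with the attended feature
        h, c = self.lstm_cell(attended, (h, c))
        return alpha, h, c
```

In use, one such unit would be applied per frame over the sequence, carrying `(h, c)` forward so the attention at each step reflects the spatial-temporal context accumulated so far.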
Related Topics
Physical Sciences and Engineering
Computer Science
Signal Processing
Authors
Mingxing Zhang, Yang Yang, Yanli Ji, Ning Xie, Fumin Shen