Article ID | Journal | Published Year | Pages | File Type
---|---|---|---|---
10151190 | Neurocomputing | 2018 | 30 Pages |
Abstract
Attention-based encoder-decoder models have shown great success in video captioning. Recent multi-modal video captioning work has mainly focused on applying the attention mechanism to all modalities and fusing them at the same level. However, the connections among specific modalities have not been investigated in the fusion process. In this paper, the expressivity of each uni-modal representation is first investigated. Exploiting the characteristics of the attention mechanism, instance-level visual content is used to refine the temporal features. A semantic detection architecture based on CNN+RNN is then applied to the spatiotemporal content to exploit the correlations between semantic labels for a better video semantic representation. Finally, a hierarchical attention-based multimodal fusion model for video captioning is proposed that jointly considers the intrinsic properties of the multimodal features. Experimental results on the MSVD and MSR-VTT datasets show that the proposed method achieves competitive performance compared with related video captioning methods.
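The hierarchical fusion described in the abstract can be sketched as a two-level attention scheme: first attend within each modality (e.g. temporal visual features and semantic label features), then attend across the resulting per-modality context vectors. The following NumPy sketch is illustrative only, not the authors' implementation; the feature dimensions, the dot-product attention form, and all variable names (`visual`, `semantic`, the projection matrices `Wv`, `Ws`, `Wm`) are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, features, W):
    # Bilinear attention: score each feature row against the query,
    # then return the attention-weighted context vector and the weights.
    scores = features @ W @ query            # shape (T,)
    weights = softmax(scores)
    return weights @ features, weights       # context of shape (d,)

rng = np.random.default_rng(0)
d = 8
query = rng.standard_normal(d)               # decoder hidden state (assumed)
visual = rng.standard_normal((10, d))        # temporal visual features (assumed)
semantic = rng.standard_normal((5, d))       # semantic label features (assumed)
Wv, Ws, Wm = (rng.standard_normal((d, d)) for _ in range(3))

# Level 1: attention within each modality.
v_ctx, _ = attend(query, visual, Wv)
s_ctx, _ = attend(query, semantic, Ws)

# Level 2: attention across the per-modality contexts (hierarchical fusion).
fused, modality_w = attend(query, np.stack([v_ctx, s_ctx]), Wm)
```

In this toy setup the second attention level lets the decoder weight whole modalities rather than fusing all features at a single level, which is the distinction the abstract draws from prior multi-modal captioning work.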
Related Topics
Physical Sciences and Engineering
Computer Science
Artificial Intelligence
Authors
Chunlei Wu, Yiwei Wei, Xiaoliang Chu, Sun Weichen, Fei Su, Leiquan Wang