Article ID: 11030056
Journal: Pattern Recognition Letters
Published Year: 2018
Pages: 10
File Type: PDF
Abstract
In this paper, we propose a novel automatic video captioning system that translates videos into sentences using a deep neural network composed of three convolutional and recurrent building blocks. The first subnetwork operates as a feature extractor on single frames. The second subnetwork is a three-stream network that captures spatial semantic information in the first stream, temporal semantic information in the second stream, and global video concept information in the third stream. The third subnetwork generates the textual captions, taking the spatiotemporal features of the second subnetwork as input. Experimental validation demonstrates the effectiveness of the proposed model, which achieves superior performance over competing methods.
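To make the three-part pipeline concrete, below is a minimal PyTorch-style sketch of how such subnetworks could compose: a per-frame feature extractor, a three-stream module producing a fused spatiotemporal code, and a recurrent caption decoder. All module names, dimensions, stream designs, and the concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the three-subnetwork pipeline described in the
# abstract. Every design detail below (layer types, sizes, fusion by
# concatenation) is an assumption for illustration only.

class FrameFeatureExtractor(nn.Module):
    """Subnetwork 1: per-frame feature extractor (stands in for a CNN backbone)."""
    def __init__(self, in_dim=2048, feat_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, feat_dim)  # assumes precomputed CNN features

    def forward(self, frames):                   # frames: (batch, time, in_dim)
        return torch.relu(self.proj(frames))

class ThreeStreamNetwork(nn.Module):
    """Subnetwork 2: spatial, temporal, and global-concept streams."""
    def __init__(self, feat_dim=512, hid=256):
        super().__init__()
        self.spatial = nn.Linear(feat_dim, hid)                   # spatial semantics
        self.temporal = nn.GRU(feat_dim, hid, batch_first=True)   # temporal semantics
        self.concept = nn.Linear(feat_dim, hid)                   # global video concepts

    def forward(self, x):                        # x: (batch, time, feat_dim)
        spatial = torch.relu(self.spatial(x)).mean(dim=1)  # pool per-frame semantics
        _, h = self.temporal(x)
        temporal = h[-1]                                   # last GRU hidden state
        concept = torch.relu(self.concept(x.mean(dim=1)))  # video-level concept code
        return torch.cat([spatial, temporal, concept], dim=1)  # fused spatiotemporal code

class CaptionDecoder(nn.Module):
    """Subnetwork 3: recurrent caption generator conditioned on the fused features."""
    def __init__(self, ctx_dim=768, vocab=10000, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, hid)
        self.init_h = nn.Linear(ctx_dim, hid)    # map video code to initial LSTM state
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, ctx, tokens):              # ctx: (batch, ctx_dim); tokens: (batch, seq)
        h0 = torch.tanh(self.init_h(ctx)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(out)                     # per-step vocabulary logits

# Usage: frames -> per-frame features -> fused spatiotemporal code -> caption logits
frames = torch.randn(2, 16, 2048)                # 2 videos, 16 frames of CNN features
feats = FrameFeatureExtractor()(frames)
ctx = ThreeStreamNetwork()(feats)
logits = CaptionDecoder()(ctx, torch.randint(0, 10000, (2, 12)))
print(logits.shape)                              # torch.Size([2, 12, 10000])
```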
Related Topics
Physical Sciences and Engineering › Computer Science › Computer Vision and Pattern Recognition
Authors
, , ,