Deep sequential fusion LSTM network for image description

Article ID	Journal	Published Year	Pages	File Type
6863546	Neurocomputing	2018	11 Pages	PDF

Abstract

Automatic image description is a challenging task: it aims to translate the visual information in an image into natural language that conforms to proper grammar and sentence structure. In this work, an optimal learning framework called the deep sequential fusion based long short-term memory network is designed. In the proposed framework, a layer-wise strategy is introduced into the generation process of the recurrent neural network to increase the depth of the language model and produce more abstract and discriminative features. Then, a deep supervision method is developed to enrich the model capacity with extra regularization. Moreover, the prediction scores from all of the auxiliary branches in the language model are fused with the product rule to form the final decision output, which makes further use of the optimized model parameters and hence boosts performance. Experimental results on two public benchmark datasets verify the effectiveness of the proposed approaches: the consensus-based image description evaluation metric (CIDEr) reaches 103.4 on the MSCOCO dataset, and the metric for evaluation of translation with explicit ordering (METEOR) reaches 20.6 on the Flickr30K dataset.
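
As an illustration of the two mechanisms named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes each auxiliary branch emits a probability distribution over the vocabulary at a given time step, that product-rule fusion means an elementwise product of those distributions followed by renormalization, and that deep supervision can be approximated by summing the cross-entropy losses of all branches. The function names `product_rule_fusion` and `deep_supervision_loss` are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0); an illustrative choice


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def product_rule_fusion(branch_probs):
    """Fuse per-branch word distributions with the product rule.

    branch_probs: list of distributions, one per auxiliary branch,
    each a list of probabilities over the same vocabulary.
    The product is computed as a sum of logs to avoid underflow,
    then renormalized so the result sums to 1.
    """
    vocab_size = len(branch_probs[0])
    log_fused = [
        sum(math.log(p[i] + EPS) for p in branch_probs)
        for i in range(vocab_size)
    ]
    return softmax(log_fused)


def deep_supervision_loss(branch_probs, target_index):
    """Assumed form of deep supervision: the sum of the cross-entropy
    losses of every branch for the ground-truth word."""
    return sum(-math.log(p[target_index] + EPS) for p in branch_probs)
```

Under these assumptions, a word receives a high fused score only if every branch assigns it a reasonably high probability, so one confident branch cannot override strong disagreement from the others.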

Keywords

Image description