Improving the BoVW via discriminative visual n-grams and MKL strategies

Article ID	Journal	Published Year	Pages	File Type
407207	Neurocomputing	2016	14 Pages	PDF

Abstract

The Bag-of-Visual-Words (BoVW) representation has been widely used in a range of high-level computer vision tasks. The idea behind it is similar to the Bag-of-Words (BoW) representation used in Natural Language Processing (NLP): extract features from the dataset, then build feature histograms that represent each instance. The approach is simple and effective, which makes it applicable to a wide range of problems, but it inherits a well-known limitation from the traditional BoW: it disregards the spatial information among extracted features (the analogue of sequential information in text), which could be useful for capturing discriminative visual patterns. In this paper, we alleviate this limitation through the joint use of visual words and multi-directional sequences of visual words (visual n-grams). The contribution of this paper is twofold: (i) to build new, simple yet effective visual features inspired by the popular idea of n-gram representations in NLP, and (ii) to propose Multiple Kernel Learning (MKL) strategies that better exploit the joint use of visual words and visual n-grams in Image Classification (IC) tasks. For the former, we propose building a codebook of visual n-grams and using them as attributes to represent images by means of the BoVW representation. For the latter, we treat visual words and visual n-grams as different feature spaces and propose MKL strategies to better integrate the visual information. We evaluate our proposal on the image classification task using five datasets: Histopathology, Birds, Butterflies, Scenes, and a 6-class subset of CalTech-101. Experimental results show that the proposed strategies exploiting our visual n-grams outperform or are competitive with (i) the traditional BoVW, (ii) the BoVW using visual n-grams under traditional fusion schemes (e.g., ensemble-based classifiers), and (iii) other approaches in the literature for IC that consider the spatial context.
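
To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes each image has already been quantized into a regular grid of visual-word indices, counts every 2-gram along four hypothetical directions (the paper instead builds a codebook of visual n-grams), and combines a word kernel and an n-gram kernel with a fixed weight `beta`. An actual MKL method would learn that weight from training data; all function names and parameters here are illustrative assumptions.

```python
import numpy as np

# Assumed directions for chaining neighbouring visual words:
# horizontal, vertical, and the two diagonals.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]

def word_histogram(grid, vocab_size):
    """L1-normalised histogram of visual words (plain BoVW)."""
    hist = np.bincount(grid.ravel(), minlength=vocab_size).astype(float)
    return hist / hist.sum()

def bigram_histogram(grid, vocab_size):
    """L1-normalised histogram of visual 2-grams over all DIRECTIONS."""
    rows, cols = grid.shape
    hist = np.zeros(vocab_size * vocab_size)
    for dr, dc in DIRECTIONS:
        for r in range(rows):
            for c in range(cols):
                r2, c2 = r + dr, c + dc
                if 0 <= r2 < rows and 0 <= c2 < cols:
                    # Encode the ordered pair (first word, second word).
                    hist[grid[r, c] * vocab_size + grid[r2, c2]] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist

def intersection_kernel(A, B):
    """Histogram intersection kernel between rows of A and rows of B."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

def combined_kernel(K_words, K_ngrams, beta):
    """Convex combination of two base kernels; beta is fixed here,
    whereas an MKL method would learn it."""
    return beta * K_words + (1.0 - beta) * K_ngrams

# Toy usage: three random 5x5 grids over a 4-word vocabulary.
rng = np.random.default_rng(0)
grids = [rng.integers(0, 4, size=(5, 5)) for _ in range(3)]
W = np.stack([word_histogram(g, 4) for g in grids])
N = np.stack([bigram_histogram(g, 4) for g in grids])
K = combined_kernel(intersection_kernel(W, W), intersection_kernel(N, N), beta=0.5)
```

The resulting matrix `K` could be passed to any classifier that accepts a precomputed kernel. Keeping the two histograms as separate feature spaces, rather than concatenating them, is what allows each space to receive its own kernel and weight.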

Keywords

Image classification; Visual words; Multiple kernel learning