A quantitative evaluation of the conceptual consistency of visual words and visual vocabularies

Article ID	Journal	Published Year	Pages	File Type
532452	Journal of Visual Communication and Image Representation	2015	10 Pages	PDF

Abstract

• We question/analyse the basic assumptions made for the visual words approach.
• Experimental support for the following three statements.
• There are more visually distinct patterns than can be listed in a codebook.
• One element of a codebook represents a set of many, visually distinct patterns.
• There are no single, selective SIFT descriptors to serve as codebook elements.

Codebooks are a widely accepted technique for recognising objects by sets of local features. The method has been applied to many classes of objects, even very abstract ones. But although state-of-the-art recognition rates have been reported, the method is still far from reliable in any sense that relates to human vision. The literature on this topic emphasises detailed descriptions of statistical estimators over a basic analysis of the data. A deeper understanding of the data is, however, needed for the field to develop further. In this paper, we therefore present a set of quantitative experiments on codebooks of the popular SIFT descriptors. The results discourage the use of illustrative but oversimplified descriptions of the visual words approach. In particular, it is demonstrated that (1) there are more visually distinct patterns than can be listed in a codebook, (2) one element of a codebook represents a set of many, visually distinct patterns, and (3) there are no single, selective SIFT descriptors to serve as codebook elements. This makes us wonder why the method works at all. We discuss several options.
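For readers unfamiliar with the codebook approach the abstract refers to, the following is a minimal sketch of the usual pipeline: cluster local descriptors into a codebook, assign each descriptor to its nearest codebook element, and summarise an image as a histogram of those assignments. It is an illustration under stated assumptions, not the authors' experimental setup. The descriptors are random 128-dimensional vectors standing in for SIFT, the clustering is plain k-means, and the function names (`build_codebook`, `assign_words`, `bow_histogram`) and the codebook size of 50 are hypothetical choices made for this example.

```python
import numpy as np

def assign_words(descriptors, codebook):
    """Index of the nearest codebook element (squared Euclidean distance)."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is constant
    # per descriptor, so it can be dropped without changing the argmin.
    d2 = -2.0 * descriptors @ codebook.T + (codebook ** 2).sum(axis=1)
    return d2.argmin(axis=1)

def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of codebook elements."""
    rng = np.random.default_rng(seed)
    # Initialise the centres with k distinct descriptors chosen at random.
    start = rng.choice(len(descriptors), size=k, replace=False)
    centres = descriptors[start].copy()
    for _ in range(n_iter):
        labels = assign_words(descriptors, centres)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    return centres

def bow_histogram(descriptors, codebook):
    """Normalised bag-of-visual-words histogram for one image."""
    words = assign_words(descriptors, codebook)
    counts = np.bincount(words, minlength=len(codebook)).astype(float)
    return counts / counts.sum()

# Assumption: synthetic data in place of real SIFT descriptors (128-D).
rng = np.random.default_rng(42)
train = rng.random((2000, 128))
codebook = build_codebook(train, k=50)

image_descriptors = rng.random((300, 128))
hist = bow_histogram(image_descriptors, codebook)
```

Because every descriptor is forced onto one of only `k` codebook elements, each element necessarily stands for many different input vectors. This many-to-one mapping is the property that statements (1) and (2) of the abstract examine.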

Keywords

SIFT; Pattern recognition; Computer vision; Image classification; SURF; Codebook; Bag of Visual Words