Article ID: 4968815
Journal: Computer Vision and Image Understanding
Published Year: 2017
Pages: 16 Pages
File Type: PDF
Abstract

• We propose a novel and generic approach for cross-modal search based on Structural SVM.
• Our approach provides max-margin guarantees and better generalization than competing methods.
• We analyze and compare different aspects of our approach, such as training and testing time, and performance across datasets.
• Extensive experiments demonstrate the efficacy of our approach.

Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given a query image (“Im2Text”), and (ii) predicting image(s) given a piece of text (“Text2Im”). We make no assumption about the specific form of the text; i.e., it could be a set of labels, phrases, or even captions. We pose both tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an independent text corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of unannotated images (i.e., images without any associated textual meta-data). We propose a novel Structural SVM based unified framework for these two tasks, and show how it can be efficiently trained and tested. Using a variety of loss functions, extensive experiments are conducted on three popular datasets (two medium-scale datasets containing a few thousand samples each, and one web-scale dataset containing one million samples). The experiments demonstrate that our framework gives promising results compared to competing baseline cross-modal search techniques, confirming its efficacy.
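To make the retrieval framing concrete, the following is a minimal sketch of how such cross-modal ranking could be scored once a compatibility model has been learned. It is not the paper's exact formulation: the bilinear score x^T W y, the feature dimensions, and the names (W, image_feat, text_corpus_feats) are all illustrative assumptions; in the paper, W would be learned with max-margin Structural SVM training rather than sampled at random.

```python
# Illustrative sketch (assumed formulation, not the authors' exact method):
# cross-modal retrieval with a bilinear compatibility score s(x, y) = x^T W y,
# where x is an image feature, y a text feature, and W a learned matrix.
import numpy as np

rng = np.random.default_rng(0)

d_img, d_txt, n_texts = 512, 300, 1000                  # assumed feature dims / corpus size
W = rng.normal(size=(d_img, d_txt))                     # stand-in for a learned compatibility matrix
image_feat = rng.normal(size=d_img)                     # query image descriptor (e.g., CNN features)
text_corpus_feats = rng.normal(size=(n_texts, d_txt))   # features of an independent text corpus

# Im2Text: score every text in the corpus against the query image and rank.
scores = text_corpus_feats @ (W.T @ image_feat)         # shape (n_texts,)
ranked_text_ids = np.argsort(-scores)                   # highest-scoring texts first
print("top-5 retrieved text ids:", ranked_text_ids[:5])

# Text2Im is symmetric: score a query text feature against an unannotated
# image collection with the same W, and rank images by x^T W y.
```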

Related Topics
Physical Sciences and Engineering › Computer Science › Computer Vision and Pattern Recognition