Article ID: 527851 · Journal ID: 869388 · Year: 2012 · 12-page PDF · Free full-text download (English version)
English Title (ISI Article)
Discovering hierarchical object models from captioned images
Related Topics
Engineering and Basic Sciences › Computer Engineering › Computer Vision and Pattern Recognition
English Abstract

We address the problem of automatically learning the recurring associations between the visual structures in images and the words in their associated captions, yielding a set of named object models that can be used for subsequent image annotation. In previous work, we used language to drive the perceptual grouping of local features into configurations that capture small parts (patches) of an object. However, model scope was poor, leading to poor object localization during detection (annotation), and ambiguity was high when part detections were weak. We extend and significantly revise our previous framework by using language to drive the perceptual grouping of parts, each a configuration in the previous framework, into hierarchical configurations that offer greater spatial extent and flexibility. The resulting hierarchical multipart models remain scale, translation and rotation invariant, but are more reliable detectors and provide better localization. Moreover, unlike typical frameworks for learning object models, our approach requires no bounding boxes around the objects to be learned, can handle heavily cluttered training scenes, and is robust in the face of noisy captions, i.e., where objects in an image may not be named in the caption, and objects named in the caption may not appear in the image. We demonstrate improved precision and recall in annotation over the non-hierarchical technique and also show extended spatial coverage of detected objects.


► We learn to recognize exemplars from unstructured collections of captioned images.
► Using language, we perceptually group local features into meaningful parts.
► We further group discovered parts into flexible hierarchical configurations.
► Learned visual structures are scale, translation and rotation invariant.
► Learning is robust to distractors, clutter, ambiguous and incomplete captions.
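The core idea summarized above is that recurring word–structure associations can be mined from co-occurrence statistics across the captioned collection. A minimal, illustrative sketch of such an association score follows; the data, part labels, and the simple conditional-probability measure are assumptions for demonstration, not the paper's actual model:

```python
from collections import Counter
from itertools import product

# Toy captioned dataset: each image pairs a set of caption words with a set
# of detected visual part labels (stand-ins for feature configurations).
images = [
    ({"car", "road"}, {"p1", "p2"}),
    ({"car", "tree"}, {"p1", "p3"}),
    ({"dog", "tree"}, {"p4", "p3"}),
    ({"car"},         {"p1"}),
]

def association_scores(images):
    """Score each (word, part) pair by how often the part appears
    in images whose caption contains the word: P(part | word)."""
    word_counts, pair_counts = Counter(), Counter()
    for words, parts in images:
        word_counts.update(words)
        pair_counts.update(product(words, parts))
    return {(w, p): c / word_counts[w] for (w, p), c in pair_counts.items()}

scores = association_scores(images)
# Part most strongly associated with the word "car".
best_part = max((p for (w, p) in scores if w == "car"),
                key=lambda p: scores[("car", p)])  # "p1" in this toy data
```

In this toy collection, "p1" appears in every image captioned "car", so it scores 1.0 and is picked as the strongest candidate part; a real pipeline would combine such evidence with spatial grouping and noise handling, as the abstract describes.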

Publisher
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Computer Vision and Image Understanding - Volume 116, Issue 7, July 2012, Pages 842–853
Authors