Article ID: 4968857
Journal: Computer Vision and Image Understanding
Published Year: 2017
Pages: 18 Pages
File Type: PDF
Abstract
Weakly supervised learning for object detection has gained significant attention in recent years. Visually similar objects are extracted automatically from weakly labeled videos, thus bypassing the tedious process of manually annotating training data. However, the problem as applied to small or medium-sized objects remains largely unexplored. Our observation is that weakly labeled information can be derived from videos involving human-object interactions. Since in such videos the object is characterized neither by its appearance nor by its motion, we propose a robust framework that taps valuable human context and models the similarity of objects based on appearance and functionality. Furthermore, the framework is designed to maximize the utility of the data by detecting possibly multiple instances of an object in each video. We show that object models trained in this fashion achieve between 86% and 92% of the performance of their fully supervised counterparts on three challenging RGB and RGB-D datasets.
Related Topics
Physical Sciences and Engineering Computer Science Computer Vision and Pattern Recognition