Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
4948296 | Neurocomputing | 2016 | 6 Pages |
Abstract
The task of multimedia event detection (MED) aims at training a set of models that can automatically detect the most event-relevant videos from large datasets. In this paper, we attempt to build a robust spatial-temporal deep neural network for large-scale video event detection. In our setting, each video follows a multiple instance assumption, where its visual segments contain both spatial and temporal properties of events. Regarding these properties, we try to implement the MED system by a two-step training phase: unsupervised recurrent video reconstruction and supervised fine-tuning. We conduct extensive experiments on the challenging TRECVID MED14 dataset, which indicate that with the consideration of both spatial and temporal information, the detection performance can be further boosted compared with the state-of-the-art MED models.
Keywords
Related Topics
Physical Sciences and Engineering
Computer Science
Artificial Intelligence
Authors
Litao Yu, Xiaoshuai Sun, Zi Huang,