Article ID Journal Published Year Pages File Type
535856 Pattern Recognition Letters 2012 8 Pages PDF
Abstract

This paper proposes a multimodal approach to distinguish silence from speech situations, and to identify the location of the active speaker in the latter case. In our approach, a video camera is used to track the faces of the participants, and a microphone array is used to estimate the Sound Source Location (SSL) using the Steered Response Power with the phase transform (SRP-PHAT) method. The audiovisual cues are combined, and two competing Hidden Markov Models (HMMs) are used to detect silence or the presence of a person speaking. If speech is detected, the corresponding HMM also provides the spatio-temporally coherent location of the speaker. Experimental results show that incorporating the HMM improves the results over the unimodal SRP-PHAT, and the inclusion of video cues provides even further improvements.

► Multimodal approach for HCI. ► Use of microphone array and a single câmera. ► HMM for spatio-temporal coherence. ► Joint approach for voice activity detection (VAD) and Sound Source Localization (SSL).

Related Topics
Physical Sciences and Engineering Computer Science Computer Vision and Pattern Recognition
Authors
, , , , ,