A review of recent advances in visual speech decoding

Article ID	Journal	Published Year	Pages	File Type
527042	Image and Vision Computing	2014	16 Pages	PDF

Abstract

• A detailed review of the recent advances in the area of visual speech decoding.
• Visual features tackling speaker dependency, head poses and temporal information.
• Dynamic audio-visual speech information fusion.
• Recent techniques of facial landmark localization.
• Summary of audio-visual speech databases and ASR performance on them.

Visual speech information plays an important role in automatic speech recognition (ASR), especially when the audio is corrupted or even inaccessible. Despite the success of audio-based ASR, the problem of visual speech decoding remains wide open. This paper provides a detailed review of recent advances in this research area. In contrast to the previous survey [97], which covers the whole ASR system that uses visual speech information, we focus on the important questions asked by researchers and summarize the recent studies that attempt to answer them. In particular, three questions relate to the extraction of visual features, concerning speaker dependency, pose variation and temporal information, respectively. Another question concerns audio-visual speech fusion, considering the dynamic changes in modality reliability encountered in practice. In addition, the state of the art in facial landmark localization is briefly introduced. These advanced techniques can be used to improve region-of-interest detection, but have been largely ignored when building visual-based ASR systems. We also provide details of audio-visual speech databases. Finally, we discuss the remaining challenges and offer our insights into future research on visual speech decoding.

Keywords

Automatic speech recognition; Lip-reading; Review