Article ID: 10368610
Journal: Computer Speech & Language
Published Year: 2005
Pages: 22
File Type: PDF
Abstract
Fuse is a situated spoken language understanding system that uses visual context to steer the interpretation of speech. Given a visual scene and a spoken description, the system finds the object in the scene that best fits the meaning of the description. To solve this task, Fuse performs speech recognition and visually grounded language understanding. Rather than treating these two problems separately, the system fuses knowledge of the visual semantics of language with the specific contents of the visual scene during speech processing. As a result, the system anticipates the various ways a person might describe any object in the scene and uses these predictions to bias the speech recognizer towards likely sequences of words. A dynamic visual attention mechanism focuses processing on likely objects within the scene as spoken utterances are processed. Visual attention and language prediction reinforce one another and converge on interpretations of incoming speech signals that are most consistent with the visual context. In evaluations, introducing visual context into the speech recognition process significantly improved speech recognition and understanding accuracy. The underlying principles of this model may be applied to a wide range of speech understanding problems, including mobile and assistive technologies in which contextual information can be sensed and semantically interpreted to bias processing.
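The abstract describes the fusion loop only informally. The sketch below is a minimal illustration, not the paper's implementation, of how such a loop could be wired up: it assumes a recognizer that returns scored word-sequence hypotheses and a hypothetical object-conditioned word model (word_given_object), and alternates between re-scoring hypotheses under visual attention and updating attention toward the objects the best hypothesis describes.

```python
import math

def interpret(hypotheses, scene_objects, word_given_object, steps=5):
    """Illustrative fusion of speech hypotheses with visual context.

    hypotheses: list of (word_sequence, acoustic_log_prob) from a recognizer.
    scene_objects: list of object ids present in the visual scene.
    word_given_object: dict mapping (word, object_id) -> P(word | object);
        a hypothetical visually grounded language model.
    """
    # Start with uniform visual attention over the objects in the scene.
    attention = {obj: 1.0 / len(scene_objects) for obj in scene_objects}
    best_words = hypotheses[0][0]

    for _ in range(steps):
        # Re-score each hypothesis: acoustic evidence plus how well its
        # words describe the currently attended objects.
        scored = []
        for words, acoustic in hypotheses:
            visual = sum(
                attention[obj] * math.prod(
                    word_given_object.get((w, obj), 1e-6) for w in words
                )
                for obj in scene_objects
            )
            scored.append((words, acoustic + math.log(visual + 1e-12)))
        best_words, _ = max(scored, key=lambda s: s[1])

        # Shift attention toward objects the best hypothesis describes well
        # (a simple Bayesian update), so prediction and attention reinforce
        # one another across iterations.
        new_attention = {
            obj: attention[obj] * math.prod(
                word_given_object.get((w, obj), 1e-6) for w in best_words
            )
            for obj in scene_objects
        }
        total = sum(new_attention.values()) or 1.0
        attention = {obj: p / total for obj, p in new_attention.items()}

    # Return the fused transcript and the most attended (referred-to) object.
    return best_words, max(attention, key=attention.get)
```

Under these assumptions, the returned object is the peak of the attention distribution and the transcript is the hypothesis that best balances acoustic evidence against the visual context, mirroring the mutual reinforcement described above.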
Related Topics
Physical Sciences and Engineering; Computer Science; Signal Processing
Authors