A robust/fast spoken term detection method based on a syllable n-gram index with a distance metric

Article ID	Journal	Published Year	Pages	File Type
565939	Speech Communication	2013	16 Pages	PDF

Abstract

For spoken document retrieval, it is crucial to handle out-of-vocabulary (OOV) words and the mis-recognition of spoken words. Consequently, sub-word unit based recognition and retrieval methods have been proposed. This paper describes a Japanese spoken term detection method for spoken documents that is robust to OOV words and mis-recognition. To solve the problem of OOV keywords, we use individual syllables as the sub-word unit in continuous speech recognition. To address OOV words and recognition errors while keeping retrieval fast, we propose a distant n-gram indexing/retrieval method that incorporates a distance metric in a syllable lattice. When applied to syllable sequences, our proposed method outperformed a conventional dynamic time warping (DTW) method between syllable sequences and was about 100 times faster. The retrieval results show that we can detect OOV words in a database containing 44 h of audio in less than 10 ms per query with an F-measure of 0.54.

► New spoken term detection technique for large spoken documents is proposed.
► It is based on syllable-trigram with distance metric considering recognition errors.
► It outperformed DTW between syllable sequences and was about 100 times faster.
► OOV terms can be detected in less than 10 ms per query from 44 h audio data.
► F-measure was 0.54 in the spoken OOV term retrieval case.
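
The general idea of retrieving a term through a syllable trigram index that tolerates recognition errors can be illustrated with a minimal sketch. This is not the authors' implementation: it indexes a single best syllable sequence per document rather than a syllable lattice, and the distance used here (a count of mismatched syllables between two trigrams) is an assumption chosen only for illustration. The function names `trigrams`, `build_index`, `trigram_distance`, and `search` are hypothetical, as are the toy documents.

```python
from collections import defaultdict


def trigrams(syllables):
    """Return (position, trigram) pairs for a syllable sequence."""
    return [(i, tuple(syllables[i:i + 3])) for i in range(len(syllables) - 2)]


def build_index(documents):
    """Map each syllable trigram to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, syllables in documents.items():
        for pos, gram in trigrams(syllables):
            index[gram].append((doc_id, pos))
    return index


def trigram_distance(a, b):
    """Toy distance (an assumption): number of mismatched syllables."""
    return sum(x != y for x, y in zip(a, b))


def search(index, query, max_dist=1):
    """Return (doc_id, start, total_distance) hits, best first.

    A hit requires every query trigram to match an indexed trigram
    within max_dist at the same alignment (start = position - offset).
    """
    query_grams = trigrams(query)
    if not query_grams:
        return []  # queries shorter than three syllables are not handled here
    votes = defaultdict(dict)  # (doc_id, start) -> {query offset: distance}
    for offset, q_gram in query_grams:
        # Linear scan over index keys keeps the sketch short; a real system
        # would avoid this, for example by precomputing near neighbours.
        for gram, postings in index.items():
            dist = trigram_distance(q_gram, gram)
            if dist > max_dist:
                continue
            for doc_id, pos in postings:
                votes[(doc_id, pos - offset)][offset] = dist
    hits = [
        (doc_id, start, sum(per_offset.values()))
        for (doc_id, start), per_offset in votes.items()
        if len(per_offset) == len(query_grams)
    ]
    return sorted(hits, key=lambda hit: (hit[2], hit[0], hit[1]))


# Hypothetical recognizer output: doc2 contains "mi" where "ni" was spoken.
documents = {
    "doc1": ["ko", "n", "ni", "chi", "wa"],
    "doc2": ["ko", "n", "mi", "chi", "wa"],
    "doc3": ["sa", "yo", "na", "ra"],
}
index = build_index(documents)
print(search(index, ["ko", "n", "ni", "chi"], max_dist=1))
```

With `max_dist=0` only the exactly recognized occurrence in `doc1` is returned; allowing one mismatched syllable per trigram also recovers the misrecognized occurrence in `doc2`, at a higher total distance.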

Keywords

n-Gram; Spoken term detection