Article code: 4973656
Journal code: 1451683
Publication year: 2017
English article: 36-page PDF
Full-text version: Free download
English title of the ISI article
Cross database audio visual speech adaptation for phonetic spoken term detection
Persian translation of the title
سازگاری گفتار بصری صوتی پایگاه داده برای تشخیص اصطلاح گفتاری آوایی
Keywords
Spoken term detection, synchronous hidden Markov model, cross-database training, phone recognition
Related topics
Engineering and Basic Sciences, Computer Engineering, Signal Processing
English abstract
Spoken term detection (STD), the process of finding all occurrences of a specified search term in a large number of speech segments, has many applications in multimedia search and information retrieval. It is known that the use of video information in the form of lip movements can improve the performance of STD in the presence of audio noise. However, research in this direction has been hampered by the unavailability of large annotated audio-visual databases for development. We propose a novel approach to developing audio-visual spoken term detection when only a small (low-resource) audio-visual database is available for development. First, cross-database training is proposed as a novel framework based on the fused hidden Markov modeling (HMM) technique: an audio model is trained on large, publicly available audio databases and is then adapted to the visual data of the given audio-visual database. This approach is shown to perform better than the standard HMM joint-training method and also improves the performance of spoken term detection when used in the indexing stage. In a second approach, the external audio models are first adapted to the audio data of the given audio-visual database and then to its visual data. This approach also improves both phone recognition and spoken term detection accuracy. Finally, the cross-database training technique is used as HMM initialization, and an extra parameter re-estimation step is applied to the initialized models using the Baum-Welch technique. The proposed approaches to audio-visual model training make it possible to benefit from both the large out-of-domain audio databases that are available and the small audio-visual database given for development, yielding more accurate audio-visual models.
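The abstract does not say how the audio model is adapted to the visual data. As a rough illustration only, and not the authors' procedure, the sketch below assumes that a frame-level HMM state alignment has already been obtained from the audio model, and fits one diagonal Gaussian per state to the time-aligned visual features. The function name `fit_visual_gaussians`, the `var_floor` parameter, and the toy data are hypothetical.

```python
from collections import defaultdict


def fit_visual_gaussians(state_alignment, visual_feats, var_floor=1e-3):
    """Fit a diagonal Gaussian to the visual features of each HMM state.

    state_alignment: one state label per frame, assumed to come from
        aligning the audio stream with an already trained audio model.
    visual_feats: one visual feature vector (list of floats) per frame.
    var_floor: hypothetical lower bound that keeps variances positive.
    Returns {state: (mean_vector, variance_vector)}.
    """
    if len(state_alignment) != len(visual_feats):
        raise ValueError("alignment and features must have the same length")

    # Group the visual frames by the audio-derived state label.
    frames_by_state = defaultdict(list)
    for state, feat in zip(state_alignment, visual_feats):
        frames_by_state[state].append(feat)

    params = {}
    for state, frames in frames_by_state.items():
        n = len(frames)
        dim = len(frames[0])
        mean = [sum(f[d] for f in frames) / n for d in range(dim)]
        var = [
            max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, var_floor)
            for d in range(dim)
        ]
        params[state] = (mean, var)
    return params


# Toy data: four frames, two states, two-dimensional visual features.
alignment = [0, 0, 1, 1]
features = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [10.0, 10.0]]
params = fit_visual_gaussians(alignment, features)
```

In a setup like this, the audio model's states and transitions would stay fixed and only the visual emission parameters would be estimated from the small audio-visual database. A later Baum-Welch pass, as the abstract mentions, could then re-estimate all parameters jointly.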
Publisher
Database: Elsevier - ScienceDirect
Journal: Computer Speech & Language - Volume 44, July 2017, Pages 1-21