Article ID: 4944902 · Journal ID: 1438015 · Year of publication: 2016 · English article: 14-page PDF
English title of the ISI article
Visual speaker identification and authentication by joint spatiotemporal sparse coding and hierarchical pooling
Persian translation of the title
Visual speaker identification and authentication by joint spatiotemporal sparse coding and hierarchical pooling
Keywords
Visual speaker identification, visual speaker authentication, sparse coding, hierarchical pooling
Related subjects
Engineering and Basic Sciences › Computer Engineering › Artificial Intelligence
English abstract
Recent research shows that lip shape and lip movement contain abundant identity-related information and can be used as a new kind of biometrics in speaker identification or authentication. In this paper, we propose a new lip feature representation for lip biometrics which is able to describe the static and dynamic characteristics of a lip sequence. The new representation captures both the physiological and behavioral aspects of the lip and is robust against variations caused by different speaker positions and poses. In our approach, a lip sequence is first divided into several subsequences along the temporal dimension. For each subsequence, sparse coding (SC for short) is adopted to characterize the minutiae of the lip region and its movement in small spatiotemporal cells. Then max-pooling based on a hierarchical spatiotemporal structure is performed on the SC codes to generate the final feature for each subsequence. Finally, the entire lip sequence is represented by the set of features of its subsequences. Experiments were carried out on a dataset with 40 speakers and compared with three state-of-the-art approaches. The experimental results show that the proposed feature achieves high identification accuracy (99.96%) and very low authentication error (a Half Total Error Rate (HTER) of 0.46%), outperforming the other approaches investigated. Moreover, even with random variations caused by different speaker positions and poses, the proposed feature still provides good identification (an accuracy of 99.18%) and authentication results (an HTER of 2.34%), with much lower performance degradation than the other approaches investigated. Finally, even when there is only one training sample per speaker, the proposed feature still achieves high discriminative power (an accuracy of 98.39% and an HTER of 2.62%).
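The pipeline described in the abstract (split the lip sequence along time, sparse-code small spatiotemporal cells, then max-pool the codes) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the dictionary here is random, the greedy matching-pursuit coder is a simplified stand-in for whatever SC solver the paper uses, and all sizes (cell shape, dictionary size, number of subsequences) are made-up toy values.

```python
import numpy as np

def sparse_code(x, D, n_nonzero=5):
    """Greedy matching-pursuit style sparse coding of signal x against
    dictionary D (columns = unit-norm atoms). Simplified stand-in for the
    paper's SC step."""
    code = np.zeros(D.shape[1])
    residual = x.astype(float).copy()
    for _ in range(n_nonzero):
        j = np.argmax(np.abs(D.T @ residual))   # best-matching atom
        coef = D[:, j] @ residual
        code[j] += coef
        residual -= coef * D[:, j]
    return code

def encode_subsequence(frames, D, cell=(4, 8, 8)):
    """Sparse-code every spatiotemporal cell of one subsequence, then
    max-pool the codes over all cells (one level of the pooling hierarchy)."""
    T, H, W = frames.shape
    ct, ch, cw = cell
    codes = []
    for t in range(0, T - ct + 1, ct):
        for y in range(0, H - ch + 1, ch):
            for x in range(0, W - cw + 1, cw):
                patch = frames[t:t+ct, y:y+ch, x:x+cw].ravel()
                codes.append(np.abs(sparse_code(patch, D)))
    return np.max(codes, axis=0)                # max-pooling over cells

def encode_sequence(frames, D, n_sub=4):
    """Split the lip sequence along time into n_sub subsequences and
    encode each one; the sequence is represented by this list of features."""
    return [encode_subsequence(s, D) for s in np.array_split(frames, n_sub)]

rng = np.random.default_rng(0)
D = rng.standard_normal((4 * 8 * 8, 64))
D /= np.linalg.norm(D, axis=0)           # unit-norm dictionary atoms
seq = rng.standard_normal((32, 16, 16))  # toy 32-frame, 16x16 lip sequence
feats = encode_sequence(seq, D)
print(len(feats), feats[0].shape)        # 4 subsequence features, each 64-dim
```

In practice the dictionary would be learned from training lip data and the pooling would be repeated over several nested spatiotemporal regions; identification or authentication would then compare the resulting per-subsequence feature sets between speakers.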
Publisher
Database: Elsevier - ScienceDirect
Journal: Information Sciences - Volume 373, 10 December 2016, Pages 219-232