Article code: 4973711
Journal code: 1451681
Publication year: 2017
English article: 16-page PDF
Full-text version: Free download
English title of the ISI article
Scalable algorithms for unsupervised clustering of acoustic data for speech recognition
Persian translation of the title
الگوریتم های مقیاس پذیر برای خوشه بندی بدون نظارت داده های صوتی برای تشخیص گفتار
Related topics
Engineering and Basic Sciences > Computer Engineering > Signal Processing
English abstract
In this paper, an unsupervised clustering algorithm is developed for acoustic data in the context of speech recognition tasks. One of the key features of the algorithm is its scalability to large data sets. Specifically, given unlabeled training and test sets, the class labels of the utterances are obtained automatically. The extracted labels may correspond to the speakers in the speech corpus if the data is relatively clean. The proposed scheme is attractive from an industrial perspective because it alleviates the need to store the speaker labels manually, saving a considerable amount of human effort and expense. The core of the algorithm is a three-stage architecture in which the stages process the input data one after another, each designed to perform a well-defined, specific task. In more detail, the first pass involves a bottom-up clustering mechanism, the second pass comprises a cluster-splitting operation, and the third pass consists of a cluster-refining process. Each stage allows for data parallelization across multiple CPUs, which leads to faster computation. Two alternative forms of the algorithm are presented to facilitate the clustering: the first uses Gaussian distributions and the other uses i-Vectors. Although the algorithm may find applications in various areas of speech recognition, in this paper the effectiveness of the schemes is evaluated by means of speaker adaptive training (SAT) and speaker-aware training of DNN-HMM acoustic models. In particular, experiments are conducted on the Switchboard task to extract the speaker labels for the utterances in the training and test sets. It is shown that the SAT DNN-HMM trained using the Gaussian-based scheme yields a 7.2% relative improvement in ASR accuracy over the speaker-independent DNN-HMM, whereas the i-Vector approach provides an additional improvement, amounting to a 10.8% relative gain overall. The standard SAT DNN-HMM developed using the ground-truth speaker labels is found to be only 2.7% better, in relative terms, than the proposed scheme. A similar observation is made for speaker-aware training. The analysis of computational complexity, conducted stage by stage, demonstrates the scalability of the proposed algorithms.
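The three passes named in the abstract (bottom-up clustering, cluster splitting, cluster refining) can be illustrated with a toy sketch. The abstract does not give the merging, splitting, or refining criteria, so everything below is an assumption made for illustration: each utterance is represented by a fixed-length vector (standing in for Gaussian statistics or an i-Vector), merging uses centroid distance, splitting uses a hypothetical `max_spread` threshold, and refining is a k-means style reassignment. All function names are hypothetical and this is not the authors' implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(cluster):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))


def bottom_up(points, n_clusters):
    """Pass 1 (assumed form): merge the two closest centroids until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        cents = [centroid(c) for c in clusters]
        _, i, j = min(
            (sq_dist(cents[i], cents[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


def split_wide(clusters, max_spread):
    """Pass 2 (assumed form): split each cluster whose mean squared spread is too large."""
    result = []
    for c in clusters:
        m = centroid(c)
        spread = sum(sq_dist(p, m) for p in c) / len(c)
        if len(c) < 2 or spread <= max_spread:
            result.append(c)
            continue
        # Seed the two halves with points that are far apart.
        seed_a = max(c, key=lambda p: sq_dist(p, m))
        seed_b = max(c, key=lambda p: sq_dist(p, seed_a))
        part_a = [p for p in c if sq_dist(p, seed_a) <= sq_dist(p, seed_b)]
        part_b = [p for p in c if sq_dist(p, seed_a) > sq_dist(p, seed_b)]
        result.extend(part for part in (part_a, part_b) if part)
    return result


def refine(points, clusters, n_iter=10):
    """Pass 3 (assumed form): k-means style reassignment to the nearest centroid."""
    cents = [centroid(c) for c in clusters]
    groups = clusters
    for _ in range(n_iter):
        groups = [[] for _ in cents]
        for p in points:
            k = min(range(len(cents)), key=lambda idx: sq_dist(p, cents[idx]))
            groups[k].append(p)
        # Keep the old centroid if a group happens to become empty.
        cents = [centroid(g) if g else c for g, c in zip(groups, cents)]
    return [g for g in groups if g]


def cluster_utterances(points, n_initial, max_spread):
    """Run the three passes in sequence and return the final groups."""
    first = bottom_up(points, n_initial)
    second = split_wide(first, max_spread)
    return refine(points, second)


def make_toy_data(seed=0, per_group=10):
    """Synthetic 2-D vectors around three centres; not real acoustic data."""
    rng = random.Random(seed)
    centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [
        (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in centres
        for _ in range(per_group)
    ]


if __name__ == "__main__":
    groups = cluster_utterances(make_toy_data(), n_initial=2, max_spread=4.0)
    print([len(g) for g in groups])
```

In this toy setup the first pass is deliberately stopped at two clusters, so the second pass has an over-wide cluster to split and the third pass has something to tidy up. The data parallelization across CPUs mentioned in the abstract is not shown.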
Publisher
Database: Elsevier - ScienceDirect
Journal: Computer Speech & Language - Volume 46, November 2017, Pages 233-248