Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
536541 | Pattern Recognition Letters | 2011 | 10 Pages |
To improve the performance of a voice activity detector (VAD) in noisy environments, this paper concentrates on three critical aspects of noise robustness: speech features, feature distributions, and temporal dependence. Based on statistics gathered from the TIMIT and NOIZEUS corpora, Mel-frequency cepstral coefficients (MFCCs) are selected as speech features, Gaussian mixture distributions (GMDs) are applied to associate the observations in the MFCC domain with both speech and non-speech states, and Weibull and Gamma distributions are used to explicitly model noise and speech durations, respectively. To integrate these aspects into VAD, the hidden semi-Markov model (HSMM), a generalization of the hidden Markov model (HMM), is introduced first. The VAD decision is then made according to a likelihood ratio test (LRT) that incorporates state prior knowledge and modified forward variables of the HSMM. We design a recursive scheme to calculate the modified forward variables efficiently. Finally, a series of experiments demonstrates (1) the positive effect of the different robustness-related schemes adopted in the proposed VAD and (2) better performance than the standard ITU-T G.729B, Adaptive Multi-Rate VAD phase 2 (AMR2), Advanced Front-End (AFE), HMM-based VAD, and VAD using a Laplacian–Gaussian model (LD–GD based VAD).
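The motivation for modeling durations explicitly can be illustrated with a small sketch. A plain HMM implies a geometric duration distribution for each state, which always peaks at one frame, whereas Gamma and Weibull densities with shape greater than one peak at a longer duration. The code below discretizes both densities onto whole frames and compares them with the geometric case. All parameter values (shapes, scales, the self-loop probability, and the 400-frame cap) are illustrative assumptions, not values reported in the paper, and the helper names are hypothetical.

```python
import math


def weibull_pdf(x, shape, scale):
    """Weibull density, used in the abstract for noise (non-speech) durations."""
    return (shape / scale) * (x / scale) ** (shape - 1) * math.exp(-((x / scale) ** shape))


def gamma_pdf(x, shape, scale):
    """Gamma density, used in the abstract for speech durations."""
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)


def discretize(pdf, max_frames):
    """Sample a continuous density at d = 1..max_frames and renormalize to a pmf."""
    weights = [pdf(float(d)) for d in range(1, max_frames + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def geometric_pmf(self_loop, max_frames):
    """Duration pmf implied by an HMM state with the given self-transition probability."""
    return [(1 - self_loop) * self_loop ** (d - 1) for d in range(1, max_frames + 1)]


# Illustrative parameters only (assumed, not taken from the paper).
speech = discretize(lambda x: gamma_pdf(x, 3.0, 20.0), 400)
noise = discretize(lambda x: weibull_pdf(x, 1.5, 60.0), 400)
hmm = geometric_pmf(0.98, 400)

# Most likely duration in frames under each model.
speech_mode = speech.index(max(speech)) + 1
noise_mode = noise.index(max(noise)) + 1
hmm_mode = hmm.index(max(hmm)) + 1
```

Under these assumed parameters the geometric pmf puts its highest mass on a one-frame segment, while the Gamma and Weibull pmfs favor longer segments. That difference in shape is what an explicit duration model can capture and a standard HMM cannot.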
Research highlights
► We study three noise-robust schemes for VAD and, for the first time, model durations explicitly.
► Speech and noise durations are found to follow Gamma and Weibull distributions, respectively.
► We are the first to use an HSMM to integrate noise-robust schemes and the LRT into VAD.
► Modified forward variables allow state probabilities to be calculated recursively and efficiently.
► The HSMM-based VAD outperforms G.729B, AMR2, AFE, HMM-based and LD–GD based VADs.
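For readers unfamiliar with HSMMs, the following is a minimal sketch of the textbook explicit-duration forward recursion for a two-state (speech/non-speech) model. It is not the paper's modified forward variable, whose definition is not given in the abstract. It only shows how a recursive computation can combine emission likelihoods with a duration pmf. The function name, argument layout, and the assumption that leaving one state always enters the other are choices made for this illustration.

```python
def hsmm_forward(lik, prior, dur):
    """Textbook explicit-duration forward pass for a two-state HSMM (illustrative).

    lik[t][j]  : emission likelihood of frame t under state j
    prior[j]   : probability that the first segment is in state j
    dur[j][d-1]: probability that a segment of state j lasts exactly d frames
    Returns alpha, where alpha[t][j] is the joint probability of frames 0..t
    and of a state-j segment ending exactly at frame t.
    """
    num_frames = len(lik)
    alpha = [[0.0, 0.0] for _ in range(num_frames)]
    for t in range(num_frames):
        for j in (0, 1):
            total = 0.0
            emit = 1.0
            max_d = min(len(dur[j]), t + 1)
            for d in range(1, max_d + 1):
                # Extend the candidate segment one frame further into the past.
                emit *= lik[t - d + 1][j]
                start = t - d  # last frame before the segment; -1 means none
                if start < 0:
                    enter = prior[j]
                else:
                    # Two states only: the previous segment was the other state.
                    enter = alpha[start][1 - j]
                total += enter * dur[j][d - 1] * emit
            alpha[t][j] = total
    return alpha


# Toy check with uninformative emissions (every likelihood equal to 1).
ones = [[1.0, 1.0] for _ in range(4)]
unit = hsmm_forward(ones, [0.5, 0.5], [[1.0], [1.0]])            # segments of length 1
pairs = hsmm_forward(ones, [0.5, 0.5], [[0.0, 1.0], [0.0, 1.0]])  # segments of length 2
```

With every segment forced to last two frames, a segment can end only at the second and fourth frames, so `pairs` is zero at the other two. The sketch works in the probability domain for brevity; on long recordings a practical implementation would use log-probabilities or scaling to avoid underflow.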