Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
534460 | Pattern Recognition Letters | 2010 | 13 Pages |
Speech activity detectors (SADs) are essential in noisy environments if speech applications, such as speech recognition, are to achieve acceptable performance. This paper presents a two-stage speech activity detection system. In the first stage, a voice activity detector discards pause segments from the audio signal, even in the presence of stationary background noise. In the second stage, the remaining segments are classified as speech or non-speech. To find the best feature set for speech/non-speech classification, a large set of robust features is introduced, and the optimal subset is selected by applying a genetic algorithm (GA) to the initial feature set. The fractal dimensions of numeric series of prosodic features are found to be the features that best discriminate speech from non-speech. The system's models are trained on a Farsi database, FARSDAT; test experiments on the English TIMIT database have also been conducted. Employing the SAD system in conjunction with an automatic speech recognition (ASR) system yields a relative word error rate (WER) reduction of up to 28.3%.
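To make the idea of a "fractal dimension of a numeric series" concrete, the following is a minimal sketch, not the authors' method. The abstract does not say which fractal-dimension estimator is used; Katz's estimator is swapped in here purely for illustration, and the function name `katz_fd` and the sample pitch values are hypothetical.

```python
import math

def katz_fd(series):
    """Katz fractal dimension of a 1-D numeric series (illustrative estimator,
    assumed here; the paper's actual estimator is not stated in the abstract)."""
    n = len(series) - 1  # number of steps between successive samples
    if n < 1:
        raise ValueError("need at least two samples")
    # Treat sample i as the planar point (i, series[i]).
    # Curve length: sum of distances between successive points.
    length = sum(math.hypot(1.0, series[i + 1] - series[i]) for i in range(n))
    # Extent: largest distance from the first point to any other point.
    extent = max(math.hypot(i, series[i] - series[0]) for i in range(1, n + 1))
    denom = math.log10(n) + math.log10(extent / length)
    if denom == 0.0:
        raise ValueError("Katz estimate is undefined for this series")
    return math.log10(n) / denom

# Hypothetical per-frame pitch values (Hz), invented for illustration only.
smooth_contour = [100, 101, 102, 103, 104]
jagged_contour = [100, 101, 100, 101, 100]
print(katz_fd(smooth_contour), katz_fd(jagged_contour))
```

Under this estimator a straight-line contour has a dimension close to 1, while a more irregular contour scores higher, which suggests why such a measure could help separate speech from non-speech segments.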