Article ID: 565917
Journal: Speech Communication
Published Year: 2014
Pages: 14
File Type: PDF
Abstract

•We investigate phoneme contributions to speaker individuality using three languages.
•Voiced and unvoiced phonemes have different effects on speaker information.
•We propose a new method to reduce phoneme-related effects on speaker individuality.
•Our method reduces speaker recognition errors by up to 67.3% across the three languages.
•The method detects speaker individuality accurately even with automatic phoneme alignment.

Extracting speaker information from speech signals is a key procedure for exploring individual speaker characteristics and the most critical part of a speaker recognition system: the features must preserve individual information while attenuating linguistic information. However, it is difficult to separate individual information from linguistic information in a given utterance. For this reason, we investigated potential effects on speaker individual information that arise from differences in articulation due to speaker-specific morphology of the speech organs, comparing English, Chinese, and Korean. We found that voiced and unvoiced phonemes have different frequency distributions of speaker information, and that these effects are consistent across the three languages, while the effect of nasal sounds on speaker individuality is language-dependent. Because these differences are confounded with speaker individual information, feature extraction is negatively affected. Accordingly, we propose a new feature extraction method that detects speaker individual information more accurately by suppressing phoneme-related effects; phoneme alignment is required only once, when constructing the filter bank for phoneme-effect suppression, and is not needed during feature extraction itself. The proposed method was evaluated by implementing it in GMM speaker models for speaker identification experiments. The proposed approach outperformed both Mel-Frequency Cepstral Coefficients (MFCC) and the traditional F-ratio-based feature (FFCC). The proposed feature reduced recognition errors by 32.1–67.3% for the three languages compared with MFCC, and by 6.6–31% compared with FFCC. When an automatic phoneme aligner is combined with the proposed method, the results demonstrate that speaker individuality can be detected with about the same accuracy as with manual phoneme alignment.
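The abstract only summarizes the pipeline, so the sketch below is a minimal, hypothetical illustration of the kind of baseline it compares against: mel filter-bank energies weighted per band by an F-ratio (between-speaker variance over within-speaker variance, the idea behind FFCC-style features), followed by GMM speaker identification. It is not the authors' implementation; the function names, the weighting scheme, the choice of librosa/scikit-learn, and all parameter values (24 mel bands, 13 cepstral coefficients, 16 Gaussian components) are assumptions for illustration only.

```python
# Minimal sketch (not the authors' method): F-ratio weighted mel-band cepstra
# and GMM speaker identification. Assumes numpy, scipy, librosa, scikit-learn.
import numpy as np
import librosa
from scipy.fftpack import dct
from sklearn.mixture import GaussianMixture

def mel_band_energies(wav, sr, n_mels=24):
    """Log mel filter-bank energies, one row per frame."""
    S = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return np.log(S + 1e-10).T  # shape: (frames, n_mels)

def f_ratio(energies_per_speaker):
    """Classic F-ratio per band: variance of the per-speaker band means
    (between-speaker) divided by the mean within-speaker band variance."""
    means = np.stack([e.mean(axis=0) for e in energies_per_speaker])
    between = means.var(axis=0)
    within = np.mean([e.var(axis=0) for e in energies_per_speaker], axis=0)
    return between / (within + 1e-10)

def weighted_cepstra(energies, weights, n_ceps=13):
    """Emphasize speaker-discriminative bands before the DCT (cepstral) step."""
    return dct(energies * weights, type=2, axis=1, norm='ortho')[:, :n_ceps]

def train_speaker_models(train_utts, sr, weights):
    """One GMM per speaker, trained on its weighted cepstral features.
    train_utts maps speaker id -> list of waveforms."""
    models = {}
    for spk, wavs in train_utts.items():
        feats = np.vstack([weighted_cepstra(mel_band_energies(w, sr), weights)
                           for w in wavs])
        models[spk] = GaussianMixture(n_components=16,
                                      covariance_type='diag').fit(feats)
    return models

def identify(wav, sr, weights, models):
    """Return the speaker whose GMM gives the highest average log-likelihood."""
    feats = weighted_cepstra(mel_band_energies(wav, sr), weights)
    return max(models, key=lambda spk: models[spk].score(feats))
```

In this sketch the band weights are estimated once from held-out training speakers and then reused for every utterance, loosely mirroring the abstract's point that phoneme alignment (or any speaker-pooled statistic) is needed only when building the filter-bank weighting, not during feature extraction at test time.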

Related Topics
Physical Sciences and Engineering › Computer Science › Signal Processing
Authors