Article code | Journal code | Publication year (English article) | Full-text version
568602 | 1452031 | 2015 | 15-page PDF
English title (ISI article)
Classification of lexical stress using spectral and prosodic features for computer-assisted language learning systems
Related subjects
Engineering and Basic Sciences; Computer Engineering; Signal Processing
English abstract


• A system for classification of lexical stress for language learners is proposed.
• It successfully combines spectral and prosodic characteristics using GMMs.
• Models are trained on native speech and do not require manual labeling.
• A method for controlling the operating point of the system is proposed.
• We achieve a 20% error rate on Japanese children speaking English.
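The highlights mention a method for controlling the system's operating point, but the abstract does not describe it. The following is a minimal, hypothetical sketch of one common way to move an operating point, not the authors' method: a scalar `bias` (an assumed parameter) is added to the log-score of the intended stress class before the decision, and sweeping it trades false alarms (correct syllables flagged) against misses (real stress errors accepted). The function names, the score matrix, and the toy ground-truth labels are all invented for illustration.

```python
import numpy as np


def flag_stress_errors(scores, target, bias):
    """Flag syllables whose detected stress class differs from the target.

    scores: (n, c) per-syllable class log-scores (hypothetically, GMM
    log-likelihoods plus log-priors); target: (n,) intended class indices;
    bias: assumed scalar added to the target class's score. A larger bias
    makes the system more lenient, a smaller one makes it stricter.
    """
    adjusted = np.array(scores, dtype=float)  # copy so the input is untouched
    target = np.asarray(target)
    adjusted[np.arange(adjusted.shape[0]), target] += bias
    return adjusted.argmax(axis=1) != target


def sweep_operating_points(scores, target, is_error, biases):
    """Return (bias, false_alarm_rate, miss_rate) for each bias value."""
    is_error = np.asarray(is_error, dtype=bool)
    results = []
    for b in biases:
        flagged = flag_stress_errors(scores, target, b)
        # False alarm: a correctly stressed syllable is flagged.
        fa = flagged[~is_error].mean() if (~is_error).any() else 0.0
        # Miss: a genuinely mis-stressed syllable is not flagged.
        miss = (~flagged[is_error]).mean() if is_error.any() else 0.0
        results.append((b, float(fa), float(miss)))
    return results


# Toy data (invented): columns are 0=unstressed, 1=primary, 2=secondary.
scores = np.array([[-1.0, -1.2, -3.0],
                   [-2.0, -0.5, -2.5],
                   [-0.4, -3.0, -2.0],
                   [-1.5, -1.4, -3.0]])
target = np.array([1, 1, 0, 0])
is_error = np.array([True, False, False, False])

for b, fa, miss in sweep_operating_points(scores, target, is_error, [0.0, 0.15, 0.5]):
    print(f"bias={b:.2f}  false-alarm rate={fa:.2f}  miss rate={miss:.2f}")
```

With these toy numbers, a bias of 0 flags one correctly stressed syllable, a bias of 0.5 misses the one real error, and 0.15 avoids both; on real data the bias would be tuned on held-out speech to reach the desired balance.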

We present a system for detection of lexical stress in English words spoken by English learners. The system was designed to be part of the EduSpeak® computer-assisted language learning (CALL) software. It uses both prosodic and spectral features to detect the level of stress (unstressed, primary or secondary) of each syllable in a word. Features are computed on the vowels and include normalized energy, pitch, spectral tilt and duration measurements, as well as log-posterior probabilities obtained from frame-level mel-frequency cepstral coefficients (MFCCs). Gaussian mixture models (GMMs) represent the distribution of these features for each stress class. The system is trained on utterances from L1-English children and tested on English speech from L1-English children and from L1-Japanese children with varying levels of English proficiency. Because it is trained on data from L1-English speakers, the system can be applied to English utterances from speakers of any L1 without retraining. Furthermore, automatically determined stress patterns are used as the intended targets, so the training data do not need to be hand-labeled; this allows a large amount of data to be used for training. Our algorithm achieves an error rate of approximately 11% on English utterances from L1-English speakers and 20% on English utterances from L1-Japanese speakers. We show that all features, both spectral and prosodic, are needed to achieve optimal performance on data from L1-English speakers; the MFCC log-posterior features are the best single feature set, followed by duration, energy, pitch and, finally, spectral tilt. For English utterances from L1-Japanese speakers, energy, MFCC log-posterior probabilities and duration are the most important features.
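To make the GMM step concrete, here is a minimal sketch of classifying syllables with one diagonal-covariance GMM per stress class, assuming a standard maximum-posterior decision (the abstract does not state the exact decision rule). It also shows how class log-posteriors follow from GMM log-likelihoods and priors by Bayes' rule, the same kind of quantity the abstract uses as MFCC-based features. The 2-D feature vectors (think of them as a normalized vowel duration and energy), the mixture parameters, the uniform priors, and all function names are invented for illustration; the real system uses more features and trained models.

```python
import numpy as np

# Class names for the three stress levels named in the abstract.
STRESS_CLASSES = ("unstressed", "primary", "secondary")


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along `axis`."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def diag_gmm_loglik(x, weights, means, variances):
    """Per-row log p(x | GMM) for a diagonal-covariance Gaussian mixture.

    x: (n, d) feature vectors; weights: (k,); means, variances: (k, d).
    """
    x = np.atleast_2d(x)
    diff = x[:, None, :] - means[None, :, :]                           # (n, k, d)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (k,)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)          # (n, k)
    return logsumexp(np.log(weights) + log_norm - 0.5 * mahal, axis=1)


def classify_syllables(x, models, log_priors):
    """Label each syllable with the stress class of highest posterior.

    Returns the labels and the (n, 3) matrix of log P(class | x),
    obtained by normalizing log-likelihood + log-prior over the classes.
    """
    joint = np.stack([diag_gmm_loglik(x, *models[c]) + log_priors[c]
                      for c in STRESS_CLASSES], axis=1)
    log_post = joint - logsumexp(joint, axis=1)[:, None]
    labels = [STRESS_CLASSES[i] for i in np.argmax(log_post, axis=1)]
    return labels, log_post


# Invented toy models: (weights, means, variances) per class.
models = {
    "unstressed": (np.array([0.5, 0.5]),
                   np.array([[-1.0, -1.0], [-1.5, -0.5]]),
                   np.full((2, 2), 0.3)),
    "primary": (np.array([1.0]), np.array([[1.5, 1.5]]), np.full((1, 2), 0.4)),
    "secondary": (np.array([1.0]), np.array([[0.5, 0.0]]), np.full((1, 2), 0.4)),
}
log_priors = {c: np.log(1.0 / 3.0) for c in STRESS_CLASSES}

x = np.array([[-1.2, -0.8], [1.6, 1.4], [0.5, 0.1]])
labels, log_post = classify_syllables(x, models, log_priors)
print(labels)  # ['unstressed', 'primary', 'secondary']
```

In practice the per-class mixtures would be fitted on native-speaker vowels, for example with scikit-learn's `GaussianMixture(covariance_type="diag")`, whose `score_samples` method returns the same per-sample log-likelihood computed by `diag_gmm_loglik` here.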

Publisher
Database: Elsevier - ScienceDirect
Journal: Speech Communication - Volume 69, May 2015, Pages 31–45