Text-to-speech synthesis system with Arabic diacritic recognition system

Article ID	Journal	Published Year	Pages	File Type
559012	Computer Speech & Language	2015	18 Pages	PDF

Abstract

• We developed an Arabic text-to-speech system that includes a diacritization system.
• The speech synthesis system is based on the statistical parametric approach.
• We address the accuracy of the diacritic and acoustic models.
• We propose a diacritization system based on the position of the current letter.
• A synthesis system based on one neural network per unit type generates high-quality speech.

Text-to-speech synthesis has been widely studied for many languages. However, speech synthesis for Arabic has not progressed sufficiently and is still at an early stage. Statistical parametric synthesis based on hidden Markov models has been the most commonly applied approach for Arabic. Recently, speech synthesized with deep neural networks was found to be as intelligible as the human voice. This paper describes a text-to-speech (TTS) synthesis system for Modern Standard Arabic based on the statistical parametric approach and Mel-cepstral coefficients. Deep neural networks have achieved state-of-the-art performance in a wide range of tasks, including speech synthesis. Our TTS system includes a diacritization system, which is very important for Arabic TTS applications. The diacritization system is also based on deep neural networks. In addition to deep learning techniques, different methods are proposed to model the acoustic parameters in order to improve the accuracy of the acoustic models. These methods are based either on linguistic and acoustic characteristics (e.g. a letter-position-based diacritization system, a unit-type-based synthesis system, and a diacritic-mark-based synthesis system) or on deep learning techniques (stacked generalization). Experimental results show that our diacritization system generates diacritized text with high accuracy. For the speech synthesis system, the experimental results and subjective evaluation show that the proposed method generates intelligible and natural speech.
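
The abstract does not say how letter position enters the diacritization model. As a rough illustration only, the sketch below shows one possible reading: diacritization treated as per-letter labelling, where each training example pairs a base letter and its position in the word with the diacritic marks that follow it. The function name `letter_examples`, the four position labels, and the restriction to the marks U+064B to U+0652 are assumptions made for this example, not details taken from the paper.

```python
# Illustrative sketch only: turns a diacritized Arabic word into
# (letter, position, diacritics) examples for a per-letter classifier.
# The position labels and the helper name are hypothetical.

# Arabic combining marks from fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def letter_examples(word):
    """Split a diacritized word into (letter, position, diacritics) triples."""
    pairs = []  # each entry is [base_letter, attached_diacritics]
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base letter
                pairs[-1][1] += ch
        else:
            pairs.append([ch, ""])

    count = len(pairs)
    examples = []
    for index, (letter, marks) in enumerate(pairs):
        if count == 1:
            position = "isolated"
        elif index == 0:
            position = "initial"
        elif index == count - 1:
            position = "final"
        else:
            position = "medial"
        examples.append((letter, position, marks))
    return examples


# Usage: the word "kataba" (kaf, teh, beh, each followed by a fatha).
word = "\u0643\u064E\u062A\u064E\u0628\u064E"
for letter, position, marks in letter_examples(word):
    print(position, [hex(ord(m)) for m in marks])
```

In a setup like this, the undiacritized letter and its position would be the classifier input and the mark string would be the target label; the network architecture and any further context features are left open here because the abstract does not specify them.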

Keywords

Text-to-speech synthesis; Deep neural networks; Natural language processing