Article ID: 485441 · Journal: Procedia Computer Science · Published Year: 2016 · Pages: 6 · File Type: PDF
Abstract

We train neural networks of varying depth with a loss function that encourages the output representations to have a temporal profile resembling that of phonemes. We show that a simple loss function, which maximizes the dissimilarity between nearby frames and long-distance frames, helps construct a speech embedding that improves phoneme discriminability both within and across speakers, even though the loss function uses only within-speaker information. However, with too deep an architecture, this loss function leads to overfitting, suggesting the need for more data and/or regularization.
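The exact objective is not given in the abstract; the following is a minimal sketch of one plausible reading, assuming a margin-based contrastive loss in which frame embeddings that are temporally close are pulled together and embeddings that are far apart are pushed beyond a margin. The thresholds `near`, `far`, and `margin` are illustrative parameters, not values from the paper.

```python
import numpy as np

def temporal_contrastive_loss(frames, near=2, far=15, margin=1.0):
    """Hedged sketch of a temporal contrastive objective.

    frames: (T, D) array of frame embeddings from one utterance
            (within-speaker information only, as in the abstract).
    Pairs at most `near` steps apart are treated as same-phoneme-like
    and pulled together; pairs at least `far` steps apart are treated
    as different and pushed beyond `margin`.
    """
    T = frames.shape[0]
    pos_terms, neg_terms = [], []
    for i in range(T):
        for j in range(i + 1, T):
            d = np.linalg.norm(frames[i] - frames[j])
            if j - i <= near:
                # near frames: penalize any distance
                pos_terms.append(d ** 2)
            elif j - i >= far:
                # distant frames: penalize only if closer than the margin
                neg_terms.append(max(0.0, margin - d) ** 2)
    loss = 0.0
    if pos_terms:
        loss += float(np.mean(pos_terms))
    if neg_terms:
        loss += float(np.mean(neg_terms))
    return loss
```

For a sequence of identical embeddings, the near-frame term is zero and every distant pair contributes the full margin penalty, so the loss pushes the network toward representations that change over phoneme-scale time spans while staying stable locally.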
