Article ID: 485441 · Journal: Procedia Computer Science · Published Year: 2016 · Pages: 6 · File Type: PDF
Abstract

We train neural networks of varying depth with a loss function that encourages the output representations to have a temporal profile resembling that of phonemes. We show that a simple loss function, which maximizes the dissimilarity between nearby frames and long-distance frames, helps construct a speech embedding that improves phoneme discriminability both within and across speakers, even though the loss function uses only within-speaker information. However, with too deep an architecture, this loss function leads to overfitting, suggesting the need for more data and/or regularization.
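The exact objective is not given in the abstract; the following is a minimal sketch of one plausible reading, assuming a margin-based contrastive loss in which frame embeddings that are temporally close are pulled together and embeddings that are far apart are pushed beyond a margin. The thresholds `near`, `far`, and `margin` are illustrative parameters, not values from the paper.

```python
import numpy as np

def temporal_contrastive_loss(frames, near=2, far=15, margin=1.0):
    """Hedged sketch of a temporal contrastive objective.

    frames: (T, D) array of frame embeddings from one utterance
            (within-speaker information only, as in the abstract).
    Pairs at most `near` steps apart are treated as same-phoneme-like
    and pulled together; pairs at least `far` steps apart are treated
    as different and pushed beyond `margin`.
    """
    T = frames.shape[0]
    pos_terms, neg_terms = [], []
    for i in range(T):
        for j in range(i + 1, T):
            d = np.linalg.norm(frames[i] - frames[j])
            if j - i <= near:
                # near frames: penalize any distance
                pos_terms.append(d ** 2)
            elif j - i >= far:
                # distant frames: penalize only if closer than the margin
                neg_terms.append(max(0.0, margin - d) ** 2)
    loss = 0.0
    if pos_terms:
        loss += float(np.mean(pos_terms))
    if neg_terms:
        loss += float(np.mean(neg_terms))
    return loss
```

For a sequence of identical embeddings, the near-frame term is zero and every distant pair contributes the full margin penalty, so the loss pushes the network toward representations that change over phoneme-scale time spans while staying stable locally.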
