| Article ID | Journal | Published Year | Pages | File Type |
|---|---|---|---|---|
| 485441 | Procedia Computer Science | 2016 | 6 | |
Abstract
We train neural networks of varying depth with a loss function that constrains the output representations to have a temporal profile resembling that of phonemes. We show that a simple loss function that maximizes the dissimilarity between near frames and long-distance frames helps construct a speech embedding that improves phoneme discriminability, both within and across speakers, even though the loss uses only within-speaker information. However, with too deep an architecture, this loss function leads to overfitting, suggesting the need for more data and/or regularization.
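To make the idea concrete, below is a minimal sketch of such a temporal coherence loss in PyTorch. The abstract does not specify the exact formulation, so the window sizes `near` and `far`, the use of cosine similarity, and the toy encoder are all illustrative assumptions, not the paper's method: frames a few steps apart are pushed to be similar, frames far apart to be dissimilar.

```python
# Minimal sketch (assumed formulation, not the paper's exact loss):
# make temporally close frame embeddings similar and distant ones dissimilar.
import torch
import torch.nn.functional as F

def temporal_coherence_loss(emb: torch.Tensor, near: int = 2, far: int = 20) -> torch.Tensor:
    """emb: (T, D) frame embeddings from one utterance (within-speaker only)."""
    # Cosine similarity between each frame and the frame `near` steps ahead.
    sim_near = F.cosine_similarity(emb[:-near], emb[near:], dim=1)
    # Cosine similarity between each frame and the frame `far` steps ahead.
    sim_far = F.cosine_similarity(emb[:-far], emb[far:], dim=1)
    # Push near-frame similarity toward 1 and far-frame similarity toward -1.
    return (1.0 - sim_near).mean() + (1.0 + sim_far).mean()

# Hypothetical usage with a small feed-forward encoder over 40-dim frames.
if __name__ == "__main__":
    torch.manual_seed(0)
    frames = torch.randn(100, 40)  # 100 frames, 40-dim acoustic features
    encoder = torch.nn.Sequential(torch.nn.Linear(40, 100), torch.nn.Sigmoid())
    loss = temporal_coherence_loss(encoder(frames))
    loss.backward()
    print(float(loss))
```

Note that this loss needs no labels and no cross-speaker pairs, consistent with the abstract's claim that only within-speaker information is used during training.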
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Science (General)
Authors
Gabriel Synnaeve, Emmanuel Dupoux