Lexicon expansion for latent variable grammars

Article ID	Journal	Published Year	Pages	File Type
534417	Pattern Recognition Letters	2014	9 Pages	PDF

Abstract

•We proposed a lexicon expansion approach to improve latent variable grammars.
•The lexicon expansion is based on a transductive graph propagation technique.
•We constructed a word-level k-NN similarity graph over labeled and unlabeled data.
•We used an unnormalized propagation algorithm to infer emission probabilities.
•Lexicon expansion with self-training can further improve latent variable grammars.

This study investigates the use of unlabeled data, i.e., raw text, to strengthen latent variable probabilistic context-free grammars, in particular their lexical models. A graph-based lexicon expansion approach is proposed to achieve this goal. It aims to discover additional lexical knowledge from a large amount of unlabeled data to aid syntactic parsing. The proposed approach is based on a transductive graph-based label propagation technique. It builds k-nearest-neighbor (k-NN) similarity graphs over the words of the labeled and unlabeled data and propagates lexical emission probabilities over these graphs. The intuition is that different words occurring in similar syntactic environments should have similar lexical emission distributions. The derived words, together with their lexical emission probabilities, are incorporated into the parsing process. This approach is very effective in parsing out-of-vocabulary (OOV) words. Empirical results for English, Chinese, and Portuguese demonstrate its effectiveness.
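
The general idea of propagating tag scores over a word-level k-NN graph can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the update rule, the similarity measure, or the features, so the cosine similarity, the clamping of labeled words, the plain weighted-sum update, and all names and toy vectors below (`knn_graph`, `propagate`, `features`, `seed`) are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(features, k):
    """Link each word to its k most similar other words (assumed: cosine weights)."""
    graph = {}
    for word, vec in features.items():
        sims = [(cosine(vec, other_vec), other)
                for other, other_vec in features.items() if other != word]
        sims.sort(reverse=True)
        graph[word] = {other: s for s, other in sims[:k] if s > 0}
    return graph


def propagate(graph, seed, n_tags, iterations=10):
    """Labeled words stay clamped to their seed scores; every other word takes
    the weighted sum of its neighbors' scores, with no row normalization."""
    scores = {w: list(seed.get(w, [0.0] * n_tags)) for w in graph}
    for _ in range(iterations):
        updated = {}
        for word, neighbors in graph.items():
            if word in seed:
                updated[word] = list(seed[word])
            else:
                updated[word] = [
                    sum(weight * scores[n][t] for n, weight in neighbors.items())
                    for t in range(n_tags)
                ]
        scores = updated
    return scores


# Hypothetical toy data: 2-dimensional context features, tags = (noun, verb).
features = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "run": [0.1, 1.0],
    "puppy": [0.95, 0.15],  # stands in for a word seen only in unlabeled text
}
seed = {"cat": [1.0, 0.0], "dog": [1.0, 0.0], "run": [0.0, 1.0]}

graph = knn_graph(features, k=2)
scores = propagate(graph, seed, n_tags=2)
```

In this toy setting the unlabeled word `puppy` is linked to `cat` and `dog`, so its noun score ends up larger than its verb score. Turning such scores into emission probabilities for a grammar would need a further normalization step, which is not shown here.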

Keywords

Graph-based label propagation