Article ID: 534417
Journal: Pattern Recognition Letters
Published Year: 2014
Pages: 9 Pages
File Type: PDF
Abstract

• We proposed a lexicon expansion approach to improve latent variable grammars.
• The lexicon expansion is based on a transductive graph propagation technique.
• We constructed a word-level k-NN similarity graph over labeled and unlabeled data.
• We used an unnormalized propagation algorithm to infer emission probabilities.
• Lexicon expansion with self-training can further improve latent variable grammars.

This study investigates the use of unlabeled data, i.e., raw text, to strengthen latent variable probabilistic context-free grammars, in particular their lexical models. To this end, a graph-based lexicon expansion approach is proposed. It aims to discover additional lexical knowledge from a large amount of unlabeled data to aid syntactic parsing. The approach is based on a transductive graph-based label propagation technique: it builds k-nearest-neighbor (k-NN) similarity graphs over the words of the labeled and unlabeled data and propagates lexical emission probabilities across them. The intuition is that different words occurring in similar syntactic environments should have similar lexical emission distributions. The derived words, together with their lexical emission probabilities, are incorporated into the parser. The approach is particularly effective for parsing out-of-vocabulary (OOV) words. Empirical results for English, Chinese, and Portuguese demonstrate its effectiveness.
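The propagation idea in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration, since the abstract does not specify the features, the similarity measure, or the update rule: the function names `knn_graph` and `propagate`, the cosine similarity, the toy two-dimensional context vectors, and the two-tag emission distributions are all hypothetical. The update shown is plain neighbor averaging with labeled words clamped to their seed distributions, a generic form of label propagation, not the unnormalized algorithm the paper reports using.

```python
import numpy as np


def knn_graph(features, k):
    """Build a symmetric k-NN graph weighted by cosine similarity (assumed metric)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a word is never its own neighbor
    n = sim.shape[0]
    nearest = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar words
    rows = np.repeat(np.arange(n), k)
    cols = nearest.ravel()
    weights = np.zeros((n, n))
    weights[rows, cols] = np.clip(sim[rows, cols], 0.0, None)
    return np.maximum(weights, weights.T)  # keep an edge if either side selected it


def propagate(weights, seed, is_labeled, iterations=50):
    """Average neighbor distributions; labeled rows are reset to their seed each step."""
    dist = seed.copy()
    degree = weights.sum(axis=1, keepdims=True)
    for _ in range(iterations):
        neighbor_sum = weights @ dist
        # Isolated nodes (degree 0) keep their current distribution.
        update = np.divide(neighbor_sum, degree, out=dist.copy(), where=degree > 0)
        dist = np.where(is_labeled[:, None], seed, update)
    return dist


# Toy data: "run" and "cat" come from labeled data, "walk" and "dog" from raw text.
words = ["run", "walk", "cat", "dog"]
features = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
is_labeled = np.array([True, False, True, False])
seed = np.array([
    [0.9, 0.1],  # run: mostly tag 0
    [0.5, 0.5],  # walk: unknown, start uniform
    [0.1, 0.9],  # cat: mostly tag 1
    [0.5, 0.5],  # dog: unknown, start uniform
])

result = propagate(knn_graph(features, k=1), seed, is_labeled)
for word, row in zip(words, result):
    print(word, np.round(row, 3))
```

In this toy setup each unlabeled word is linked only to its closest labeled word, so it ends up with that neighbor's distribution. Starting unlabeled words from a uniform distribution keeps every row a valid probability distribution throughout, because each update is a convex combination of its neighbors' rows.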

Related Topics
Physical Sciences and Engineering Computer Science Computer Vision and Pattern Recognition