Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
6940891 | Pattern Recognition Letters | 2016 | 10 Pages |
Abstract
In this paper, we present a method for unsupervised morphological segmentation of multi-slot morphology based on affixality measurements. These measurements quantify three linguistic characteristics of affixes: (1) they combine with many low-frequency word bases (high combinatorial capacity); (2) although they are relatively few, they help to maximize the size of a lexicon (economy principle), i.e. speakers know more words by remembering fewer morphological items; and (3) they are very frequent, so they carry less information than word bases (entropy), i.e. borders between affixes and stems can be detected by finding entropy peaks. We conducted several experiments combining these measurements to find the best way to apply them to data. The best strategy consists of successive segmentation whenever the average of the affixality measurements exceeds a threshold of 0.5. We also compared this strategy with two state-of-the-art methods for unsupervised morphological segmentation, Morfessor and ParaMor. Our method outperformed both when tested on a hand-made corpus. The results indicate that our proposal is competitive, at least for the morphological segmentation of Spanish words.
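The decision rule described in the abstract (average the affixality measurements and segment successively while the average exceeds 0.5) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not define how each measurement is computed or normalized, so the sketch assumes a hypothetical `score` callback that returns three values already scaled to [0, 1]. It also assumes suffix-only stripping and that the highest-scoring boundary is cut first. The function name `segment`, the parameter `min_stem`, and the toy scores are all illustrative.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# (combinatorial capacity, economy, entropy); each assumed normalized to [0, 1]
Scores = Tuple[float, float, float]

THRESHOLD = 0.5  # threshold on the averaged measurements, as stated in the abstract


def segment(word: str, score: Callable[[str, str], Scores], min_stem: int = 2) -> List[str]:
    """Successively strip the suffix whose mean affixality exceeds THRESHOLD.

    Assumption: only suffixes are stripped, and at each step the candidate
    boundary with the highest average is chosen.
    """
    stem = word
    suffixes: List[str] = []
    while True:
        best_cut, best_value = None, THRESHOLD
        for cut in range(min_stem, len(stem)):
            value = mean(score(stem[:cut], stem[cut:]))
            if value > best_value:
                best_cut, best_value = cut, value
        if best_cut is None:
            break  # no remaining boundary is affix-like enough
        suffixes.insert(0, stem[best_cut:])
        stem = stem[:best_cut]
    return [stem] + suffixes


# Hypothetical, hand-picked scores for illustration only (not corpus-derived).
TOY_SCORES: Dict[Tuple[str, str], Scores] = {
    ("gatito", "s"): (0.9, 0.8, 0.7),
    ("gatit", "o"): (0.8, 0.7, 0.6),
    ("gat", "it"): (0.7, 0.6, 0.5),
}


def toy_score(stem: str, suffix: str) -> Scores:
    """Look up toy scores; unknown boundaries get zero affixality."""
    return TOY_SCORES.get((stem, suffix), (0.0, 0.0, 0.0))


if __name__ == "__main__":
    print(segment("gatitos", toy_score))
```

In a real setting, `toy_score` would be replaced by measurements estimated from a corpus. The loop itself only encodes the threshold-and-repeat strategy.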
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Vision and Pattern Recognition
Authors
Carlos-Francisco Méndez-Cruz, Alfonso Medina-Urrea, Gerardo Sierra