Syntactic Ngrams as Keystructures Reflecting Typical Syntactic Patterns of Corpora in Finnish

Article ID	Journal	Published Year	Pages	File Type
1110945	Procedia - Social and Behavioral Sciences	2015	9 Pages	PDF

Abstract

This article studies syntactic ngrams, i.e. small subtrees of dependency syntax analyses, as keystructures reflecting the syntactic characteristics of corpora. While traditional keywords are words that are statistically over- or underrepresented in a corpus and are often informative about its topic and style, the unlexicalized syntactic ngrams applied in this study extend the level of description beyond individual words to sequences of syntactic elements. The article analyzes the utility of these sequences in corpus description and gives first results on the structural characteristics they reflect in the studied texts, which include Finnish literature, Internet forum discussions from the major Finnish social networking website, and Internet discussions following the news and editorials on the major Finnish newspaper's website. The syntactic ngrams are produced with the freely available Finnish Dependency Parser and Ngram Builder, and the keystructures are analyzed with a linear classifier. The results suggest that syntactic ngrams capture both topical features, such as names and Internet URLs discussed in the corpora, and structural characteristics, such as subject-verb combinations, negations and informal sentence structures. They thus both generalize the information given by traditional keywords from individual words to concepts and provide new knowledge about typical constructions not reached by lexemes.
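To make the notion of an unlexicalized syntactic ngram concrete, the following minimal Python sketch builds the smallest possible subtrees, single head-dependent arcs, from a hand-written dependency parse and replaces the words with their part-of-speech tags. The `Token` type, the `unlexicalized_bigrams` function, the string notation of the ngrams and the tags in the toy sentence are all assumptions made for illustration. They do not reproduce the output format of the Finnish Dependency Parser or the Ngram Builder, and the ngrams used in the article may span larger subtrees.

```python
from collections import Counter
from typing import List, NamedTuple


class Token(NamedTuple):
    """One token of a dependency parse (hypothetical, simplified layout)."""
    idx: int     # 1-based position in the sentence
    form: str    # surface word, dropped when the ngram is unlexicalized
    pos: str     # part-of-speech tag
    head: int    # position of the governing token, 0 for the root
    deprel: str  # dependency relation to the head


def unlexicalized_bigrams(sentence: List[Token]) -> List[str]:
    """Return one 'HEADPOS >deprel DEPPOS' string per dependency arc."""
    by_idx = {tok.idx: tok for tok in sentence}
    grams = []
    for tok in sentence:
        if tok.head == 0:
            continue  # the root has no governing token, so no arc
        head = by_idx[tok.head]
        grams.append(f"{head.pos} >{tok.deprel} {tok.pos}")
    return grams


# Toy parse of "Minä en tiedä" ("I do not know"); tags and relation
# labels are illustrative assumptions, not real parser output.
sentence = [
    Token(1, "Minä", "PRON", 3, "nsubj"),
    Token(2, "en", "AUX", 3, "neg"),
    Token(3, "tiedä", "VERB", 0, "root"),
]

# Counting the ngrams over a corpus gives the features that a
# keyness measure or a linear classifier could then weigh.
counts = Counter(unlexicalized_bigrams(sentence))
```

Because the word forms are discarded, the subject arc of this sentence and that of any other sentence with a pronoun subject collapse into the same feature, which is what lets such structures generalize beyond individual keywords.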