Article ID Journal Published Year Pages File Type
382277 Expert Systems with Applications 2014 15 Pages PDF
Abstract

• Two metrics are proposed for extracting language variation characteristics.
• Two textual features are derived by employing the two proposed textual metrics.
• Using our features, language variation cues can be visualized.
• Our method can display language changes when semantics and syntax are unknown.
• Both entropy-based analysis and simulations prove the feasibility of our algorithm.

Two textual metrics, “Frequency Rank” (FR) and “Intimacy”, are proposed in this paper to measure word usage and collocation characteristics, two important aspects of text style. FR, derived from the local index numbers of the terms in a sentence when ordered by the global frequency of those terms, provides single-term-level information. Intimacy models the relationship between a word and the others, i.e. how close a term is to the other terms in the same sentence. Two textual features for capturing language variation, “Frequency Rank Ratio” (FRR) and “Overall Intimacy” (OI), are derived from the two proposed metrics. Using the derived features, language variation among documents can be visualized in a text space. Three corpora consisting of documents of diverse topics, genres, regions, and dates of writing are designed and collected to evaluate the proposed algorithms. Extensive simulations are conducted to verify the feasibility and performance of our implementation. Both the entropy-based theoretical analyses and the simulations demonstrate the feasibility of our method. We also show that the proposed algorithm can be used to visualize the closeness of several Western languages. Variation of modern English over time is also recognizable with our analysis method. Finally, our method is compared to conventional text classification implementations. The comparative results indicate that our method outperforms the others.
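
The abstract gives only an informal description of the two metrics, so the following Python sketch is an illustration under stated assumptions rather than the paper's definitions. It assumes FR is the 1-based position of a term when the distinct terms of a sentence are sorted by descending global frequency, and it substitutes a plain sentence-level co-occurrence count for Intimacy. The function names and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def global_frequencies(sentences):
    """Count how often each term occurs across the whole corpus."""
    return Counter(term for sentence in sentences for term in sentence)


def frequency_ranks(sentence, freq):
    """Assumed reading of FR: the local index of each distinct term when the
    sentence's terms are ordered by global frequency (most frequent first).
    Ties are broken alphabetically so the result is deterministic."""
    ordered = sorted(set(sentence), key=lambda t: (-freq[t], t))
    return {term: i + 1 for i, term in enumerate(ordered)}


def cooccurrence_intimacy(sentences):
    """Assumed stand-in for Intimacy: the number of sentences in which a
    pair of terms appears together. The paper's formula may differ."""
    pairs = Counter()
    for sentence in sentences:
        for a, b in combinations(sorted(set(sentence)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical toy corpus: each sentence is a list of tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

freq = global_frequencies(corpus)
ranks = frequency_ranks(corpus[0], freq)
intimacy = cooccurrence_intimacy(corpus)

print(ranks)
print(intimacy[("cat", "the")])
```

In this toy corpus, "the" receives local rank 1 in the first sentence because it is the most frequent term globally, and the pair ("cat", "the") co-occurs in two sentences. How the paper aggregates such values into FRR and OI is not stated in the abstract, so that step is left out.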

Related Topics
Physical Sciences and Engineering Computer Science Artificial Intelligence