Article code: 382277
Journal code: 660754
Publication year: 2014
English article: 15-page PDF
Full-text version: Free download
English title of the ISI article
Mining language variation using word using and collocation characteristics
Persian translation of the title
تنوع زیستی معادن با استفاده از کلمه استفاده و ویژگی های همپوشانی
Keywords
Language variation, Text mining, Frequency Rank Ratio, Overall Intimacy
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
English abstract


• Two metrics are proposed for extracting language variation characteristics.
• Two textual features are derived by employing the two proposed textual metrics.
• Using our features, language variation cues can be visualized.
• Our method can display language changes when semantics and syntax are unknown.
• Both entropy-based analysis and simulations prove the feasibility of our algorithm.

Two textual metrics, “Frequency Rank” (FR) and “Intimacy”, are proposed in this paper to measure word usage and collocation characteristics, which are two important aspects of text style. The FR, derived from the local index numbers of the terms in a sentence ordered by the global frequency of terms, provides single-term-level information. Intimacy models the relationship between a word and the others, i.e., how close a term is to the other terms in the same sentence. Two textual features for capturing language variation, “Frequency Rank Ratio” (FRR) and “Overall Intimacy” (OI), are derived from the two proposed metrics. Using the derived features, language variation among documents can be visualized in a text space. Three corpora consisting of documents of diverse topics, genres, regions, and dates of writing are designed and collected to evaluate the proposed algorithms. Extensive simulations are conducted to verify the feasibility and performance of our implementation. Both theoretical analyses based on entropy and the simulations demonstrate the feasibility of our method. We also show that the proposed algorithm can be used for visualizing the closeness of several Western languages. Variation of modern English over time is also recognizable when using our analysis method. Finally, our method is compared to conventional text classification implementations. The comparative results indicate that our method outperforms the others.
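
The abstract describes FR only in words, so the following is a minimal sketch of one possible reading, not the paper's actual definition. It assumes that a sentence's distinct terms are sorted by descending corpus-wide frequency and that each term's 1-based position in that ordering is its local index. The function names, the alphabetical tie-breaking rule, and the toy corpus are all hypothetical.

```python
from collections import Counter

def global_frequencies(sentences):
    """Count how often each term occurs across the whole corpus."""
    return Counter(term for sentence in sentences for term in sentence)

def frequency_ranks(sentence, global_freq):
    """Return each distinct term's local index (1-based) within one sentence.

    Terms are ordered by descending global frequency. Ties are broken
    alphabetically, which is an assumption made here for determinism.
    """
    ordered = sorted(set(sentence), key=lambda t: (-global_freq[t], t))
    return {term: i for i, term in enumerate(ordered, start=1)}

# Toy corpus (hypothetical): each sentence is a list of tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["a", "cat", "ran"],
]

freq = global_frequencies(corpus)
ranks = frequency_ranks(corpus[1], freq)
print(ranks)  # {'the': 1, 'sat': 2, 'dog': 3}
```

In this reading, a term that is common across the corpus receives a low local index in every sentence it appears in, which is the kind of single-term-level information the abstract attributes to FR. How FRR and OI are then computed from these values is not stated in the abstract and is not sketched here.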

Publisher
Database: Elsevier - ScienceDirect
Journal: Expert Systems with Applications - Volume 41, Issue 17, 1 December 2014, Pages 7805–7819