Fast text categorization using concise semantic analysis

Article code	Journal code	Publication year	English article	Full-text version
536036	870439	2011	8-page PDF	Free download

Keywords

Semantic analysis; Text categorization; Text representation; Dimensionality reduction

Related topics

Engineering and Basic Sciences; Computer Engineering; Computer Vision and Pattern Recognition

Abstract

Text representation is a necessary step in text categorization. Currently, bag of words (BOW) is the most widely used text representation method, but it suffers from two drawbacks. First, the number of words is huge; second, it is not feasible to calculate the relationships between words. Semantic analysis (SA) techniques help BOW overcome these two drawbacks by interpreting words and documents in a space of concepts. However, existing SA techniques are not designed for text categorization and often incur a huge computing cost. This paper proposes a concise semantic analysis (CSA) technique for text categorization tasks. CSA extracts a few concepts from category labels and then interprets words and documents concisely in terms of those concepts. These concepts are few in number, highly general, and tightly related to the category labels. Therefore, CSA preserves the information classifiers need at very low computing cost. To evaluate CSA, experiments were conducted on three data sets (Reuters-21578, 20-NewsGroup and Tancorp); the results show that CSA achieves micro- and macro-F1 performance comparable to, if not better than, that of BOW. The experiments also show that CSA helps dimension-sensitive learning algorithms such as k-nearest neighbor (kNN) overcome the “curse of dimensionality” and, as a result, reach performance comparable to that of support vector machines (SVM) in text categorization applications. In addition, CSA is language independent and performs equally well in Chinese and English.
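The abstract reports both micro- and macro-F1. As a reminder of how the two averages differ, here is a minimal sketch that assumes a single-label setting (one true and one predicted category per document) and a non-empty evaluation set; the function names `f1` and `micro_macro_f1` are illustrative and do not come from the paper.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {c: [0, 0, 0] for c in labels}  # [tp, fp, fn] per category
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t][0] += 1
        else:
            counts[p][1] += 1  # false positive for the predicted category
            counts[t][2] += 1  # false negative for the true category
    # Macro: average the per-category F1 scores (every category counts equally).
    macro = sum(f1(*counts[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all categories, then compute one F1.
    micro = f1(*[sum(col) for col in zip(*counts.values())])
    return micro, macro


micro, macro = micro_macro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Micro-F1 is dominated by large categories, while macro-F1 gives small categories the same weight as large ones, which is why results on skewed corpora are usually reported with both.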

Research highlights
► The contributions of this paper are threefold.
► First, a new methodology for extracting concepts from category labels is proposed. It is simple but efficient, and it is designed specifically for text categorization applications.
► Second, a new weighting method for calculating the degree of relationship between words and concepts is proposed. The new method takes document length into consideration and gives higher weights to occurrences of words in short documents.
► Finally, the proposed approach is evaluated on three different corpora with two commonly used learning algorithms. The experimental results and analysis may provide useful information for future research on this topic.
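The idea behind the first two highlights can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's method: each category label is treated as one concept, and each occurrence of a word contributes 1 / (document length) to the concept of its document's label, so occurrences in short documents weigh more. The exact weighting formula is not given on this page, and the names `word_concept_weights` and `interpret_document` are hypothetical.

```python
from collections import Counter, defaultdict


def word_concept_weights(docs, labels):
    """Map each word to a vector over concepts (one concept per category label).

    Assumed weighting: count / len(document), so a word occurrence in a
    short document contributes more than one in a long document.
    """
    concepts = sorted(set(labels))
    index = {c: i for i, c in enumerate(concepts)}
    weights = defaultdict(lambda: [0.0] * len(concepts))
    for tokens, label in zip(docs, labels):
        if not tokens:
            continue  # skip empty documents to avoid dividing by zero
        for word, count in Counter(tokens).items():
            weights[word][index[label]] += count / len(tokens)
    return concepts, dict(weights)


def interpret_document(tokens, weights, n_concepts):
    """Represent a document as the L1-normalized sum of its words' concept vectors."""
    vec = [0.0] * n_concepts
    for word in tokens:
        for i, w in enumerate(weights.get(word, ())):  # unknown words add nothing
            vec[i] += w
    total = sum(vec)
    return [v / total for v in vec] if total else vec


# Toy training corpus: tokenized documents with one category label each.
docs = [["oil", "price", "rises"], ["oil", "exports"],
        ["team", "wins", "match"], ["match", "report"]]
labels = ["energy", "energy", "sport", "sport"]

concepts, weights = word_concept_weights(docs, labels)
doc_vec = interpret_document(["oil", "match", "oil"], weights, len(concepts))
```

Here every document is reduced to a vector with one dimension per category (two in the toy corpus) instead of one dimension per vocabulary word, which is the kind of dimensionality reduction that makes distance-based learners such as kNN practical.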

Publisher

Database: Elsevier - ScienceDirect
Journal: Pattern Recognition Letters - Volume 32, Issue 3, 1 February 2011, Pages 441–448

Authors

Zhixing Li, Zhongyang Xiong, Yufang Zhang, Chunyong Liu, Kuan Li
