Article code: 455876
Journal code: 695595
Publication year: 2014
English article: 15-page PDF
Full-text version: free download
English title of the ISI article
Stable web spam detection using features based on lexical items
Persian translation of the title
تشخیص هرزنامه دائمی با استفاده از ویژگی های مبتنی بر آیتم های واژگانی
Keywords
Web spam detection; spam detection features; regular expressions; content analysis; lexical items analysis
Related topics
Engineering and Basic Sciences; Computer Engineering; Computer Networks and Communications
English abstract

Web spam is a method of manipulating search engine results by improving the ranks of spam pages. It takes various forms and lacks a consistent definition. Web spam detectors use machine learning techniques to detect spam. However, the detectors are mostly verified on data sets coming from the same year as the learning sets. In this paper we compared Support Vector Machine classifiers trained and tested on WEBSPAM-UK data sets from different years. To obtain stable results we proposed new lexical-based features. The HTML document (transformed into a text without HTML tags, a set of visible symbols, and a list of links including the ones from tags) gave information about weird combinations of letters; consonant clusters; statistics on syllables, words, and sentences; and the Gunning Fog Index. Using data collected in 2006 as a learning set, we obtained very stable accuracy across years. This choice of the training set reduced the sensitivity in 2007, but that can be improved by managing the acceptance threshold. Finally, we proved that the balance between the sensitivity and the specificity measured by the Area Under the Curve (AUC) is improved by our selection of features.
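
As a rough illustration of the kind of lexical features the abstract names, the sketch below computes a consonant-cluster length and the Gunning Fog Index for plain text. It is not the paper's implementation: the function names, the vowel-run syllable heuristic, and the treatment of "y" as a vowel are all assumptions made for this example. The Fog formula itself is the standard one, 0.4 × (words per sentence + 100 × share of words with three or more syllables).

```python
import re


def count_syllables(word: str) -> int:
    """Heuristic (assumption): one syllable per run of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def longest_consonant_cluster(word: str) -> int:
    """Length of the longest run of consecutive consonants ('y' counted as a vowel)."""
    runs = re.findall(r"[b-df-hj-np-tv-xz]+", word.lower())
    return max((len(r) for r in runs), default=0)


def gunning_fog(text: str) -> float:
    """Standard Gunning Fog Index, using the heuristic syllable counter above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are those with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    words_per_sentence = len(words) / len(sentences)
    complex_share = 100.0 * len(complex_words) / len(words)
    return 0.4 * (words_per_sentence + complex_share)
```

Values such as these could be fed to a classifier as numeric features alongside others; which exact statistics the authors used, and how they were scaled, is not stated in the abstract.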

Publisher
Database: Elsevier - ScienceDirect
Journal: Computers & Security - Volume 46, October 2014, Pages 79–93