Article ID: 455876
Journal: Computers & Security
Published Year: 2014
Pages: 15
File Type: PDF
Abstract

Web spam is a method of manipulating search engine results by artificially raising the rank of spam pages. It takes various forms and lacks a consistent definition. Web spam detectors use machine learning techniques to detect spam; however, these detectors are mostly verified on data sets from the same year as the training sets. In this paper we compared Support Vector Machine classifiers trained and tested on WEBSPAM-UK data sets from different years. To obtain stable results we proposed new lexical features. The HTML document, transformed into plain text without HTML tags, a set of visible symbols, and a list of links (including those extracted from tags), provided information about unusual letter combinations; consonant clusters; statistics on syllables, words, and sentences; and the Gunning Fog Index. Using data collected in 2006 as the training set, we obtained very stable accuracy across years. This choice of training set reduced the sensitivity in 2007, but that can be improved by adjusting the acceptance threshold. Finally, we demonstrated that the balance between sensitivity and specificity, measured by the Area Under the Curve (AUC), is improved by our selection of features.
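
As an illustration of the kind of lexical features the abstract describes, the sketch below computes word and sentence statistics, the longest consonant cluster, and the Gunning Fog Index from tag-stripped text. This is a minimal sketch, not the authors' implementation: all function and variable names are hypothetical, and the syllable counter is only a rough heuristic.

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels (y included)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def lexical_features(text: str) -> dict:
    """Compute simple lexical/readability statistics from plain text
    (HTML tags are assumed to have been stripped already)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not words or not sentences:
        return {}

    syllables = [count_syllables(w) for w in words]
    complex_words = sum(1 for s in syllables if s >= 3)  # words with 3+ syllables
    longest_consonant_run = max(
        (len(m) for m in re.findall(r"[bcdfghjklmnpqrstvwxz]+", text.lower())),
        default=0,
    )

    avg_sentence_len = len(words) / len(sentences)
    # Gunning Fog Index: 0.4 * (avg sentence length + 100 * fraction of complex words)
    fog_index = 0.4 * (avg_sentence_len + 100.0 * complex_words / len(words))

    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sentence_len": avg_sentence_len,
        "max_consonant_cluster": longest_consonant_run,
        "gunning_fog": fog_index,
    }

if __name__ == "__main__":
    sample = "Buy cheap pills online. Best prices guaranteed for pharmaceutical products."
    print(lexical_features(sample))
```

Feature vectors of this kind would then be passed to the SVM classifier; shifting the classifier's acceptance threshold trades sensitivity against specificity, which is the balance the AUC summarizes.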

Related Topics
Physical Sciences and Engineering Computer Science Computer Networks and Communications
Authors