Article ID: 455876
Journal: Computers & Security
Published Year: 2014
Pages: 15
File Type: PDF
Abstract

Web spam is a method of manipulating search engine results by artificially raising the rank of spam pages. It takes various forms and lacks a consistent definition. Web spam detectors use machine learning techniques to detect spam; however, these detectors are mostly verified on data sets from the same year as the training sets. In this paper we compared Support Vector Machine classifiers trained and tested on WEBSPAM-UK data sets from different years. To obtain stable results we proposed new lexical features. The HTML document, transformed into plain text without HTML tags, a set of visible symbols, and a list of links (including those extracted from tags), provided information about unusual letter combinations; consonant clusters; statistics on syllables, words, and sentences; and the Gunning Fog Index. Using data collected in 2006 as the training set, we obtained very stable accuracy across years. This choice of training set reduced the sensitivity in 2007, but that can be improved by adjusting the acceptance threshold. Finally, we demonstrated that the balance between sensitivity and specificity, measured by the Area Under the Curve (AUC), is improved by our selection of features.
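
As an illustration of the kind of lexical features the abstract describes, the sketch below computes word and sentence statistics, the longest consonant cluster, and the Gunning Fog Index from tag-stripped text. This is a minimal sketch, not the authors' implementation: all function and variable names are hypothetical, and the syllable counter is only a rough heuristic.

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels (y included)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def lexical_features(text: str) -> dict:
    """Compute simple lexical/readability statistics from plain text
    (HTML tags are assumed to have been stripped already)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not words or not sentences:
        return {}

    syllables = [count_syllables(w) for w in words]
    complex_words = sum(1 for s in syllables if s >= 3)  # words with 3+ syllables
    longest_consonant_run = max(
        (len(m) for m in re.findall(r"[bcdfghjklmnpqrstvwxz]+", text.lower())),
        default=0,
    )

    avg_sentence_len = len(words) / len(sentences)
    # Gunning Fog Index: 0.4 * (avg sentence length + 100 * fraction of complex words)
    fog_index = 0.4 * (avg_sentence_len + 100.0 * complex_words / len(words))

    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sentence_len": avg_sentence_len,
        "max_consonant_cluster": longest_consonant_run,
        "gunning_fog": fog_index,
    }

if __name__ == "__main__":
    sample = "Buy cheap pills online. Best prices guaranteed for pharmaceutical products."
    print(lexical_features(sample))
```

Feature vectors of this kind would then be passed to the SVM classifier; shifting the classifier's acceptance threshold trades sensitivity against specificity, which is the balance the AUC summarizes.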

Related Topics
Physical Sciences and Engineering Computer Science Computer Networks and Communications
Authors