Article ID: 461735
Journal: Journal of Systems and Software
Published Year: 2013
Pages: 13 Pages
File Type: PDF
Abstract

• SAAD detects Web Spam pages based on new heuristics combined with a C4.5 classifier.
• The study compares SAAD with previous studies on two well-known public datasets.
• The paper discusses the effectiveness and efficiency of the heuristics used by SAAD.
• SAAD achieves a precision of 0.972 and a recall of 0.850.
• SAAD obtains results 6% to 27% better than the existing systems.
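
To put the reported precision and recall on a single scale, they can be combined into the balanced F-measure (their harmonic mean). The value computed below is derived here from the two figures above for illustration; it is not a result stated in the highlights, and the function name `f_measure` is our own.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the highlights above.
f1 = f_measure(0.972, 0.850)
print(round(f1, 3))  # roughly 0.907 (derived, not reported by the paper)
```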

Web Spam is one of the main difficulties that crawlers have to overcome, and therefore one of the main problems of the WWW. Several studies have characterised and detected Web Spam pages, but none of them deals with all the possible kinds of Web Spam. This paper analyses different kinds of Web Spam pages and identifies new elements that characterise them, in order to define heuristics that can partially detect them. We also discuss several heuristics from the point of view of their effectiveness and computational efficiency. Taking both into account, we study several sets of heuristics and show how they improve on current results. Finally, we propose a new Web Spam detection system called SAAD (Spam Analyzer And Detector), which uses the proposed heuristics as input to a C4.5 classifier improved by means of Bagging and Boosting techniques. We have tested the system on well-known Web Spam datasets and found it to be very effective.
