Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
385996 | 660876 | 2011 | 9-page PDF | Free download |

The rapid growth in the amount of information on the World Wide Web has given rise to topic-specific (focused) crawling of the Web. During a focused crawl, an automatic Web page classification mechanism is needed to determine whether the page under consideration is on topic. In this study, we develop a genetic algorithm (GA) based automatic Web page classification system that uses both HTML tags and the terms belonging to each tag as classification features, and that learns an optimal classifier from the positive and negative Web pages in the training dataset. The system classifies a new Web page simply by computing the similarity between the learned classifier and that page. Existing GA-based classifiers use only HTML tags or only terms as features; in this study both are used together, and the GA learns optimal weights for the features. We found that using HTML tags and the terms in each tag as separate features improves classification accuracy. We also found that the number of documents in the training dataset affects accuracy: when negative documents outnumber positive documents in the training dataset, the classification accuracy of our system rises to as much as 95% and exceeds that of the well-known Naïve Bayes and k-nearest-neighbor classifiers.
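To make the feature scheme concrete, the following is a minimal sketch of how (tag, term) pairs might be extracted from a page and compared against a learned weight vector. It is not the authors' implementation: the abstract does not name the similarity measure, so cosine similarity is an assumption here, and the names `TagTermExtractor`, `extract_features`, `is_on_topic`, the hand-written example weights, and the threshold of 0.3 are all hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they are skipped so they do not capture text.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TagTermExtractor(HTMLParser):
    """Counts each term under its innermost enclosing tag, giving (tag, term) features."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.features = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags in between.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else "body"
        for term in re.findall(r"[a-z]+", data.lower()):
            self.features[(tag, term)] += 1


def extract_features(html):
    parser = TagTermExtractor()
    parser.feed(html)
    parser.close()
    return parser.features


def cosine_similarity(weights, features):
    # Assumed similarity measure; the abstract only says "similarity".
    dot = sum(w * features.get(key, 0) for key, w in weights.items())
    norm_w = math.sqrt(sum(w * w for w in weights.values()))
    norm_f = math.sqrt(sum(v * v for v in features.values()))
    if norm_w == 0 or norm_f == 0:
        return 0.0
    return dot / (norm_w * norm_f)


def is_on_topic(weights, html, threshold=0.3):
    # The threshold is an illustrative value, not one taken from the paper.
    return cosine_similarity(weights, extract_features(html)) >= threshold


# Hand-written weights standing in for a classifier a GA would have learned.
weights = {
    ("title", "genetic"): 0.9,
    ("title", "algorithm"): 0.9,
    ("p", "crossover"): 0.5,
    ("p", "mutation"): 0.5,
}
on_topic_page = (
    "<html><head><title>Genetic algorithm tutorial</title></head>"
    "<body><p>Crossover and mutation explained.</p></body></html>"
)
off_topic_page = (
    "<html><head><title>Pasta recipes</title></head>"
    "<body><p>Boil water and add salt.</p></body></html>"
)
print(is_on_topic(weights, on_topic_page))
print(is_on_topic(weights, off_topic_page))
```

Keying each feature on the (tag, term) pair rather than on the term alone lets the same word carry a different weight in a title than in body text, which is the distinction the abstract draws between using tags and terms together and using either alone.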
Research highlights
► Automatic Web page classifiers are essential to vertical search engines.
► The number of features (i.e., the dimensionality) in the Web page classification problem is high.
► High-dimensional data can be classified with genetic algorithms, Naïve Bayes, and k-nearest-neighbor algorithms.
► If the training dataset includes at least 50% negative documents, genetic algorithms perform best.
► When both HTML tags and terms are used together as features, classification accuracy increases.
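As an illustration of how a genetic algorithm could learn feature weights from positive and negative training pages, here is a toy sketch. The paper's actual encoding, operators, and fitness function are not given in this summary, so everything below is an assumption: real-valued weights in [0, 1), training accuracy as fitness, one-point crossover, per-gene random-reset mutation, keeping the better half of each generation, and the hypothetical names `fitness` and `evolve`. The four feature columns and the seven training vectors are invented for the example.

```python
import math
import random


def cosine(w, x):
    dot = sum(a * b for a, b in zip(w, x))
    norm_w = math.sqrt(sum(a * a for a in w))
    norm_x = math.sqrt(sum(b * b for b in x))
    return dot / (norm_w * norm_x) if norm_w and norm_x else 0.0


def fitness(w, pages, labels, threshold=0.5):
    """Fraction of training pages classified correctly (assumed fitness function)."""
    correct = sum((cosine(w, x) >= threshold) == y for x, y in zip(pages, labels))
    return correct / len(pages)


def evolve(pages, labels, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n = len(pages[0])

    def score(w):
        return fitness(w, pages, labels)

    population = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        elite = population[: pop_size // 2]  # the better half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            mother, father = rng.sample(elite, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: occasionally reset a gene to a fresh random weight.
            child = [rng.random() if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = elite + children
    best = max(population, key=score)
    history.append(score(best))
    return best, history


# Invented feature columns: (title, genetic), (p, crossover), (title, recipe), (p, salt).
pages = [
    [2, 1, 0, 0], [1, 2, 0, 0], [1, 1, 0, 1],                 # positive (on-topic) pages
    [0, 0, 2, 1], [0, 0, 1, 2], [0, 1, 1, 1], [1, 0, 2, 2],   # negative (off-topic) pages
]
labels = [True, True, True, False, False, False, False]

best_weights, history = evolve(pages, labels)
print("training accuracy per generation:", history)
```

Because the better half of each generation is carried over unchanged, the best training accuracy never decreases from one generation to the next. Accuracy on seven toy vectors says nothing about the 95% figure reported above, which comes from the authors' own experiments.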
Journal: Expert Systems with Applications - Volume 38, Issue 4, April 2011, Pages 3407–3415