Article ID | Journal ID | Publication year | English article | Full text |
---|---|---|---|---|
424857 | 685650 | 2016 | 14-page PDF | Free download |
• Categorize OSN users into four types according to their posting behavior.
• Propose a Poisson process model and a hash model for collecting fresh tweets.
• Discuss parallelization techniques for the Poisson process model.
• Design centralized and distributed architectures for the crawler system.
• Conduct extensive experiments to verify the models and architectures.
Online social networks (OSNs) are among the most popular new services of recent years. OSNs maintain records of their users' lives, making them a potential resource for journalists, sociologists, and business analysts. Crawling data from social networks is a basic step in processing and analyzing social network information. However, as OSNs grow larger and their content is updated faster than ordinary web pages, crawling becomes more difficult because of limits on bandwidth, politeness (etiquette) constraints, and computational power. To extract fresh information from OSNs efficiently and effectively, we propose a novel crawling method and discuss a parallelization architecture for OSNs. To identify the features of OSNs, we collected and analyzed data from real OSNs and built a model to describe user behavior. Based on this model, we developed methods to predict user behavior. Using these predictions, we can schedule our crawler more sensibly and extract more fresh information using parallelization techniques. Our experimental results demonstrate that the proposed strategies extract information from OSNs efficiently and effectively.
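The scheduling idea in the abstract (model posting behavior, predict it, then crawl the users most likely to have new content) can be illustrated with a minimal sketch. It assumes each user posts according to a homogeneous Poisson process whose rate is estimated from past timestamps. The function names `estimate_rate`, `prob_new_post`, and `pick_users` are hypothetical, and the paper's actual models may well differ.

```python
import heapq
import math


def estimate_rate(post_times):
    """Simple rate estimate (posts per time unit): intervals / observed span.

    Assumption: posts follow a homogeneous Poisson process.
    """
    if len(post_times) < 2:
        return 0.0
    span = max(post_times) - min(post_times)
    return (len(post_times) - 1) / span if span > 0 else 0.0


def prob_new_post(rate, elapsed):
    """P(at least one new post within `elapsed`) = 1 - exp(-rate * elapsed)."""
    return 1.0 - math.exp(-rate * elapsed)


def pick_users(history, last_crawl, now, budget):
    """Return up to `budget` user ids most likely to have fresh posts.

    history:    user id -> list of past post timestamps (hypothetical input)
    last_crawl: user id -> time of the previous crawl of that user
    """
    scored = []
    for user, times in history.items():
        rate = estimate_rate(times)
        elapsed = now - last_crawl.get(user, 0.0)
        scored.append((prob_new_post(rate, elapsed), user))
    # Highest probability first; ties are broken by user id.
    return [user for _, user in heapq.nlargest(budget, scored)]
```

With a fixed request budget per round, ranking users this way spends bandwidth on accounts that are likely to have changed, which is one way to respect the bandwidth and politeness limits the abstract mentions.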
Journal: Future Generation Computer Systems - Volume 59, June 2016, Pages 33–46