کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
516297 1449163 2011 11 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
An exploratory study of a text classification framework for Internet-based surveillance of emerging epidemics
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر نرم افزارهای علوم کامپیوتر
پیش نمایش صفحه اول مقاله
An exploratory study of a text classification framework for Internet-based surveillance of emerging epidemics
چکیده انگلیسی

PurposeEarly detection of infectious disease outbreaks is crucial to protecting the public health of a society. Online news articles provide timely information on disease outbreaks worldwide. In this study, we investigated automated detection of articles relevant to disease outbreaks using machine learning classifiers. In a real-life setting, it is expensive to prepare a training data set for classifiers, which usually consists of manually labeled relevant and irrelevant articles. To mitigate this challenge, we examined the use of randomly sampled unlabeled articles as well as labeled relevant articles.MethodsNaïve Bayes and Support Vector Machine (SVM) classifiers were trained on 149 relevant and 149 or more randomly sampled unlabeled articles. Diverse classifiers were trained by varying the number of sampled unlabeled articles and also the number of word features. The trained classifiers were applied to 15 thousand articles published over 15 days. Top-ranked articles from each classifier were pooled and the resulting set of 1337 articles was reviewed by an expert analyst to evaluate the classifiers.ResultsDaily averages of areas under ROC curves (AUCs) over the 15-day evaluation period were 0.841 and 0.836, respectively, for the naïve Bayes and SVM classifier. We referenced a database of disease outbreak reports to confirm that this evaluation data set resulted from the pooling method indeed covered incidents recorded in the database during the evaluation period.ConclusionsThe proposed text classification framework utilizing randomly sampled unlabeled articles can facilitate a cost-effective approach to training machine learning classifiers in a real-life Internet-based biosurveillance project. We plan to examine this framework further using larger data sets and using articles in non-English languages.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: International Journal of Medical Informatics - Volume 80, Issue 1, January 2011, Pages 56–66
نویسندگان
, , , , , , ,