Classifying disease outbreak reports using n-grams and semantic features

Article code	Journal code	Publication year	English article	Full-text version
516345	1449176	2009	12-page PDF	Free download

Keywords

Information extraction (IE); Feature selection; Text classification; Text mining

Related topics

Engineering and Basic Sciences, Computer Engineering, Computer Science Software

Abstract

Introduction: This paper explores the benefits of using n-grams and semantic features for the classification of disease outbreak reports, in the context of the BioCaster disease outbreak report text mining system. A novel feature of this work is the use of a general purpose semantic tagger – the USAS tagger – to generate features.

Background: We outline the application context for this work (the BioCaster epidemiological text mining system), before going on to describe the experimental data used in our classification experiments (the 1000 document BioCaster corpus).

Feature sets: Three broad groups of features are used in this work: Named Entity based features, n-gram features, and features derived from the USAS semantic tagger.

Methodology: Three standard machine learning algorithms – Naïve Bayes, the Support Vector Machine algorithm, and the C4.5 decision tree algorithm – were used for classifying experimental data (that is, the BioCaster corpus). Feature selection was performed using the χ² feature selection algorithm. Standard text classification performance metrics – Accuracy, Precision, Recall, Specificity and F-score – are reported.

Results: A feature representation composed of unigrams, bigrams, trigrams and features derived from a semantic tagger, in conjunction with the Naïve Bayes algorithm and feature selection yielded the highest classification accuracy (and F-score). This result was statistically significant compared to a baseline unigram representation and to previous work on the same task. However, it was feature selection rather than semantic tagging that contributed most to the improved performance.

Conclusion: This study has shown that for the classification of disease outbreak reports, a combination of bag-of-words, n-grams and semantic features, in conjunction with feature selection, increases classification accuracy at a statistically significant level compared to previous work in this domain.
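
To make the feature pipeline named in the abstract more concrete, the following is a minimal Python sketch of three of its ingredients: extracting unigrams, bigrams and trigrams, ranking them with a chi-squared score, and computing the five reported metrics from a confusion matrix. It is an illustration under stated assumptions, not the authors' implementation. The function names (`ngrams`, `chi_squared`, `select_features`, `metrics`) and the toy documents are hypothetical; the chi-squared formula is the common 2x2 contingency-table form, which may differ from the exact variant used in the paper; and the classifiers, Named Entity features and USAS semantic tags are omitted entirely.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n_max: int = 3) -> List[NGram]:
    """Return all n-grams of length 1..n_max (unigrams, bigrams, trigrams)."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def chi_squared(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic for a 2x2 feature/class contingency table.

    a: relevant docs containing the feature, b: irrelevant docs containing it,
    c: relevant docs lacking it,            d: irrelevant docs lacking it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs: Sequence[Tuple[Sequence[str], bool]], k: int) -> List[NGram]:
    """Rank every n-gram by chi-squared against the relevance label; keep the top k."""
    n_pos = sum(1 for _, relevant in docs if relevant)
    n_neg = len(docs) - n_pos
    pos_df: Counter = Counter()  # document frequency in relevant docs
    neg_df: Counter = Counter()  # document frequency in irrelevant docs
    for tokens, relevant in docs:
        (pos_df if relevant else neg_df).update(set(ngrams(tokens)))
    scores = {}
    for gram in set(pos_df) | set(neg_df):
        a, b = pos_df[gram], neg_df[gram]
        scores[gram] = chi_squared(a, b, n_pos - a, n_neg - b)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, specificity and F-score from a confusion matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Toy documents invented for illustration; not taken from the BioCaster corpus.
    toy_docs = [
        ("cholera outbreak reported".split(), True),
        ("new outbreak confirmed".split(), True),
        ("vaccine research funded".split(), False),
        ("hospital budget approved".split(), False),
    ]
    print(select_features(toy_docs, k=3))
    print(metrics(tp=8, fp=2, tn=85, fn=5))
```

In a fuller pipeline the selected n-grams would become the columns of a document-feature matrix passed to a classifier such as Naïve Bayes; that step is left out here to keep the sketch dependency-free.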

Publisher

Database: Elsevier - ScienceDirect
Journal: International Journal of Medical Informatics - Volume 78, Issue 12, December 2009, Pages e47–e58

Authors

Mike Conway, Son Doan, Ai Kawazoe, Nigel Collier
