Article code: 516033
Journal code: 867164
Publication year: 2006
English article: 16 pages, PDF
Title of the ISI article
Exploiting structural information for semi-structured document categorization
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Science Software
Abstract

This paper examines several different approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorization of documents consisting of a collection of fields, or arbitrary tree-structured documents that can be adequately modeled with such a flat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes, specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. A further improvement of this method is achieved by including the flat text model in the final prediction.
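
The stacking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the per-field naive Bayes models, the nearest-centroid meta classifier, the fold scheme, the toy documents, and every name below (`FieldNB`, `view`, `train_stacked`, `predict`) are assumptions chosen only to keep the example small and dependency-free. Each field gets its own base classifier, a pseudo-field `flat` holds the whole document as one bag of words, and the meta classifier is trained on out-of-fold predictions of the base classifiers.

```python
import math
from collections import Counter


class FieldNB:
    """Multinomial naive Bayes over the tokens of one field (Laplace smoothing)."""

    def fit(self, texts, labels, classes):
        self.classes = classes
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in classes}
        for text, y in zip(texts, labels):
            self.counts[y].update(text.lower().split())
        self.vocab = {w for c in classes for w in self.counts[c]}
        return self

    def predict_proba(self, text):
        n = sum(self.prior.values())
        v = len(self.vocab) + 1  # +1 leaves room for unseen words
        logp = []
        for c in self.classes:
            total = sum(self.counts[c].values())
            lp = math.log((self.prior[c] + 1) / (n + len(self.classes)))
            for w in text.lower().split():
                lp += math.log((self.counts[c][w] + 1) / (total + v))
            logp.append(lp)
        m = max(logp)
        exp = [math.exp(lp - m) for lp in logp]
        return [e / sum(exp) for e in exp]


def view(doc, field):
    # "flat" is the flat text model: all fields joined into one string.
    return " ".join(doc.values()) if field == "flat" else doc.get(field, "")


def fit_level0(docs, labels, fields, classes):
    return {f: FieldNB().fit([view(d, f) for d in docs], labels, classes)
            for f in fields}


def meta_features(models, doc):
    # Concatenate the class probabilities predicted from each field.
    feats = []
    for field, model in models.items():
        feats.extend(model.predict_proba(view(doc, field)))
    return feats


def train_stacked(docs, labels, fields, k=3):
    classes = sorted(set(labels))
    fields = list(fields) + ["flat"]
    meta_x = [None] * len(docs)
    for fold in range(k):  # out-of-fold predictions for the meta level
        idx = [i for i in range(len(docs)) if i % k != fold]
        models = fit_level0([docs[i] for i in idx],
                            [labels[i] for i in idx], fields, classes)
        for i in range(fold, len(docs), k):
            meta_x[i] = meta_features(models, docs[i])
    # Meta classifier (an assumption): one centroid of meta features per class.
    centroids = {}
    for c in classes:
        rows = [x for x, y in zip(meta_x, labels) if y == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    # Base models used at prediction time are refit on all training data.
    return fit_level0(docs, labels, fields, classes), centroids


def predict(stacked, doc):
    models, centroids = stacked
    x = meta_features(models, doc)
    return min(centroids, key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(x, centroids[c])))


# Toy two-field documents, invented for illustration only.
docs = [
    {"title": "football match", "body": "team wins the game"},
    {"title": "new laptop", "body": "fast chip and software"},
    {"title": "tennis final", "body": "player wins the match"},
    {"title": "phone review", "body": "software update for the chip"},
    {"title": "football final", "body": "the team and the player"},
    {"title": "laptop review", "body": "new phone software"},
]
labels = ["sports", "tech", "sports", "tech", "sports", "tech"]
stacked = train_stacked(docs, labels, ["title", "body"])
print(predict(stacked, {"title": "tennis match", "body": "the player wins"}))
```

Training the meta level on out-of-fold predictions keeps it from seeing base-model outputs on documents those models were fitted on. A real system would likely replace both levels with stronger learners.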

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Processing & Management - Volume 42, Issue 3, May 2006, Pages 679–694