Article code: 555961
Journal code: 1451269
Publication year: 2015
English article: 14 pages, PDF
English title of the ISI article
Exploring issues of training data imbalance and mislabelling on random forest performance for large area land cover classification using the ensemble margin
Keywords
Ensemble margin, training data, classification, remote sensing, imbalance, mislabelling
Related topics
Engineering and Basic Sciences > Computer Engineering > Information Systems
Abstract

Studies have demonstrated the robust performance of the ensemble machine learning classifier, random forests, for remote sensing land cover classification, particularly across complex landscapes. This study introduces new ensemble margin criteria to evaluate the performance of Random Forests (RF) in the context of large area land cover classification and examines the effect of different training data characteristics (imbalance and mislabelling) on classification accuracy and uncertainty. The study presents a new margin weighted confusion matrix which, used in combination with the traditional confusion matrix, provides confidence estimates associated with correctly classified and misclassified instances in the RF classification model. Landsat TM satellite imagery and topographic and climate ancillary data are used to build binary (forest/non-forest) and multiclass (forest canopy cover classes) classification models, trained using sample aerial photograph maps, across Victoria, Australia. Experiments were undertaken to reveal insights into the behaviour of RF over large and complex data, in which training data are not evenly distributed among classes (imbalance) and contain systematically mislabelled instances. Results of experiments reveal that while the error rate of the RF classifier is relatively insensitive to mislabelled training data (in the multiclass experiment, overall 78.3% Kappa with no mislabelled instances to 70.1% with 25% mislabelling in each class), the level of associated confidence falls at a faster rate than overall accuracy with increasing amounts of mislabelled training data. In general, balanced training data resulted in the lowest overall error rates for classification experiments (82.3% and 78.3% for the binary and multiclass experiments respectively). However, results of the study demonstrate that imbalance can be introduced to improve error rates of more difficult classes, without adversely affecting overall classification accuracy.
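
The abstract does not spell out how the ensemble margin or the margin weighted confusion matrix is computed. The following is a minimal sketch, not the authors' method. It assumes the commonly used supervised margin (votes for the true class minus the largest vote count for any other class, divided by the number of trees), and it assumes, purely for illustration, that each confusion matrix cell stores the mean margin of the instances that fall into it. All function names are hypothetical, and the per-tree votes are passed in as plain lists rather than taken from any particular RF library.

```python
from collections import Counter, defaultdict


def ensemble_margin(votes, true_label):
    """Supervised margin of one instance, in [-1, 1].

    Assumed definition: (votes for the true class minus the largest
    vote count for any other class) / number of trees.
    """
    counts = Counter(votes)
    v_true = counts.get(true_label, 0)
    v_other = max((c for lbl, c in counts.items() if lbl != true_label), default=0)
    return (v_true - v_other) / len(votes)


def majority_vote(votes):
    """Class predicted by the ensemble: the label with the most tree votes."""
    return Counter(votes).most_common(1)[0][0]


def confusion_and_margin(all_votes, true_labels):
    """Return (cell counts, mean margin per cell), keyed by (true, predicted).

    The mean-margin aggregation is an illustrative assumption, not the
    paper's definition of the margin weighted confusion matrix.
    """
    counts = defaultdict(int)
    margin_sums = defaultdict(float)
    for votes, y in zip(all_votes, true_labels):
        cell = (y, majority_vote(votes))
        counts[cell] += 1
        margin_sums[cell] += ensemble_margin(votes, y)
    mean_margin = {cell: margin_sums[cell] / n for cell, n in counts.items()}
    return dict(counts), mean_margin


# Toy example: three instances, five trees each (made-up votes).
all_votes = [
    ["forest"] * 4 + ["non-forest"] * 1,
    ["forest"] * 3 + ["non-forest"] * 2,
    ["non-forest"] * 5,
]
true_labels = ["forest", "non-forest", "non-forest"]
cell_counts, cell_margins = confusion_and_margin(all_votes, true_labels)
```

Under these assumptions, a diagonal cell with a mean margin near 1 indicates confident correct decisions, while an off-diagonal cell has a negative mean margin whose magnitude indicates how strongly the trees agreed on the wrong class.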

Publisher
Database: Elsevier - ScienceDirect
Journal: ISPRS Journal of Photogrammetry and Remote Sensing - Volume 105, July 2015, Pages 155–168