Article code: 4374768
Journal code: 1617200
Publication year: 2016
English article: 7-page PDF
Full-text version: free download
English title of the ISI article
Quantifying the value of user-level data cleaning for big data: A case study using mammal distribution models
Persian translation of the title
تعیین کمیت ارزش غربال اطلاعات سطح کاربر برای داده های بزرگ: مطالعه موردی با استفاده از مدل توزیع پستانداران
Keywords
Biodiversity informatics; Data cleaning; SDM performance; MAXENT; Australian mammals; Big data
Related topics
Life Sciences and Biotechnology; Agricultural and Biological Sciences; Ecology, Evolution, Behavior and Systematics
English abstract


• User-level data cleaning is seldom applied to biodiversity databases.
• We present a new framework to quantify the effect of data cleaning on SDMs.
• Data cleaning resulted in significant improvement in SDMs across all studied scales.
• The largest SDM improvement following data cleaning was for small mammals (1 g–100 g).
• We exemplify the value of case-specific, user-level data cleaning.

The recent availability of species occurrence data from numerous sources, standardized and connected within a single portal, has the potential to answer fundamental ecological questions. These aggregated big biodiversity databases are prone to numerous data errors and biases. The data-user is responsible for identifying these errors and assessing whether the data are suitable for a given purpose. Complex technical skills are increasingly required for handling and cleaning biodiversity data, while biodiversity scientists possessing these skills are rare. Here, we estimate the effect of user-level data cleaning on species distribution model (SDM) performance. We implement several simple and easy-to-execute data cleaning procedures, and evaluate the change in SDM performance. Additionally, we examine whether a certain group of species is more sensitive to the use of erroneous or unsuitable data. The cleaning procedures used in this research improved SDM performance significantly, across all scales and for all performance measures. The largest improvement in distribution models following data cleaning was for small mammals (1 g–100 g). Data cleaning at the user level is crucial when using aggregated occurrence data, and facilitating its implementation is a key factor in advancing data-intensive biodiversity studies. Adopting a more comprehensive approach for incorporating data cleaning as part of data analysis will not only improve the quality of biodiversity data, but will also impose a more appropriate usage of such data.
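
The abstract does not list the individual cleaning procedures, so the following is only a minimal sketch of what user-level cleaning of occurrence records might look like. The `Occurrence` record type, the `clean_occurrences` function, the specific filters (missing or impossible coordinates, the 0,0 placeholder point, missing or old collection years, duplicate points) and the `min_year` default are all assumptions chosen for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Occurrence:
    """Hypothetical occurrence record; field names are assumptions."""
    species: str
    lat: Optional[float]
    lon: Optional[float]
    year: Optional[int]


def clean_occurrences(records: Iterable[Occurrence],
                      min_year: int = 1950) -> List[Occurrence]:
    """Drop records that fail simple, illustrative quality checks.

    min_year is an arbitrary example threshold, not a value from the paper.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Records without coordinates cannot be used in a distribution model.
        if rec.lat is None or rec.lon is None:
            continue
        # Coordinates outside the valid latitude/longitude ranges.
        if not (-90 <= rec.lat <= 90 and -180 <= rec.lon <= 180):
            continue
        # The point (0, 0) is often a placeholder for missing coordinates.
        if rec.lat == 0 and rec.lon == 0:
            continue
        # Missing or old collection dates.
        if rec.year is None or rec.year < min_year:
            continue
        # Duplicate points for the same species (rounded to 4 decimals).
        key = (rec.species, round(rec.lat, 4), round(rec.lon, 4))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

Comparing model performance on the raw records against the output of such a function is the general shape of the evaluation the abstract describes; the choice of filters would need to be tailored to the species group and data source in question.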

Publisher
Database: Elsevier - ScienceDirect
Journal: Ecological Informatics - Volume 34, July 2016, Pages 139–145
Authors
, ,