Article code | Journal code | Publication year | English article | Full-text version
1165403 | 1491031 | 2013 | 12-page PDF | Free download
English title of the ISI article
Merits of random forests emerge in evaluation of chemometric classifiers by external validation
Related topics
Engineering and Basic Sciences, Chemistry, Analytical Chemistry
English abstract


• Only 6.6% of 286 reviewed papers clearly used ‘external validation’ on classifiers.
• We tested 28 classifiers on NMR or MS data of different origin to the training set.
• Data came from 4 metabolomics or food projects, whose class numbers differed.
• Random forests were best on high-dimensional data, but used in only 4.5% of papers.
• Feature selection with ReliefF improved other machine learning classifiers.

Real-world applications will inevitably entail divergence between samples on which chemometric classifiers are trained and the unknowns requiring classification. This has long been recognized, but there is a shortage of empirical studies on which classifiers perform best in ‘external validation’ (EV), where the unknown samples are subject to sources of variation relative to the population used to train the classifier. A survey of 286 classification studies in analytical chemistry found that only 6.6% stated elements of variance between training and test samples. Instead, most tested classifiers using hold-outs or resampling (usually cross-validation) from the same population used in training. The present study evaluated a wide range of classifiers on NMR and mass spectra of plant and food materials, from four projects with different data properties (e.g., different numbers and prevalence of classes) and classification objectives. Use of cross-validation was found to be optimistic relative to EV on samples of different provenance to the training set (e.g., different genotypes, different growth conditions, different seasons of crop harvest). For classifier evaluations across the diverse tasks, we used ranks-based non-parametric comparisons, and permutation-based significance tests. Although latent variable methods (e.g., PLSDA) were used in 64% of the surveyed papers, they were among the less successful classifiers in EV, and orthogonal signal correction was counterproductive. Instead, the best EV performances were obtained with machine learning schemes that coped with the high dimensionality (914–1898 features). Random forests confirmed their resilience to high dimensionality, as best overall performers on the full data, despite being used in only 4.5% of the surveyed papers. Most other machine learning classifiers were improved by a feature selection filter (ReliefF), but still did not out-perform random forests.
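
The abstract mentions ranks-based non-parametric comparisons and permutation-based significance tests but does not spell out the procedure. The following is a minimal sketch of one common way to do this, not the authors' actual method: classifiers are ranked within each task and the ranks averaged, and a paired sign-flip permutation test compares two classifiers. The function names `average_ranks` and `paired_permutation_test`, the classifier labels, and all score values are hypothetical and chosen purely for illustration; they are not results from the paper.

```python
import random
from statistics import mean


def average_ranks(scores):
    """Mean rank of each classifier across tasks (1 = best; ties share the average rank).

    scores: dict mapping classifier name -> list of scores, one per task,
            where a higher score is better.
    """
    names = list(scores)
    n_tasks = len(next(iter(scores.values())))
    rank_sums = {name: 0.0 for name in names}
    for t in range(n_tasks):
        # Order classifiers from best to worst on this task.
        ordered = sorted(names, key=lambda n: scores[n][t], reverse=True)
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over a run of tied scores.
            while j + 1 < len(ordered) and scores[ordered[j + 1]][t] == scores[ordered[i]][t]:
                j += 1
            tied_rank = ((i + 1) + (j + 1)) / 2  # average of positions i+1 .. j+1
            for k in range(i, j + 1):
                rank_sums[ordered[k]] += tied_rank
            i = j + 1
    return {name: rank_sums[name] / n_tasks for name in names}


def paired_permutation_test(a, b, n_permutations=10000, seed=0):
    """Two-sided sign-flip permutation test on the paired differences a[i] - b[i]."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(permuted)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical external-validation accuracies on four tasks (illustrative values only).
scores = {
    "random_forest": [0.90, 0.85, 0.80, 0.75],
    "plsda": [0.80, 0.70, 0.82, 0.60],
    "svm": [0.85, 0.80, 0.70, 0.70],
}

ranks = average_ranks(scores)
p_value = paired_permutation_test(scores["random_forest"], scores["plsda"])
print(ranks)
print(p_value)
```

With only four tasks, a two-sided sign-flip test cannot produce a p-value below 2/16 = 0.125, so a sketch like this only becomes informative with more tasks or more paired comparisons per task.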


Publisher
Database: Elsevier - ScienceDirect
Journal: Analytica Chimica Acta - Volume 801, 1 November 2013, Pages 22–33