کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
14977 1365 2015 10 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Maximizing lipocalin prediction through balanced and diversified training set and decision fusion
ترجمه فارسی عنوان
حداکثر پیش بینی لیپوکالین از طریق مجموعه تمرین های متمرکز و متنوع و هماهنگی تصمیم گیری
موضوعات مرتبط
مهندسی و علوم پایه مهندسی شیمی بیو مهندسی (مهندسی زیستی)
چکیده انگلیسی


• Unsupervised Kmeans preprocessing for balancing and diversifying training set.
• Enhanced classification of lipocalins by fusion of classifiers.
• Superior generalization on blind testing data sets.
• ReliefF based feature ranking.

Lipocalins are short in sequence length and perform several important biological functions. These proteins are having less than 20% sequence similarity among paralogs. Experimentally identifying them is an expensive and time consuming process. The computational methods based on the sequence similarity for allocating putative members to this family are also far elusive due to the low sequence similarity existing among the members of this family. Consequently, the machine learning methods become a viable alternative for their prediction by using the underlying sequence/structurally derived features as the input. Ideally, any machine learning based prediction method must be trained with all possible variations in the input feature vector (all the sub-class input patterns) to achieve perfect learning. A near perfect learning can be achieved by training the model with diverse types of input instances belonging to the different regions of the entire input space. Furthermore, the prediction performance can be improved through balancing the training set as the imbalanced data sets will tend to produce the prediction bias towards majority class and its sub-classes. This paper is aimed to achieve (i) the high generalization ability without any classification bias through the diversified and balanced training sets as well as (ii) enhanced the prediction accuracy by combining the results of individual classifiers with an appropriate fusion scheme. Instead of creating the training set randomly, we have first used the unsupervised Kmeans clustering algorithm to create diversified clusters of input patterns and created the diversified and balanced training set by selecting an equal number of patterns from each of these clusters. Finally, probability based classifier fusion scheme was applied on boosted random forest algorithm (which produced greater sensitivity) and K nearest neighbour algorithm (which produced greater specificity) to achieve the enhanced predictive performance than that of individual base classifiers. The performance of the learned models trained on Kmeans preprocessed training set is far better than the randomly generated training sets. The proposed method achieved a sensitivity of 90.6%, specificity of 91.4% and accuracy of 91.0% on the first test set and sensitivity of 92.9%, specificity of 96.2% and accuracy of 94.7% on the second blind test set. These results have established that diversifying training set improves the performance of predictive models through superior generalization ability and balancing the training set improves prediction accuracy. For smaller data sets, unsupervised Kmeans based sampling can be an effective technique to increase generalization than that of the usual random splitting method.

Figure optionsDownload as PowerPoint slide

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Computational Biology and Chemistry - Volume 59, Part A, December 2015, Pages 101–110
نویسندگان
, ,