Article ID Journal Published Year Pages File Type
6865266 Neurocomputing 2018 12 Pages PDF
Abstract
Antifreeze proteins (AFPs) inhibit the ice nucleation process, thereby enabling certain organisms to survive in sub-zero habitats. AFPs are thought to have evolved from different protein families to perform the unique function of antifreeze activity, and they are a classic example of convergent evolution. Common sequence-similarity search methods fail to predict putative AFPs because of the poor sequence and structural similarity among the different AFP sub-types. Machine learning techniques are a viable alternative for predicting putative AFPs. In this paper, we discuss the criteria (appropriate feature selection, balanced data sets, and complete learning) that must be taken into account for the successful application of machine learning methods, and we implement these criteria using a clustering procedure in order to measure the true performance of the learning algorithms. Diversified and representative training and testing data sets are crucial for complete learning as well as true testing of machine-learning-based prediction methods, for two reasons. First, a training set that lacks a definable subset of input patterns makes prediction of patterns belonging to that subset difficult or unfeasible (resulting in incomplete learning). Second, a testing set that lacks a definable subset of input patterns cannot reveal whether that subset is correctly predicted by the classifier (resulting in incomplete testing). Moreover, balanced training and testing sets are equally important for achieving the true (robust) performance of classifiers, because a well-balanced training set eliminates bias of the classifier toward a particular class or sub-class caused by over- or under-representation of the input patterns belonging to it.
We used the K-means clustering algorithm to create diversified and balanced training and testing data sets, overcoming the shortcoming of random splitting, which cannot guarantee representative training and testing sets. The proposed clustering-based optimal splitting criterion proved better than random splitting for creating training and testing sets, in terms of superior generalization and robust evaluation.
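The clustering-based splitting idea described above can be sketched as follows: cluster the feature vectors with K-means, then draw the test fraction from every cluster so that both sets cover all discovered sub-groups. This is a minimal illustrative sketch, not the authors' implementation; the function name, cluster count, and test fraction are hypothetical, and it assumes protein sequences have already been encoded as numeric feature vectors.

```python
# Hypothetical sketch of cluster-based train/test splitting (not the paper's code).
# Assumes X is an (n_samples, n_features) array of precomputed sequence features.
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_split(X, n_clusters=10, test_fraction=0.2, seed=0):
    """Split sample indices so that every K-means cluster contributes
    proportionally to both the training and the testing set, unlike a
    purely random split, which may miss small clusters entirely."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    train_idx, test_idx = [], []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        rng.shuffle(members)
        # Reserve at least one member of each cluster for testing.
        n_test = max(1, int(round(test_fraction * len(members))))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return np.array(train_idx), np.array(test_idx)
```

A random split drawn without regard to cluster membership offers no such per-cluster coverage guarantee, which is the shortcoming the clustering-based procedure addresses.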
Related Topics
Physical Sciences and Engineering > Computer Science > Artificial Intelligence