Selection bias in working with the top genes in supervised classification of tissue samples

Article ID	Journal	Published Year	Pages	File Type
1151209	Statistical Methodology	2006	13 Pages	PDF

Abstract

Currently there is much interest in using microarray gene-expression data to form prediction rules for the diagnosis of patient outcomes. Gene selection is usually carried out first to find the genes that, according to some criterion, are most useful for distinguishing between the given classes of tissue samples. However, forming the final prediction rule from a smaller subset of genes chosen by an optimality criterion introduces a bias (selection bias) into the estimated error rate of that rule. In this paper, we focus on the bias that arises when the full data set is not available in the first instance and the prediction rule is formed subsequently by working with the top-ranked genes from the full set. We demonstrate how large the subset of top genes must be before this selection bias is of no practical consequence.
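
The general selection-bias effect described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the procedure used in the paper: the data are pure Gaussian noise with arbitrary class labels, genes are ranked by the absolute difference in class means, and a simple nearest-centroid rule stands in for a support vector machine. The function names (`top_genes`, `centroid_predict`, `cv_error`) and all sizes are hypothetical choices made for illustration. Since no gene carries real signal, an honest error estimate should sit near 50%, whereas selecting the top genes once on all samples before cross-validation typically gives a much lower, optimistic estimate.

```python
import random

def top_genes(X, y, idx, k):
    """Rank genes by |difference in class means| over samples idx; return the top k."""
    a = [i for i in idx if y[i] == 0]
    b = [i for i in idx if y[i] == 1]
    def score(g):
        mean_a = sum(X[i][g] for i in a) / len(a)
        mean_b = sum(X[i][g] for i in b) / len(b)
        return abs(mean_a - mean_b)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def centroid_predict(X, y, train, genes, i):
    """Nearest-centroid prediction for sample i, using only the given genes."""
    dist = {}
    for c in (0, 1):
        members = [j for j in train if y[j] == c]
        dist[c] = sum(
            (X[i][g] - sum(X[j][g] for j in members) / len(members)) ** 2
            for g in genes
        )
    return 0 if dist[0] <= dist[1] else 1

def cv_error(X, y, k, n_folds, select_inside):
    """Cross-validated error rate.

    select_inside=False: genes are chosen once from ALL samples (biased).
    select_inside=True: genes are re-chosen from each training fold only.
    """
    all_idx = list(range(len(X)))
    fixed = None if select_inside else top_genes(X, y, all_idx, k)
    errors = 0
    for f in range(n_folds):
        test = all_idx[f::n_folds]
        train = [i for i in all_idx if i not in test]
        genes = top_genes(X, y, train, k) if select_inside else fixed
        errors += sum(centroid_predict(X, y, train, genes, i) != y[i] for i in test)
    return errors / len(X)

# Assumed toy setup: 40 samples, 1000 noise "genes", two classes of 20 each.
rng = random.Random(0)
n_samples, n_genes = 40, 1000
X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
y = [0] * 20 + [1] * 20

biased_err = cv_error(X, y, k=10, n_folds=10, select_inside=False)
honest_err = cv_error(X, y, k=10, n_folds=10, select_inside=True)
print("selection outside CV:", biased_err)
print("selection inside CV: ", honest_err)
```

In this sketch the only difference between the two estimates is whether the held-out samples took part in ranking the genes, which is the source of the optimism in the first one.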

Keywords

Cross-validation; Gene selection; Selection bias; Support vector machine; Error rates