Variable selection for multifactorial genomic data

Article ID	Journal	Published Year	Pages	File Type
1180944	Chemometrics and Intelligent Laboratory Systems	2012	10 Pages	PDF

Abstract

Dimension reduction techniques are used to explore genomic data. Because studies of this kind include a large number of variables (genes), variable selection methods are needed to identify the most responsive genes, either to improve interpretation of the results or to design more specific experiments. These methods should be consistent with the amount of signal in the data. For this purpose, we introduce a novel selection strategy called minAS and also adapt existing strategies, such as the Gamma approximation and resampling techniques, among others. All of them are based on studying the distribution of statistics that measure the importance of the variables in the model. These strategies have been applied to the ASCA-genes analysis framework and, more generally, to dimension reduction techniques such as PCA. The performance of the different strategies was evaluated using simulated data. The best performing methods were then applied to an experimental dataset containing the transcriptomic profiles of human embryonic stem cells cultured under different oxygen concentrations. The ability of the methods to extract relevant biological information from the data is discussed.
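To make the general idea concrete, the following is a minimal sketch of a resampling-based selection for PCA. It is not the authors' minAS or Gamma procedure, whose details the abstract does not give. It assumes the importance statistic is the leverage of each variable (the sum of its squared loadings on the retained components), that a null distribution is obtained by permuting each column independently, and that variables are selected when their leverage exceeds a pooled null quantile. The function names, the autoscaling step, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def pca_leverage(X, n_components):
    """Leverage of each variable: sum of squared loadings on the retained PCs.

    Columns are autoscaled first (assumption: no constant columns).
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the orthonormal loading vectors of the PCA model.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T            # shape: variables x components
    return np.sum(loadings ** 2, axis=1)

def select_by_permutation(X, n_components=1, n_perm=200, alpha=0.05, seed=0):
    """Select variables whose leverage exceeds a permutation-based threshold."""
    rng = np.random.default_rng(seed)
    observed = pca_leverage(X, n_components)
    null = np.empty((n_perm, X.shape[1]))
    for b in range(n_perm):
        # Shuffle every column independently to destroy between-variable structure.
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
        null[b] = pca_leverage(Xp, n_components)
    # Pooled (1 - alpha) quantile of the null leverages acts as the cut-off.
    threshold = np.quantile(null, 1 - alpha)
    return np.flatnonzero(observed > threshold), observed, threshold

# Simulated example (hypothetical): 100 samples, 60 "genes",
# of which the first 5 respond to a shared latent factor.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 60))
latent = rng.normal(size=100)
X[:, :5] += 5.0 * latent[:, None]

selected, observed, threshold = select_by_permutation(X, n_components=1)
print("selected variables:", selected)
```

Because the cut-off comes from the permuted data rather than from a fixed number of genes, the size of the selected set in this sketch depends on how much structure the data contain, which is the property the abstract calls consistency with the amount of signal.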

► We compare several variable selection strategies in the analysis of genomic data.
► Selection is based on the importance of variables in the multivariate model used.
► We study the performance of selection methods on both simulated and real data.
► Our proposals Gamma and minAS are consistent with the amount of signal in the data.

Keywords

Variable selection; Gene expression; Principal component analysis