Classifying Lung Cancer Recurrence Time Using Novel Ensemble Method with Gene Network based Input Models

Article ID	Journal	Published Year	Pages	File Type
485892	Procedia Computer Science	2012	6 Pages	PDF

Abstract

An accurate prognostic model for a cancer patient after treatment can help in deciding the next course of treatment or in assessing the efficacy of that treatment. Gene expression microarray data has been used to predict survival times [1], or to classify the patient as having a good or poor prognosis [2] by predicting whether the patient belongs to the class that will have a recurrence of cancer before or after a certain period, typically 3 or 5 years. Microarrays typically contain thousands of gene expression probes, whereas a typical study may contain only a few hundred patients or fewer. Typical regression techniques will fail to generalize, suffering from the ‘Curse of Dimensionality’, resulting in an over-fitted model that performs very well on the training data and very poorly on validation data. Various feature selection and reduction methods have been used to reduce the dimensionality of the data and improve or facilitate a solution [3]. Gene expression is known to be modulated by the expression of other genes, forming a so-called gene network or pathway. Furthermore, several networks may affect the aggressiveness of the cancer simultaneously [4]. While past studies have selected features based on statistical methods alone [5], or have simply included ‘known cancer genes’, none to our knowledge have used classification models built from ensembles of models that are each based on a different known gene network. Based on the data presented in Shedden et al. [6], this study uses a General Regression Neural Network (GRNN) Oracle ensemble that combines several partial least squares (PLS) models trained to predict recurrence times from 12 different gene networks. We hypothesize that it is possible to correctly classify recurrence by combining the results of the gene network models.
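
The combining step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the Oracle weighting scheme, the kernel width, or the outputs of the PLS models. The sketch assumes a GRNN in its standard form (a Gaussian-kernel weighted average of training targets) applied to the recurrence-time predictions of the per-network models. The function names, the 3-year cutoff used as default, the value of `sigma`, and all numbers are hypothetical, and the PLS base models are replaced by hard-coded predictions from three made-up networks.

```python
import math


def grnn_predict(train_x, train_y, query, sigma=1.0):
    """Gaussian-kernel weighted average of training targets (standard GRNN form).

    train_x: list of feature vectors; here each vector holds the
             recurrence-time predictions of the per-network base models.
    train_y: observed recurrence times (years) for the training patients.
    query:   base-model predictions for the patient to be scored.
    sigma:   kernel width (hypothetical value, would need tuning).
    """
    weights = []
    for x in train_x:
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, query))
        weights.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Query is far from every training case: fall back to the plain mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


def classify_recurrence(predicted_years, cutoff_years=3.0):
    """Label a patient by whether predicted recurrence falls before the cutoff."""
    return "early" if predicted_years < cutoff_years else "late"


# Toy data (fabricated for illustration only): each row stands in for the
# outputs of three per-network base models for one training patient.
TRAIN_X = [
    [1.0, 1.5, 2.0],
    [2.0, 1.0, 1.5],
    [6.0, 7.0, 5.5],
    [7.0, 6.5, 6.0],
]
TRAIN_Y = [1.2, 1.8, 6.5, 7.0]  # observed recurrence times in years

new_patient = [1.5, 1.2, 1.8]  # base-model predictions for an unseen patient
estimate = grnn_predict(TRAIN_X, TRAIN_Y, new_patient)
label = classify_recurrence(estimate)
```

In this form the ensemble output is dominated by the training patients whose base-model predictions most resemble those of the new patient, which is why the kernel width matters: a small `sigma` approaches nearest-neighbour behaviour, while a large one approaches the global mean of the training targets.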