Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
4496072 | Journal of Theoretical Biology | 2015 | 7 Pages |
• A criterion of maximum dissimilarity–minimum entropy is proposed for identifying pathogenic barcodes.
• Low entropy indicates a relatively consistent disease-causing pattern in case samples.
• Large dissimilarity indicates a significant distinction between cases and controls.
• Pathogenic barcodes with large dissimilarity and a consistent pattern in cases are risky.
• From a statistical perspective, if a shorter barcode contributes to a complex disease, that disease may be more common in the population.
Complex diseases usually involve complex interactions between multiple loci. Artificial intelligence algorithms are a plausible strategy for avoiding combinatorial explosion. However, the randomness of the solutions these algorithms produce decreases the confidence of biological researchers in them. Meanwhile, the lack of an efficient and effective measure to profile the distribution of cases and controls impedes the discovery of pathogenic epistasis. Here we present an efficient method called maximum dissimilarity–minimum entropy (MDME) to analyze breast cancer single-nucleotide polymorphism (SNP) data. The method searches for risky barcodes, which increase the odds ratio and relative risk of breast cancer. The method is based on the hypothesis that if a specific barcode is associated with a disease, then the barcode permits distinction of cases from controls and, more importantly, shows a relatively consistent pattern in cases. An analysis based on a simulated dataset explains the necessity of minimum entropy. Experimental results show that our method can find the most risky barcode that contributes to breast cancer susceptibility. Our method may also mine several pathogenic barcodes that condition the different subtypes of cancer.
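The abstract does not give the exact MDME formulas, so the following is only a minimal sketch of the quantities it names, under stated assumptions: entropy is taken to be the Shannon entropy of the empirical barcode distribution, and the odds ratio and relative risk are computed from a standard 2x2 table of barcode carriers versus non-carriers. The function names, the barcode strings, and the toy counts are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(barcodes):
    """Shannon entropy (in bits) of the empirical barcode distribution.

    Assumption: "entropy" in the abstract is read here as Shannon entropy.
    A lower value means the samples concentrate on fewer barcodes.
    """
    counts = Counter(barcodes)
    total = len(barcodes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def odds_ratio_and_relative_risk(cases, controls, barcode):
    """Odds ratio and relative risk for carrying `barcode` (2x2 table)."""
    a = sum(1 for x in cases if x == barcode)     # cases carrying the barcode
    b = len(cases) - a                            # cases not carrying it
    c = sum(1 for x in controls if x == barcode)  # controls carrying it
    d = len(controls) - c                         # controls not carrying it
    if b == 0 or c == 0:
        raise ValueError("zero cell in the 2x2 table; ratios are undefined")
    odds_ratio = (a * d) / (b * c)
    relative_risk = (a / (a + c)) / (b / (b + d))
    return odds_ratio, relative_risk


# Hypothetical toy data: each string stands for one sample's genotype barcode.
cases = ["AA-GT"] * 6 + ["AG-GT"] * 2 + ["GG-TT"] * 2
controls = ["AA-GT"] * 2 + ["AG-GT"] * 4 + ["GG-TT"] * 4

case_entropy = shannon_entropy(cases)
control_entropy = shannon_entropy(controls)
odds_ratio, relative_risk = odds_ratio_and_relative_risk(cases, controls, "AA-GT")
```

In this toy data the cases concentrate on one barcode, so their entropy is lower than that of the controls, and the same barcode has an odds ratio above 1. That matches the stated hypothesis of a barcode that both separates cases from controls and shows a consistent pattern in cases.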