On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data
Open Access
- 26 May 2010
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 26 (14), 1752-1758
- https://doi.org/10.1093/bioinformatics/btq257
Abstract
Motivation: Genome-wide association (GWA) studies have proven to be a successful approach for helping unravel the genetic basis of complex genetic diseases. However, the identified associations are not well suited for disease prediction, and only a modest portion of the heritability can be explained for most diseases, such as Type 2 diabetes or Crohn's disease. This may partly be due to the low power of standard statistical approaches to detect gene–gene and gene–environment interactions when small marginal effects are present. A promising alternative is Random Forests, which have already been successfully applied in candidate gene analyses. Important single nucleotide polymorphisms are detected by permutation importance measures. To this day, the application to GWA data was highly cumbersome with existing implementations because of the high computational burden.
Results: Here, we present the new freely available software package Random Jungle (RJ), which facilitates the rapid analysis of GWA data. The program yields valid results and computes up to 159 times faster than the fastest alternative implementation, while still maintaining all options of other programs. Specifically, it offers the different permutation importance measures available. It includes new options such as the backward elimination method. We illustrate the application of RJ to a GWA of Crohn's disease. The most important single nucleotide polymorphisms (SNPs) validate recent findings in the literature and reveal potential interactions.
Availability: The RJ software package is freely available at http://www.randomjungle.org
Contact: inke.koenig@imbs.uni-luebeck.de; ziegler@imbs.uni-luebeck.de
Supplementary information: Supplementary data are available at Bioinformatics online.
This publication has 53 references indexed in Scilit:
- Bioinformatics challenges for genome-wide association studies. Bioinformatics, 2010
- Finding the missing heritability of complex diseases. Nature, 2009
- Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics, 2009
- Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nature Genetics, 2008
- Parallels between Global Transcriptional Programs of Polarizing Caco-2 Intestinal Epithelial Cells In Vitro and Gene Expression Programs in Normal Colon and Colon Cancer. Molecular Biology of the Cell, 2007
- Genomewide Association Analysis of Coronary Artery Disease. New England Journal of Medicine, 2007
- A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics, 2007
- Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 2007
- Genome-wide association study identifies new susceptibility loci for Crohn disease and implicates autophagy in disease pathogenesis. Nature Genetics, 2007
- Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology, 2004