Performance of random forest when SNPs are in linkage disequilibrium

Open Access

5 March 2009

journal article
research article
Published by Springer Nature in BMC Bioinformatics

Vol. 10 (1), 78
https://doi.org/10.1186/1471-2105-10-78

Abstract

Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD with it may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: changing the tree-building algorithm so that each tree in an RF is built only with SNPs in LE, modifying the importance measure (IM), and using haplotypes instead of SNPs to build an RF.

We evaluated the performance of our alternative methods by simulating a spectrum of complex genetic models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method applied to SNPs performs well. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype.

Our results suggest that by strategically revising the Random Forest tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method applied to SNPs offers the advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome-wide association studies.
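
As a rough illustration of restricting an analysis to SNPs in approximate linkage equilibrium, the following Python sketch greedily selects a subset of SNPs whose pairwise LD stays below a threshold. It is not the authors' implementation. The function names `pairwise_r2` and `select_le_subset`, the use of squared Pearson correlation between genotype counts as the LD measure, the greedy selection, and the threshold `r2_max=0.2` are all assumptions made for this example. Passing a random generator shuffles the SNP order, which is one hypothetical way to draw a different LE subset for each tree.

```python
import numpy as np


def pairwise_r2(genotypes):
    """Squared Pearson correlation between SNP columns.

    `genotypes` is an (individuals x SNPs) array of 0/1/2 allele counts.
    Using genotype correlation as an LD proxy is an assumption of this sketch.
    """
    g = np.asarray(genotypes, dtype=float)
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.atleast_2d(np.corrcoef(g, rowvar=False))
    # Monomorphic SNPs have undefined correlation; treat it as zero.
    return np.nan_to_num(corr) ** 2


def select_le_subset(genotypes, r2_max=0.2, rng=None):
    """Greedily keep SNPs whose r^2 with every kept SNP is below r2_max.

    With `rng=None` SNPs are visited in column order; with a NumPy
    Generator they are visited in random order (hypothetical per-tree draw).
    """
    r2 = pairwise_r2(genotypes)
    n_snps = r2.shape[0]
    order = np.arange(n_snps) if rng is None else rng.permutation(n_snps)
    kept = []
    for j in order:
        if all(r2[j, k] < r2_max for k in kept):
            kept.append(int(j))
    return sorted(kept)


# Toy data: SNP 1 is a copy of SNP 0 (complete LD); SNP 2 is independent.
demo_rng = np.random.default_rng(0)
snp0 = demo_rng.binomial(2, 0.3, size=500)
snp2 = demo_rng.binomial(2, 0.4, size=500)
demo_genotypes = np.column_stack([snp0, snp0, snp2])
print(select_le_subset(demo_genotypes, r2_max=0.2))
```

In the toy data, only one SNP from the duplicated pair can be kept at a time, so any classifier trained on the selected columns sees at most one representative of that LD block.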
