Pathway analysis using random forests with bivariate node-split for survival outcomes
Open Access
- 18 November 2009
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 26 (2) , 250-258
- https://doi.org/10.1093/bioinformatics/btp640
Abstract
Motivation: There is great interest in pathway-based methods for genomics data analysis in the research community. Although machine learning methods, such as random forests, have been developed to correlate survival outcomes with a set of genes, no study has assessed the ability of these methods to incorporate pathway information when analyzing microarray data. In general, genes that are identified without incorporating biological knowledge are more difficult to interpret. Correlating pathway-based gene expression with survival outcomes may lead to more biologically meaningful prognostic biomarkers. Thus, a comprehensive study of how these methods perform in a pathway-based setting is warranted.
Results: In this article, we describe a pathway-based method using random forests to correlate gene expression data with survival outcomes and introduce novel bivariate node-splitting random survival forests. The proposed method allows researchers to identify important pathways for predicting patient prognosis and time to disease progression, and to discover important genes within those pathways. We compared different implementations of random forests with different split criteria and found that bivariate node-splitting random survival forests with the log-rank test are among the best. We also performed simulation studies showing that random forests outperform several other machine learning algorithms and give results comparable to a newly developed component-wise Cox boosting model. Thus, pathway-based survival analysis using machine learning tools represents a promising approach for dissecting pathways and for generating new biological hypotheses from microarray studies.
Availability: R package Pwayrfsurvival is available from URL: http://www.duke.edu/~hp44/pwayrfsurvival.htm
Contact: pathwayrf@gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
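To make the log-rank split criterion mentioned in the Results more concrete, the following is a minimal Python sketch, not the Pwayrfsurvival implementation. The abstract does not specify how the bivariate node split is constructed, so the sketch assumes one plausible reading: samples in a node are projected onto a linear combination of two genes, and the cut point with the largest two-sample log-rank statistic is kept. The function names, the angle grid and the `min_node` parameter are all hypothetical.

```python
import math


def logrank_statistic(times, events, in_left):
    """Two-sample log-rank chi-square statistic (1 df) for one candidate split.

    times   : follow-up times
    events  : 1 if the event was observed, 0 if censored
    in_left : True if the sample falls in the left daughter node
    """
    observed_minus_expected = 0.0
    variance = 0.0
    # Only time points with at least one observed event contribute.
    for t in sorted({u for u, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)  # at risk, both nodes
        n_left = sum(1 for u, g in zip(times, in_left) if u >= t and g)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d_left = sum(
            1 for u, e, g in zip(times, events, in_left) if u == t and e and g
        )
        observed_minus_expected += d_left - d * n_left / n
        if n > 1:
            # Hypergeometric variance of the left-node event count.
            frac = n_left / n
            variance += d * frac * (1 - frac) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0


def best_bivariate_split(x1, x2, times, events, n_angles=8, min_node=2):
    """Search splits on cos(a)*x1 + sin(a)*x2 (an assumed form of bivariate split).

    Returns (statistic, angle, threshold); angle and threshold are None
    when no admissible split separates the samples.
    """
    best = (0.0, None, None)
    for k in range(n_angles):
        angle = math.pi * k / n_angles
        proj = [math.cos(angle) * a + math.sin(angle) * b for a, b in zip(x1, x2)]
        values = sorted(set(proj))
        # Candidate thresholds are midpoints between adjacent projected values.
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            in_left = [p <= cut for p in proj]
            n_left = sum(in_left)
            if n_left < min_node or len(proj) - n_left < min_node:
                continue
            stat = logrank_statistic(times, events, in_left)
            if stat > best[0]:
                best = (stat, angle, cut)
    return best
```

As a check on the statistic, four uncensored samples with times 1, 2, 3, 4, split so that the first two go left, give 49/17 (about 2.88). Because angle 0 reduces the projection to the first gene alone, the search above can never do worse than the best univariate split on that gene.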