Probability estimation with machine learning methods for dichotomous and multicategory outcome: Theory
- 29 January 2014
- journal article
- research article
- Published by Wiley in Biometrical Journal
- Vol. 56 (4), 534-563
- https://doi.org/10.1002/bimj.201300068
Abstract
Probability estimation for binary and multicategory outcomes using logistic and multinomial logistic regression has a long‐standing tradition in biostatistics. However, biases may occur if the model is misspecified. In contrast, outcome probabilities for individuals can be estimated consistently with machine learning approaches, including k‐nearest neighbors (k‐NN), bagged nearest neighbors (b‐NN), random forests (RF), and support vector machines (SVM). Because machine learning methods are rarely used by applied biostatisticians, the primary goal of this paper is to explain the concept of probability estimation with these methods and to summarize recent theoretical findings. Probability estimation in k‐NN, b‐NN, and RF can be embedded into the class of nonparametric regression learning machines; therefore, we start with the construction of nonparametric regression estimates and review results on consistency and rates of convergence. In SVMs, outcome probabilities for individuals are estimated consistently by repeatedly solving classification problems. For SVMs, we first review the classification problem and then dichotomous probability estimation. Next, we extend the algorithms for estimating probabilities using k‐NN, b‐NN, and RF to multicategory outcomes and discuss approaches for the multicategory probability estimation problem using SVM. In simulation studies for dichotomous and multicategory dependent variables, we demonstrate the general validity of the machine learning methods and compare them with logistic regression. However, each method fails in at least one simulation scenario. We conclude with a discussion of the failures and give recommendations for selecting and tuning the methods. Applications to real data and example code are provided in a companion article (doi:10.1002/bimj.201300077).
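As a rough illustration of what treating k‐NN as a nonparametric regression machine can mean for a dichotomous outcome, the following minimal Python sketch estimates the outcome probability at a query point as the mean of the 0/1 outcomes among its k nearest training points. It is not taken from the article or its companion; the function name `knn_probability`, the Euclidean distance, and the toy data are assumptions made only for this example.

```python
import math


def knn_probability(train_x, train_y, query, k):
    """Estimate P(Y = 1 | X = query) as the mean 0/1 outcome
    of the k training points closest to the query.

    Hypothetical helper for illustration; Euclidean distance is an
    assumption, not something prescribed by the abstract.
    """
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    # Averaging 0/1 outcomes gives the share of neighbors with Y = 1.
    return sum(train_y[i] for i in nearest) / k


# Toy data (made up): two covariates, binary outcome.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8), (0.2, 0.1)]
train_y = [0, 0, 1, 1, 1]

p = knn_probability(train_x, train_y, query=(0.15, 0.15), k=3)
print(f"Estimated probability: {p:.3f}")
```

In this sketch the estimate is a regression of the 0/1 outcome on the covariates rather than a class label, which is why the result is a probability and not a majority vote. The choice of k is a tuning parameter.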
Funding Information
- German Region of the International Biometric Society
- European Union (BiomarCare) (HEALTH-2011-278913)
- German Ministry of Education and Research (CARDomics) (01KU0908A, 01KU0908B, 0315536F)