A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction
- 23 February 2007
- journal article
- research article
- Published by Wiley in Genetic Epidemiology
- Vol. 31 (4) , 306-315
- https://doi.org/10.1002/gepi.20211
Abstract
Multifactor dimensionality reduction (MDR) was developed as a method for detecting statistical patterns of epistasis. The overall goal of MDR is to change the representation space of the data to make interactions easier to detect. It is well known that machine learning methods may not provide robust models when the class variable (e.g. case-control status) is imbalanced and accuracy is used as the fitness measure. This is because most methods learn patterns that are relevant for the larger of the two classes. The goal of this study was to evaluate three different strategies for improving the power of MDR to detect epistasis in imbalanced datasets. The methods evaluated were: (1) over-sampling, which resamples the smaller class with replacement until the data are balanced, (2) under-sampling, which randomly removes subjects from the larger class until the data are balanced, and (3) balanced accuracy [(sensitivity + specificity)/2] as the fitness function, with and without an adjusted threshold. These three methods were compared using simulated data with two-locus epistatic interactions of varying heritability (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and minor allele frequency (0.2, 0.4) that were embedded in 100 replicate datasets of varying sample sizes (400, 800, 1600). Each dataset was generated with different ratios of cases to controls (1:1, 1:2, 1:4). We found that the balanced accuracy function with an adjusted threshold significantly outperformed both over-sampling and under-sampling and fully recovered the power. These results suggest that balanced accuracy should be used instead of accuracy for the MDR analysis of epistasis in imbalanced datasets.
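The three strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 1 = case / 0 = control label coding, and the toy 1:4 dataset are all assumptions made for illustration, and the adjusted threshold is left out because the abstract does not define it.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of subjects classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """(sensitivity + specificity) / 2 for binary labels.

    Assumes 1 = case, 0 = control, and that both classes are present.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def over_sample(minority, majority, rng):
    """Resample the smaller class with replacement until sizes match."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return minority + extra, majority


def under_sample(minority, majority, rng):
    """Randomly remove subjects from the larger class until sizes match."""
    return minority, rng.sample(majority, k=len(minority))


# Hypothetical 1:4 dataset: 20 cases, 80 controls.
y_true = [1] * 20 + [0] * 80
# A degenerate model that labels every subject a control.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # rewards the majority-class guess
print(balanced_accuracy(y_true, y_pred))  # sensitivity 0, specificity 1

rng = random.Random(0)
cases = list(range(20))         # hypothetical subject IDs
controls = list(range(20, 100))
over_cases, over_controls = over_sample(cases, controls, rng)
under_cases, under_controls = under_sample(cases, controls, rng)
print(len(over_cases), len(over_controls))    # both classes at majority size
print(len(under_cases), len(under_controls))  # both classes at minority size
```

On this toy example the all-controls model reaches 80% plain accuracy but only 50% balanced accuracy, which is the failure mode the abstract describes for accuracy-based fitness on imbalanced data. Over-sampling duplicates some cases, while under-sampling discards controls; balanced accuracy uses every subject once.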