Challenges of Big Data analysis
Open Access
- 5 February 2014
- journal article
- review article
- Published by Oxford University Press (OUP) in National Science Review
- Vol. 1 (2) , 293-314
- https://doi.org/10.1093/nsr/nwt032
Abstract
Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper gives an overview of the salient features of Big Data and of how these features drive paradigm change in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that exogenous assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. Such unvalidated assumptions can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
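The spurious-correlation challenge named in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: the function names, the sample size of 60 and the chosen dimensions are illustrative assumptions. It draws a response that is independent of every predictor and reports the largest absolute sample correlation among the first p predictors, which tends to grow with p even though no true relationship exists.

```python
import numpy as np


def abs_correlations(X, y):
    """Absolute sample correlation between y and each column of X."""
    Xc = X - X.mean(axis=0)  # center each predictor
    yc = y - y.mean()        # center the response
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return np.abs(Xc.T @ yc) / denom


def spurious_correlation_profile(n, dims, seed=0):
    """Largest |correlation| among the first p predictors, for each p in dims.

    Predictors and response are independent standard normals, so any
    correlation found here is spurious by construction.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, max(dims)))
    y = rng.standard_normal(n)
    corr = abs_correlations(X, y)
    return {p: float(corr[:p].max()) for p in dims}


if __name__ == "__main__":
    # Hypothetical settings: 60 observations, increasing dimensionality.
    for p, r in spurious_correlation_profile(60, [10, 1000, 5000]).items():
        print(f"p = {p:5d}  max |corr| = {r:.3f}")
```

Because the columns are nested, the reported maximum can only stay the same or increase as p grows, which makes the effect of dimensionality easy to read off.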