The properties of high-dimensional data spaces: implications for exploring gene and protein expression data
- 1 January 2008
- journal article
- review article
- Published by Springer Nature in Nature Reviews Cancer
- Vol. 8 (1), 37-49
- https://doi.org/10.1038/nrc2294
Abstract
The application of several high-throughput genomic and proteomic technologies to address questions in cancer diagnosis, prognosis and prediction generates high-dimensional data sets. The multimodality of high-dimensional cancer data, for example, as a consequence of the heterogeneous and dynamic nature of cancer tissues, the concurrent expression of multiple biological processes and the diverse and often tissue-specific activities of single genes, can confound both simple mechanistic interpretations of cancer biology and the generation of complete or accurate gene signal transduction pathways or networks. The mathematical and statistical properties of high-dimensional data spaces are often poorly understood or inadequately considered. This can be particularly challenging for the common scenario where the number of data points obtained for each specimen greatly exceeds the number of specimens. Data are rarely randomly distributed in high dimensions and are highly correlated, often with spurious correlations. In high dimensions, the distances from a data point to its nearest and farthest neighbours can become nearly equal, potentially compromising the accuracy of some distance-based analysis tools. Owing to the 'curse of dimensionality' phenomenon and its negative impact on generalization performance (for example, estimation instability, model overfitting and local convergence), the large estimation error from complex statistical models can easily compromise the prediction advantage provided by their greater representation power. Conversely, simpler statistical models may produce more reproducible predictions, but their predictions may not always be adequate. Some machine learning methods address the 'curse of dimensionality' in high-dimensional data analysis through feature selection and dimensionality reduction, leading to better data visualization and improved classification.
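The claim that nearest and farthest neighbours become nearly equidistant can be illustrated numerically. The sketch below is not taken from the article: it assumes points drawn uniformly at random from a unit hypercube (real expression data are correlated and would behave differently), and the function name `relative_contrast` is a hypothetical helper introduced only for this illustration. It reports how much farther the farthest neighbour is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for one query point.

    Assumption: all points are uniform random draws from the unit
    hypercube, which is a simplification and not a model of real
    gene or protein expression data.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast shrinks as the number of dimensions grows.
    for n_dims in (2, 10, 100, 1000):
        print(n_dims, round(relative_contrast(200, n_dims), 3))
```

Under these assumptions the contrast is large in two dimensions and small in a thousand, which is one way to see why distance-based tools can lose discriminating power as dimensionality increases.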
It is important to ensure the generalization capability of classifiers derived by supervised learning methods from high-dimensional data before using them for cancer diagnosis, prognosis or prediction. Although this can be assessed initially through cross-validation methods, a more rigorous approach is needed, that is, to validate classifier performance using one or more blind validation data sets that were not used during supervised learning.
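The need for blind validation can be illustrated with a small simulation. The following sketch is an assumption-laden toy, not the authors' method: the features are pure Gaussian noise with no relation to the class labels, there are far more features (2000) than training specimens (40), and the classifier is a simple nearest-centroid rule on the ten features with the largest class-mean gap. All function names and parameter values are hypothetical choices made for this example.

```python
import random


def make_noise_data(n_samples, n_features, rng):
    """Gaussian noise features with balanced labels that carry no real signal."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def class_means(X, y, features):
    """Per-class mean of each listed feature."""
    means = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return means


def fit(X, y, k):
    """Keep the k features with the largest class-mean gap, then store centroids."""
    all_features = range(len(X[0]))
    m = class_means(X, y, all_features)
    gaps = [abs(a - b) for a, b in zip(m[0], m[1])]
    selected = sorted(all_features, key=lambda j: gaps[j], reverse=True)[:k]
    return selected, class_means(X, y, selected)


def accuracy(model, X, y):
    """Fraction of samples assigned to the nearest class centroid correctly."""
    selected, centroids = model
    correct = 0
    for x, label in zip(X, y):
        dist = {c: sum((x[j] - mu) ** 2
                       for j, mu in zip(selected, centroids[c]))
                for c in (0, 1)}
        predicted = 0 if dist[0] <= dist[1] else 1
        correct += predicted == label
    return correct / len(y)


def run_demo(seed=0):
    """Return (training-set accuracy, blind-set accuracy) on pure noise."""
    rng = random.Random(seed)
    X_train, y_train = make_noise_data(40, 2000, rng)
    X_blind, y_blind = make_noise_data(200, 2000, rng)  # never used for fitting
    model = fit(X_train, y_train, k=10)
    return accuracy(model, X_train, y_train), accuracy(model, X_blind, y_blind)


if __name__ == "__main__":
    train_acc, blind_acc = run_demo()
    print("training-set accuracy:", train_acc)
    print("blind-set accuracy:", blind_acc)
```

Because the selected features are spurious correlations found among thousands of noise variables, the classifier looks accurate on the specimens used to build it, while its accuracy on the blind set stays near chance. Only the blind-set figure reflects generalization in this toy setting.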