The properties of high-dimensional data spaces: implications for exploring gene and protein expression data

1 January 2008

journal article
review article
Published by Springer Nature in Nature Reviews Cancer

Vol. 8 (1), 37-49
https://doi.org/10.1038/nrc2294

Abstract

The application of several high-throughput genomic and proteomic technologies to address questions in cancer diagnosis, prognosis and prediction generates high-dimensional data sets. The multimodality of high-dimensional cancer data, for example, as a consequence of the heterogeneous and dynamic nature of cancer tissues, the concurrent expression of multiple biological processes and the diverse and often tissue-specific activities of single genes, can confound both simple mechanistic interpretations of cancer biology and the generation of complete or accurate gene signal transduction pathways or networks. The mathematical and statistical properties of high-dimensional data spaces are often poorly understood or inadequately considered. This can be particularly challenging for the common scenario where the number of data points obtained for each specimen greatly exceeds the number of specimens. Data are rarely randomly distributed in high dimensions and are highly correlated, often with spurious correlations. The distances from a data point to its nearest and farthest neighbours can become nearly equal in high dimensions, potentially compromising the accuracy of some distance-based analysis tools. Owing to the 'curse of dimensionality' phenomenon and its negative impact on generalization performance, such as estimation instability, model overfitting and local convergence, the large estimation error from complex statistical models can easily compromise the prediction advantage provided by their greater representation power. Conversely, simpler statistical models may produce more reproducible predictions but their predictions may not always be adequate. Some machine learning methods address the 'curse of dimensionality' in high-dimensional data analysis through feature selection and dimensionality reduction, leading to better data visualization and improved classification.
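The statement about nearest and farthest neighbours can be illustrated numerically. The following is a minimal sketch, not taken from the article: it assumes points drawn independently and uniformly from the unit hypercube (an idealization, since the abstract itself notes that real expression data are correlated and rarely random), and the function name `relative_contrast` is hypothetical. It reports how much farther the farthest neighbour is than the nearest one, relative to the nearest distance.

```python
import numpy as np


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points points drawn uniformly from [0, 1]^n_dims.

    Assumption: independent, uniformly distributed coordinates.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(data - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


if __name__ == "__main__":
    # The contrast shrinks as dimensionality grows, so the nearest and
    # farthest neighbours become hard to tell apart by distance alone.
    for d in (2, 10, 100, 1000, 10000):
        print(f"dims={d:>5}  relative contrast={relative_contrast(500, d):.3f}")
```

Under these assumptions the contrast falls steadily with dimension. For real data the size of the effect is likely to depend on the correlation structure rather than on the nominal number of measured genes or proteins.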
It is important to ensure the generalization capability of classifiers derived by supervised learning methods from high-dimensional data before using them for cancer diagnosis, prognosis or prediction. Although this can be assessed initially through cross-validation methods, a more rigorous approach is needed, that is, validating classifier performance using one or more blind validation data sets that were not used during supervised learning.
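Why cross-validation alone can mislead when specimens are few and features are many can be sketched on pure noise. The example below is an illustration under stated assumptions, not the article's procedure: the data are synthetic Gaussian noise with labels unrelated to the features, the classifier is a simple nearest-centroid rule, and all function names (`top_features`, `nearest_centroid_predict`, `cv_accuracy`, `demo`) are hypothetical. It compares cross-validation with feature selection done once on all specimens, cross-validation with selection repeated inside each training fold, and accuracy on a freshly generated blind set.

```python
import numpy as np


def top_features(X, y, k):
    """Indices of the k features with the largest absolute class-mean difference."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]


def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test row to the class whose training centroid is closer."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)


def cv_accuracy(X, y, k, n_folds, select_inside, rng):
    """k-fold CV accuracy. Features are selected inside each training fold
    (select_inside=True) or once on all specimens (select_inside=False)."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    global_feats = top_features(X, y, k)  # has seen the test-fold labels
    correct = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        if select_inside:
            feats = top_features(X[train_idx], y[train_idx], k)
        else:
            feats = global_feats
        pred = nearest_centroid_predict(
            X[train_idx][:, feats], y[train_idx], X[test_idx][:, feats]
        )
        correct += int((pred == y[test_idx]).sum())
    return correct / len(y)


def demo(n=40, p=5000, k=20, seed=0):
    """Return (leaky CV accuracy, nested CV accuracy, blind-set accuracy)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))      # assumption: pure noise features
    y = np.repeat([0, 1], n // 2)        # labels carry no real signal
    leaky = cv_accuracy(X, y, k, 5, select_inside=False, rng=rng)
    nested = cv_accuracy(X, y, k, 5, select_inside=True, rng=rng)
    # Blind validation: fresh specimens never used for selection or training.
    X_new = rng.standard_normal((200, p))
    y_new = np.repeat([0, 1], 100)
    feats = top_features(X, y, k)
    pred = nearest_centroid_predict(X[:, feats], y, X_new[:, feats])
    blind = float((pred == y_new).mean())
    return leaky, nested, blind


if __name__ == "__main__":
    leaky, nested, blind = demo()
    print(f"leaky CV={leaky:.2f}  nested CV={nested:.2f}  blind set={blind:.2f}")
```

Because the labels are random, any accuracy well above chance in the first figure reflects information leaking from the test folds into feature selection. The nested estimate and the blind-set estimate are expected to stay near chance, which is the behaviour an independent validation set is meant to expose.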
