Exploring the nonlinear geometry of protein homology
- 1 August 2003
- journal article
- research article
- Published by Wiley in Protein Science
- Vol. 12 (8) , 1604-1612
- https://doi.org/10.1110/ps.0379403
Abstract
The explosion of biological data resulting from genomic and proteomic research has created a pressing need for data analysis techniques that work effectively on a large scale. An area of particular interest is the organization and visualization of large families of protein sequences. An increasingly popular approach is to embed the sequences into a low-dimensional Euclidean space in a way that preserves some predefined measure of sequence similarity. This method has been shown to produce maps that exhibit global order and continuity and reveal important evolutionary, structural, and functional relationships between the embedded proteins. However, protein sequences are related by evolutionary pathways that exhibit highly nonlinear geometry, which is invisible to classical embedding procedures such as multidimensional scaling (MDS) and nonlinear mapping (NLM). Here, we describe the use of stochastic proximity embedding (SPE) for producing Euclidean maps that preserve the intrinsic dimensionality and metric structure of the data. SPE extends previous approaches in two important ways: (1) It preserves only local relationships between closely related sequences, thus allowing the map to unfold and reveal its intrinsic dimension, and (2) it scales linearly with the number of sequences and therefore can be applied to very large protein families. The merits of the algorithm are illustrated using examples from the protein kinase and nuclear hormone receptor superfamilies.
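The abstract names the two properties of SPE (only local relationships are fitted, and cost grows linearly with the number of sequences) but does not give the update rule. The following is a minimal sketch, assuming the pairwise refinement scheme usually described for SPE: pick two random points, compare their map distance with the input dissimilarity, and nudge both points to reduce the mismatch, fitting distant pairs only when the map places them too close. The function name `spe_embed`, the `dissimilarity` callback, the `cutoff` value, and the linear learning-rate schedule are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def spe_embed(n, dissimilarity, dim=2, cutoff=0.5, cycles=50,
              steps_per_cycle=None, lr_start=1.0, lr_end=0.01,
              eps=1e-10, seed=0):
    """Embed n items in `dim` dimensions with SPE-style pairwise updates.

    dissimilarity(i, j) is a hypothetical callback returning the input
    dissimilarity r_ij, e.g. 1 - fractional sequence identity.
    """
    rng = random.Random(seed)
    if steps_per_cycle is None:
        # Assumption: work per cycle is a fixed multiple of n, so the
        # total cost grows linearly with the number of items.
        steps_per_cycle = 10 * n
    coords = [[rng.random() for _ in range(dim)] for _ in range(n)]
    lr = lr_start
    decay = (lr_start - lr_end) / max(cycles - 1, 1)
    for _ in range(cycles):
        for _ in range(steps_per_cycle):
            i, j = rng.sample(range(n), 2)  # one random pair per step
            r = dissimilarity(i, j)
            xi, xj = coords[i], coords[j]
            d = math.dist(xi, xj)
            # Local pairs (r <= cutoff) are always fitted; distant pairs
            # are only pushed apart when the map places them too close.
            if r <= cutoff or d < r:
                scale = lr * 0.5 * (r - d) / (d + eps)
                for k in range(dim):
                    delta = scale * (xi[k] - xj[k])
                    xi[k] += delta
                    xj[k] -= delta
        lr -= decay  # anneal the learning rate between cycles
    return coords
```

Because each step touches a single pair, the sketch never builds the full n-by-n dissimilarity matrix; dissimilarities are computed on demand through the callback. In this sketch the `cutoff` parameter is what restricts exact fitting to closely related items.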