Novel peptide identification from tandem mass spectra using ESTs and sequence database compression
Open Access
- 1 January 2007
- journal article
- research article
- Published by Springer Nature in Molecular Systems Biology
- Vol. 3 (1) , 102
- https://doi.org/10.1038/msb4100142
Abstract
Peptide identification by tandem mass spectrometry is the dominant proteomics workflow for protein characterization in complex samples. Traditional search engines, which match peptide sequences with tandem mass spectra to identify the samples’ proteins, use protein sequence databases to suggest peptide candidates for consideration. Although the acquisition of tandem mass spectra is not biased toward well‐understood protein isoforms, this computational strategy is failing to identify peptides from alternative splicing and coding SNP protein isoforms despite the acquisition of good‐quality tandem mass spectra. We propose, instead, that expressed sequence tags (ESTs) be searched. Ordinarily, such a strategy would be computationally infeasible due to the size of EST sequence databases; however, we show that a sophisticated sequence database compression strategy, applied to human ESTs, reduces the sequence database size approximately 35‐fold. Once compressed, our EST sequence database is comparable in size to other commonly used protein sequence databases, making routine EST searching feasible. We demonstrate that our EST sequence database enables the discovery of novel peptides in a variety of public data sets.
Mol Syst Biol. 3: 102
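The abstract does not describe the compression strategy itself, so the following is only an illustrative sketch of why highly redundant sequence collections such as ESTs can shrink substantially when a search engine needs each short peptide candidate just once. It counts how many fixed-length substrings (k-mers) occur in total versus how many are distinct. The function names, the toy sequences, and the choice of `k` are all assumptions made for illustration; this is not the authors' algorithm and the numbers say nothing about the reported 35‐fold figure.

```python
def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def redundancy_ratio(sequences, k):
    """Return (total k-mer occurrences, distinct k-mers, total/distinct).

    A ratio well above 1 means many k-mers are repeated across the
    input, so a representation that stores each k-mer once could be
    much smaller than the raw sequences.
    """
    total = 0
    distinct = set()
    for seq in sequences:
        for km in kmers(seq, k):
            total += 1
            distinct.add(km)
    ratio = total / len(distinct) if distinct else 0.0
    return total, len(distinct), ratio


# Hypothetical, hand-made overlapping fragments (not real EST translations).
toy = ["MKTAYIAKQR", "KTAYIAKQRQ", "MKTAYIAKQR", "TAYIAKQRQI"]
total, distinct, ratio = redundancy_ratio(toy, k=5)
print(total, distinct, ratio)
```

In this toy input the overlapping and duplicated fragments repeat most of their 5-mers, which is the kind of redundancy a compressed database can exploit while still presenting every candidate peptide to the search engine.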