Experiment-Specific Estimation of Peptide Identification Probabilities Using a Randomized Database
- 1 December 2007
- journal article
- research article
- Published by Mary Ann Liebert Inc in OMICS: A Journal of Integrative Biology
- Vol. 11 (4), 351-366
- https://doi.org/10.1089/omi.2007.0040
Abstract
Determining the error rate for peptide and protein identification accurately and reliably is necessary to enable evaluation and cross-comparison of high-throughput proteomics experiments. Currently, peptide identification is based either on preset scoring thresholds or on probabilistic models trained on datasets that are often dissimilar to experimental results. The false discovery rates (FDR) and peptide identification probabilities for these preset thresholds or models often vary greatly across the experimental treatments, organisms, or instruments used in specific experiments. To overcome these difficulties, randomized databases have been used to estimate the FDR. However, the cumulative FDR may include low-probability identifications when there are a large number of peptide identifications and exclude high-probability identifications when there are few. To overcome this logical inconsistency, this study expands the use of randomized databases to generate experiment-specific estimates of peptide identification probabilities. These probabilities are generated by logistic and Loess regression models of the peptide scores obtained from original and reshuffled database matches. They are shown to closely approximate "true" probabilities based on known standard protein mixtures across different experiments. Probabilities generated by the earlier Peptide_Prophet and more recent LIPS models are shown to differ significantly from this study's experiment-specific probabilities, especially for unknown samples. The experiment-specific probabilities reliably estimate the accuracy of peptide identifications and overcome potential logical inconsistencies of the cumulative FDR. This estimation method is demonstrated using a Sequest database search, LIPS model, and a reshuffled database.
However, this approach is generally applicable to any search algorithm, peptide scoring, and statistical model when using a randomized database.
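
The contrast the abstract draws between a cumulative FDR and a score-specific identification probability can be illustrated with a small sketch. This is not the study's implementation: a simple per-bin estimate stands in for the logistic and Loess regression models named above, the scores are synthetic, and the function names and the relation P(correct | score) ≈ 1 - decoys/targets (which assumes that matches to the reshuffled database are about as numerous as incorrect matches to the original database at each score) are assumptions made for illustration only.

```python
import numpy as np


def cumulative_fdr(target_scores, decoy_scores, threshold):
    """Estimated FDR among all original-database matches at or above threshold."""
    n_target = np.sum(np.asarray(target_scores) >= threshold)
    n_decoy = np.sum(np.asarray(decoy_scores) >= threshold)
    return float(n_decoy) / n_target if n_target > 0 else 0.0


def local_probability(target_scores, decoy_scores, bin_edges):
    """Per-bin estimate of P(correct | score) = 1 - decoys / targets.

    Assumption (not from the source): decoy matches approximate the number
    of incorrect target matches in each score bin. Bins without any target
    match are left as NaN.
    """
    n_target, _ = np.histogram(target_scores, bins=bin_edges)
    n_decoy, _ = np.histogram(decoy_scores, bins=bin_edges)
    prob = np.full(len(n_target), np.nan)
    filled = n_target > 0
    prob[filled] = 1.0 - n_decoy[filled] / n_target[filled]
    return np.clip(prob, 0.0, 1.0)


# Synthetic scores (hypothetical, for illustration only).
rng = np.random.default_rng(0)
decoy = rng.normal(1.5, 0.5, 2000)        # matches to the reshuffled database
incorrect = rng.normal(1.5, 0.5, 2000)    # incorrect matches to the original database
correct = rng.normal(3.5, 0.6, 1000)      # correct matches to the original database
target = np.concatenate([incorrect, correct])

bin_edges = np.arange(0.0, 6.5, 0.5)      # bins [0.0, 0.5), [0.5, 1.0), ...
prob = local_probability(target, decoy, bin_edges)
fdr_at_2_5 = cumulative_fdr(target, decoy, 2.5)

print(f"cumulative FDR for scores >= 2.5: {fdr_at_2_5:.3f}")
for lo, p in zip(bin_edges[:-1], prob):
    print(f"score bin [{lo:.1f}, {lo + 0.5:.1f}): P(correct) ~ {p:.2f}")
```

In this synthetic setting the cumulative FDR at the threshold is small, yet the bin immediately above the threshold has a noticeably higher local error rate, which is the kind of inconsistency the abstract attributes to relying on the cumulative FDR alone. A regression model fitted to the scores would smooth these per-bin estimates.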