Experiment-Specific Estimation of Peptide Identification Probabilities Using a Randomized Database

1 December 2007

journal article
research article
Published by Mary Ann Liebert Inc in OMICS: A Journal of Integrative Biology

Vol. 11 (4), 351-366
https://doi.org/10.1089/omi.2007.0040

Abstract

Determining the error rate for peptide and protein identification accurately and reliably is necessary to enable evaluation and cross-comparison of high-throughput proteomics experiments. Currently, peptide identification is based either on preset scoring thresholds or on probabilistic models trained on datasets that are often dissimilar to experimental results. The false discovery rates (FDR) and peptide identification probabilities for these preset thresholds or models often vary greatly across the experimental treatments, organisms, or instruments used in specific experiments. To overcome these difficulties, randomized databases have been used to estimate the FDR. However, the cumulative FDR may include low-probability identifications when there are many peptide identifications and exclude high-probability identifications when there are few. To overcome this logical inconsistency, this study expands the use of randomized databases to generate experiment-specific estimates of peptide identification probabilities. These experiment-specific probabilities are generated by logistic and Loess regression models of the peptide scores obtained from original and reshuffled database matches. They are shown to closely approximate "true" probabilities based on known standard protein mixtures across different experiments. Probabilities generated by the earlier PeptideProphet and more recent LIPS models are shown to differ significantly from this study's experiment-specific probabilities, especially for unknown samples. The experiment-specific probabilities reliably estimate the accuracy of peptide identifications and overcome potential logical inconsistencies of the cumulative FDR. This estimation method is demonstrated using a Sequest database search, the LIPS model, and a reshuffled database. However, the approach is generally applicable to any search algorithm, peptide scoring scheme, and statistical model when a randomized database is used.
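
The contrast between a cumulative FDR and a score-specific probability can be illustrated with a minimal sketch. It is not the study's method: the abstract describes logistic and Loess regression, whereas this example substitutes simple score binning to keep the code short. It also assumes separate searches against an original and a reshuffled database of equal size, so that reshuffled-database matches approximate the number of false matches at each score. The function names and all scores are hypothetical.

```python
def cumulative_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR among original-database matches scoring >= threshold.

    Assumes reshuffled-database (decoy) matches occur about as often as
    false matches in the original database at the same score.
    """
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)


def binned_probability(target_scores, decoy_scores, low, high):
    """Estimate the probability that a match with low <= score < high is correct.

    A crude stand-in for a regression model: one estimate per score bin.
    Returns None when the bin holds no original-database matches.
    """
    n_target = sum(low <= s < high for s in target_scores)
    n_decoy = sum(low <= s < high for s in decoy_scores)
    if n_target == 0:
        return None
    return max(0.0, 1.0 - n_decoy / n_target)


# Hypothetical peptide scores, invented for illustration only.
target = [4.1, 3.8, 3.5, 3.2, 2.9, 2.6, 2.2, 2.0, 1.8, 1.5]
decoy = [2.1, 1.9, 1.6, 1.2, 0.9]

fdr = cumulative_fdr(target, decoy, 1.5)
p_low = binned_probability(target, decoy, 1.5, 2.5)
p_high = binned_probability(target, decoy, 2.5, 5.0)

print(f"cumulative FDR at score >= 1.5: {fdr:.2f}")
print(f"probability correct, 1.5 <= score < 2.5: {p_low:.2f}")
print(f"probability correct, 2.5 <= score < 5.0: {p_high:.2f}")
```

With these invented scores, a threshold of 1.5 gives a cumulative FDR of 0.30, yet matches scoring between 1.5 and 2.5 have an estimated probability of only 0.25 of being correct. The many high-scoring matches mask the poor quality of the low-scoring ones, which is the kind of inconsistency the abstract attributes to the cumulative FDR.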
