Statistical Calibration of the SEQUEST XCorr Function

10 March 2009

journal article
research article
Published by American Chemical Society (ACS) in Journal of Proteome Research

Vol. 8 (4) , 2106-2113
https://doi.org/10.1021/pr8011107

Abstract

Obtaining accurate peptide identifications from shotgun proteomics liquid chromatography tandem mass spectrometry (LC-MS/MS) experiments requires a score function that consistently ranks correct peptide−spectrum matches (PSMs) above incorrect matches. We have observed that, for the Sequest score function Xcorr, the inability to discriminate between correct and incorrect PSMs is due in part to spectrum-specific properties of the score distribution. In other words, some spectra score well regardless of which peptides they are scored against, and other spectra score well because they are scored against a large number of peptides. We describe a protocol for calibrating PSM score functions, and we demonstrate its application to Xcorr and the preliminary Sequest score function Sp. The protocol accounts for spectrum- and peptide-specific effects by calculating p values for each spectrum individually, using only that spectrum’s score distribution. We demonstrate that these calculated p values are uniform under a null distribution and therefore accurately measure significance. These p values can be used to estimate the false discovery rate, thereby eliminating the need for an extra search against a decoy database. In addition, we show that the p values are better calibrated than their underlying scores; consequently, when ranking top-scoring PSMs from multiple spectra, p values are better at discriminating between correct and incorrect PSMs. The calibration protocol is generally applicable to any PSM score function for which an appropriate parametric family can be identified.
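The per-spectrum calibration described in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the parametric family or give formulas, so everything below is an assumption made for illustration only: the shifted exponential family, the correction for the number of candidate peptides, the use of the Benjamini-Hochberg procedure for false discovery rate estimation, and the function names `fit_shifted_exponential`, `spectrum_p_value`, and `benjamini_hochberg`. It is not the authors' implementation.

```python
import math


def fit_shifted_exponential(scores):
    """Maximum-likelihood fit of a shifted exponential (an assumed family).

    Returns (loc, scale) with loc = min(scores) and scale = mean(scores) - loc.
    """
    loc = min(scores)
    scale = sum(scores) / len(scores) - loc
    if scale <= 0:
        raise ValueError("scores must not all be identical")
    return loc, scale


def spectrum_p_value(top_score, candidate_scores):
    """p value for one spectrum's top score, using only that spectrum's scores.

    candidate_scores: scores of the same spectrum against its other candidate
    peptides, treated here as a sample from the null distribution.
    """
    loc, scale = fit_shifted_exponential(candidate_scores)
    # Probability that a single null score is at least as large as top_score.
    single = math.exp(-max(top_score - loc, 0.0) / scale)
    if single >= 1.0:
        return 1.0
    # Assumed correction for the number of candidates n: the chance that the
    # best of n independent null scores reaches top_score, 1 - (1 - single)^n.
    n = len(candidate_scores)
    return -math.expm1(n * math.log1p(-single))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under these assumptions, each spectrum yields one p value via `spectrum_p_value`, and passing the p values from all spectra to `benjamini_hochberg` gives adjusted values that can be thresholded at a chosen false discovery rate without a separate decoy search. Because each fit uses only the scores of a single spectrum, a spectrum that scores well against every peptide receives a correspondingly higher null baseline.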
