Statistical Models for Protein Validation Using Tandem Mass Spectral Data and Protein Amino Acid Sequence Databases

17 February 2004

journal article
research article
Published by American Chemical Society (ACS) in Analytical Chemistry

Vol. 76 (6), 1664-1671
https://doi.org/10.1021/ac035112y

Abstract

The purpose of this work is to develop and verify statistical models for protein identification using peptide identifications derived from the results of tandem mass spectral database searches. We recently presented a probabilistic model for peptide identification that uses the hypergeometric distribution to approximate fragment ion matches of database peptide sequences to experimental tandem mass spectra. Here we apply statistical models to the database search results to validate protein identifications. To this end, we formulate the protein identification problem in terms of two independent models, a two-hypothesis binomial model and a multinomial model, which use the hypergeometric probabilities and the cross-correlation scores, respectively. Each database search result is assumed to be a probabilistic (Bernoulli) event with two outcomes: a protein is either identified or not. The probability of identifying a protein at each Bernoulli event is determined from the relative length of the protein in the database (the null hypothesis) or from the hypergeometric probability scores of the protein's peptides (the alternative hypothesis). We then calculate the binomial probability that the protein will be observed a certain number of times (the number of database matches to its peptides) given the size of the data set (the number of spectra) and the probability of protein identification at each Bernoulli event. The ratio of the probabilities from these two hypotheses (the maximum likelihood ratio) is used as a test statistic to discriminate between true and false identifications. The significance and confidence levels of protein identifications are calculated from the model distributions. The multinomial model combines the database search results and generates an observed frequency distribution of cross-correlation scores (grouped into bins) between experimental spectra and identified amino acid sequences. The frequency distribution is used to generate p-value probabilities for each score bin. These probabilities are then normalized across the score bins. A protein identification probability is the multinomial probability of observing the given set of peptide scores. To reduce the effect of random matches, we employ a marginalized multinomial model for small values of cross-correlation scores. We demonstrate that the combination of the two independent methods provides a useful tool for protein identification from the results of database searches using tandem mass spectra. A receiver operating characteristic curve demonstrates the sensitivity and accuracy level of the approach. The shortcomings of the models arise in cases where protein assignment is based on unusual peptide fragmentation patterns that dominate over the model encoded in the peptide identification process. We have implemented the approach in a program called PROT_PROBE.
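
The two models described in the abstract can be sketched as follows. This is a minimal illustration and not the PROT_PROBE implementation. All function and parameter names are hypothetical. The abstract does not say how the alternative-hypothesis probability is computed from the hypergeometric peptide scores, so the sketch takes it as an input (`p_alt`). The per-bin probabilities of the multinomial model are likewise assumed to be given, and the marginalized variant for small cross-correlation scores is not covered.

```python
import math


def log_binomial_pmf(k, n, p):
    """Log of the binomial probability of k successes in n Bernoulli trials."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1.0 - p)


def null_probability(protein_length, database_length):
    """Null hypothesis: per-event probability is the protein's relative length."""
    return protein_length / database_length


def log_likelihood_ratio(k, n, p_null, p_alt):
    """Log ratio of binomial probabilities, alternative over null.

    k: number of database matches to the protein's peptides
    n: number of spectra in the data set
    p_alt: assumed to be supplied by the caller (hypothetical input)
    """
    return log_binomial_pmf(k, n, p_alt) - log_binomial_pmf(k, n, p_null)


def log_multinomial_pmf(counts, bin_probs):
    """Log multinomial probability of the observed counts per score bin.

    bin_probs are normalized here so that they sum to 1 across bins.
    """
    if len(counts) != len(bin_probs):
        raise ValueError("counts and bin_probs must have the same length")
    total = sum(bin_probs)
    if total <= 0 or any(p < 0 for p in bin_probs):
        raise ValueError("bin_probs must be non-negative with a positive sum")
    log_p = math.lgamma(sum(counts) + 1)
    for c, p in zip(counts, bin_probs):
        log_p -= math.lgamma(c + 1)
        if c > 0:
            if p == 0:
                return float("-inf")  # observed a score in a zero-probability bin
            log_p += c * math.log(p / total)
    return log_p


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    p0 = null_probability(protein_length=400, database_length=2_000_000)
    print(log_likelihood_ratio(k=5, n=1000, p_null=p0, p_alt=0.004))
    print(log_multinomial_pmf(counts=[0, 2, 3], bin_probs=[0.6, 0.3, 0.1]))
```

Working in log space avoids underflow when the number of spectra is large. A large positive log ratio favors the alternative hypothesis over the length-based null.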
