Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies

Open Access

17 May 2007

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 23 (17) , 2210-2217
https://doi.org/10.1093/bioinformatics/btm267

Abstract

Motivation: In proteomics, reverse database searching is used to control the false match frequency for tandem mass spectrum/peptide sequence matches, but reversal creates sequences devoid of patterns that usually challenge database-search software.

Results: We designed an unsupervised pattern recognition algorithm for detecting patterns with various lengths from large sequence datasets. The patterns found in a protein sequence database were used to create decoy databases using a Monte Carlo sampling algorithm. Searching these decoy databases led to the prediction of false positive rates for spectrum/peptide sequence matches. We show examples where this method, independent of instrumentation, database-search software and samples, provides better estimation of false positive identification rates than a prevailing reverse database searching method. The pattern detection algorithm can also be used to analyze sequences for other purposes in biology or cryptology.

Availability: On request from the authors.

Contact: Bret.Cooper@ars.usda.gov

Supplementary information: http://bioinformatics.psb.ugent.be/
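
The contrast drawn in the abstract between a reversed decoy database and a decoy sampled from observed sequence patterns can be illustrated with a minimal sketch. This is not the authors' algorithm: fixed-length k-mer counts stand in for their unsupervised pattern detection, and the function names (`reverse_decoy`, `kmer_counts`, `sampled_decoy`, `false_match_frequency`), the parameter `k`, and the toy sequences are all hypothetical. It also assumes at least one input sequence is k residues or longer.

```python
import random
from collections import Counter


def reverse_decoy(proteins):
    """Reverse each protein sequence (the conventional decoy construction)."""
    return [seq[::-1] for seq in proteins]


def kmer_counts(proteins, k):
    """Count every overlapping k-mer; an assumed stand-in for detected patterns."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def sampled_decoy(proteins, k=3, seed=0):
    """Build length-matched decoys by Monte Carlo sampling of observed k-mers.

    Each k-mer is drawn with probability proportional to its frequency in the
    input, so the decoys retain short patterns that plain reversal discards.
    """
    rng = random.Random(seed)
    counts = kmer_counts(proteins, k)
    kmers = list(counts)
    weights = [counts[m] for m in kmers]
    decoys = []
    for seq in proteins:
        n_draws = -(-len(seq) // k)  # ceiling division
        pieces = rng.choices(kmers, weights=weights, k=n_draws)
        decoys.append("".join(pieces)[:len(seq)])
    return decoys


def false_match_frequency(target_scores, decoy_scores, threshold):
    """Estimate the false match frequency as decoy hits over target hits."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    proteins = ["MKTAYIAKQR", "MKTLLVAGGA"]  # toy sequences, not real data
    print(reverse_decoy(proteins))
    print(sampled_decoy(proteins, k=3, seed=1))
    # Scores here are made-up placeholders for search-engine match scores.
    print(false_match_frequency([10, 9, 8, 3], [4, 2, 9, 1], threshold=8))
```

In practice the target and decoy scores would come from searching the same spectra against each database with the same search software; the ratio above is one common way to turn decoy hits into an estimated false match frequency, used here only for illustration.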
