Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies
Open Access
- 17 May 2007
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 23 (17), 2210-2217
- https://doi.org/10.1093/bioinformatics/btm267
Abstract
Motivation: In proteomics, reverse database searching is used to control the false match frequency for tandem mass spectrum/peptide sequence matches, but reversal creates sequences devoid of patterns that usually challenge database-search software.
Results: We designed an unsupervised pattern recognition algorithm for detecting patterns with various lengths from large sequence datasets. The patterns found in a protein sequence database were used to create decoy databases using a Monte Carlo sampling algorithm. Searching these decoy databases led to the prediction of false positive rates for spectrum/peptide sequence matches. We show examples where this method, independent of instrumentation, database-search software and samples, provides better estimation of false positive identification rates than a prevailing reverse database searching method. The pattern detection algorithm can also be used to analyze sequences for other purposes in biology or cryptology.
Availability: On request from the authors.
Contact: Bret.Cooper@ars.usda.gov
Supplementary information: http://bioinformatics.psb.ugent.be/
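The contrast drawn in the abstract between a reversed decoy database and one built by Monte Carlo sampling of observed sequence patterns can be illustrated with a minimal Python sketch. This is not the authors' algorithm: fixed-length k-mers stand in for the variable-length patterns the paper detects, and the function names (`reverse_decoy`, `sampled_decoy`, `estimated_false_match_rate`), the parameter `k`, and the decoy-to-target ratio used as the rate estimate are all assumptions made for illustration.

```python
import random
from collections import Counter


def reverse_decoy(proteins):
    """Baseline decoy: reverse every protein sequence."""
    return [p[::-1] for p in proteins]


def kmer_counts(proteins, k):
    """Count every overlapping k-mer in the target database.

    Assumption: fixed-length k-mers stand in for the variable-length
    patterns found by the paper's pattern recognition step.
    """
    counts = Counter()
    for p in proteins:
        for i in range(len(p) - k + 1):
            counts[p[i:i + k]] += 1
    return counts


def sampled_decoy(proteins, k=3, seed=0):
    """Monte Carlo decoy: rebuild each protein from k-mers drawn at
    random, weighted by their empirical frequency in the targets.
    Each decoy keeps the length of its target sequence."""
    rng = random.Random(seed)
    counts = kmer_counts(proteins, k)
    if not counts:
        raise ValueError("no sequence is at least k residues long")
    kmers = list(counts)
    weights = [counts[m] for m in kmers]
    decoys = []
    for p in proteins:
        n_chunks = -(-len(p) // k)  # ceiling division
        chunks = rng.choices(kmers, weights=weights, k=n_chunks)
        decoys.append("".join(chunks)[:len(p)])
    return decoys


def estimated_false_match_rate(target_hits, decoy_hits):
    """Hypothetical estimator: decoy matches above a score threshold
    divided by target matches above the same threshold."""
    if target_hits == 0:
        return 0.0
    return min(1.0, decoy_hits / target_hits)
```

In use, both decoy lists would be searched with the same spectra and score threshold as the target database, and the resulting match counts passed to `estimated_false_match_rate`. The sampled decoy retains short patterns that occur in real proteins, which is the property the abstract says reversal lacks.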