PSI-BLAST pseudocounts and the minimum description length principle
Open Access
- 16 December 2008
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 37 (3), 815-824
- https://doi.org/10.1093/nar/gkn981
Abstract
Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
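To make the role of the pseudocount number concrete, the following is a minimal sketch, not PSI-BLAST's actual procedure. It assumes a toy four-letter alphabet, a uniform background distribution, and pseudocounts distributed in proportion to that background; the function name `column_scores` and all numeric values are hypothetical and chosen only for illustration. It shows how lowering the number of pseudocounts for a conserved column raises the log-odds score of the dominant residue, which is the contrast effect the abstract describes.

```python
import math


def column_scores(counts, background, num_pseudocounts):
    """Return log-odds scores (in bits) for one alignment column.

    Assumption: pseudocounts are spread according to the background
    frequencies. This is a simplification made for illustration only.
    """
    total = sum(counts.values())
    scores = {}
    for residue, p in background.items():
        observed = counts.get(residue, 0.0)
        # Estimated frequency: observed counts plus pseudocounts, normalized.
        q = (observed + num_pseudocounts * p) / (total + num_pseudocounts)
        scores[residue] = math.log2(q / p)
    return scores


# Hypothetical toy data: a 4-letter alphabet with a uniform background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved_column = {"A": 8, "C": 2}

many = column_scores(conserved_column, background, num_pseudocounts=10)
few = column_scores(conserved_column, background, num_pseudocounts=2)

for residue in background:
    print(residue, round(many[residue], 3), round(few[residue], 3))
```

With fewer pseudocounts the estimated frequencies stay closer to the observed counts, so the score for the conserved residue `A` rises and the scores for unobserved residues fall. Holding `num_pseudocounts` fixed while the number of observed counts grows corresponds, in this simplified setting, to the practice of using a pseudocount number independent of alignment size.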