Minimal-Risk Scoring Matrices for Sequence Analysis
- 1 January 1999
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 6 (2), 219-235
- https://doi.org/10.1089/cmb.1999.6.219
Abstract
We introduce a minimal-risk method for estimating the frequencies of amino acids at conserved positions in a protein family. Our method, called minimal-risk estimation, finds the optimal weighting between a set of observed amino acid counts and a set of pseudofrequencies, which represent prior information about the frequencies. We compute the optimal weighting by minimizing the expected distance between the estimated frequencies and the true population frequencies, measured by either a squared-error or a relative-entropy metric. Our method accounts for the source of the pseudofrequencies, which arise either from the background distribution of amino acids or from applying a substitution matrix to the observed data. Our frequency estimates therefore depend on the size and composition of the observed data as well as the source of the pseudofrequencies. We convert our frequency estimates into minimal-risk scoring matrices for sequence analysis. A large-scale cross-validation study, involving 48 variants of seven methods, shows that the best performing method is minimal-risk estimation using the squared-error metric. Our method is implemented in the package EMATRIX, which is available on the Internet at http://motif.stanford.edu/ematrix.
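The weighting idea in the abstract can be illustrated with a small sketch. This is not the paper's derivation or the EMATRIX code: it covers only the simplest case, where the pseudofrequencies q are a fixed background distribution independent of the data and the distance is squared error. Under those assumptions, for an estimate (1 - w) f + w q built from observed frequencies f of n multinomial samples, the expected squared error is (1 - w)^2 V + w^2 B, with V = (1 - sum p_i^2) / n and B = ||q - p||^2, and the minimizing weight is w = V / (V + B). Because the true frequencies p are unknown, the sketch plugs in f and uses the fact that ||q - f||^2 has expectation V + B. The function names, the four-letter toy alphabet, and the log-odds conversion to scores are all illustrative assumptions.

```python
import math


def shrinkage_weight(counts, pseudo):
    """Plug-in weight on the pseudofrequencies under squared-error risk.

    Assumption: `pseudo` is a fixed distribution (e.g. background
    frequencies) that does not depend on `counts`.
    """
    n = sum(counts)
    if n == 0:
        return 1.0  # no observations: rely entirely on the prior
    f = [c / n for c in counts]
    # Plug-in estimate of V = E||f - p||^2 for multinomial sampling.
    variance = (1.0 - sum(x * x for x in f)) / n
    # ||q - f||^2, whose expectation is V + B.
    gap = sum((q - x) ** 2 for q, x in zip(pseudo, f))
    if gap == 0.0:
        return 1.0  # observed and prior frequencies coincide
    return min(1.0, variance / gap)


def estimate_frequencies(counts, pseudo):
    """Mix observed frequencies with pseudofrequencies using the weight above."""
    n = sum(counts)
    if n == 0:
        return list(pseudo)
    w = shrinkage_weight(counts, pseudo)
    return [(1.0 - w) * c / n + w * q for c, q in zip(counts, pseudo)]


def log_odds_scores(freqs, background):
    """Convert estimated frequencies to log-odds scores (base 2).

    Assumption: log-odds against the background is used here as a common
    way to turn frequencies into scores; a zero frequency maps to -inf.
    """
    return [
        math.log2(p / b) if p > 0 else float("-inf")
        for p, b in zip(freqs, background)
    ]


if __name__ == "__main__":
    background = [0.25, 0.25, 0.25, 0.25]  # toy four-letter alphabet
    counts = [6, 2, 1, 1]                  # observed counts at one position
    est = estimate_frequencies(counts, background)
    print("weight on prior:", shrinkage_weight(counts, background))
    print("estimated frequencies:", est)
    print("scores:", log_odds_scores(est, background))
```

In this sketch the weight on the prior shrinks as the number of observations grows, which mirrors the abstract's statement that the estimates depend on the size and composition of the observed data. A fully conserved column gets a plug-in variance of zero and therefore no prior weight, which is one limitation of the simplified plug-in approach shown here.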