Significance of Gapped Sequence Alignments
- 1 November 2008
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 15 (9) , 1187-1194
- https://doi.org/10.1089/cmb.2008.0125
Abstract
Measurement of the statistical significance of extreme sequence alignment scores is key to many important applications, but it is difficult. To precisely approximate alignment score significance, we draw random samples directly from a well-chosen, importance-sampling probability distribution. We apply our technique to pairwise local sequence alignment of nucleic acid and amino acid sequences of length up to 1000. For instance, using a BLOSUM62 scoring system for local sequence alignment, we compute that the p-value of a score of 6000 for the alignment of two sequences of length 1000 is (3.4 ± 0.3) × 10^(-1314). Further, we show that the extreme value significance statistic for the local alignment model that we examine does not follow a Gumbel distribution. A web server for this application is available at http://bayesweb.wadsworth.org/alignmentSignificanceV1/.
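The abstract relies on importance sampling to reach p-values far too small for naive simulation. The sketch below illustrates that general idea on a deliberately simplified toy problem and is not the authors' alignment sampler: it assumes a fixed, ungapped comparison of two random sequences in which each of n positions matches independently with probability p, and it estimates the probability of seeing at least `threshold` matches. Samples are drawn from a tilted proposal in which matches are common, and each sample is reweighted by the likelihood ratio between the true and proposal distributions. All function names and parameter values are hypothetical choices made for this illustration.

```python
import math
import random


def log_weight(k, n, p, q):
    """Log likelihood ratio of Binomial(n, p) over Binomial(n, q) at k matches."""
    return k * math.log(p / q) + (n - k) * math.log((1 - p) / (1 - q))


def importance_sample_tail(n, threshold, p, num_samples, rng):
    """Estimate P(K >= threshold) for K ~ Binomial(n, p) by importance sampling.

    The proposal is Binomial(n, q) with q = threshold / n, so that the rare
    event sits near the proposal mean. Requires 0 < threshold < n.
    Returns (estimate, standard error).
    """
    q = threshold / n
    total = 0.0
    total_sq = 0.0
    for _ in range(num_samples):
        # Draw the number of matching positions under the tilted proposal.
        k = sum(rng.random() < q for _ in range(n))
        # Reweight only the samples that fall in the tail of interest.
        w = math.exp(log_weight(k, n, p, q)) if k >= threshold else 0.0
        total += w
        total_sq += w * w
    mean = total / num_samples
    variance = max(total_sq / num_samples - mean * mean, 0.0)
    return mean, math.sqrt(variance / num_samples)


def exact_tail(n, threshold, p):
    """Exact binomial tail, available here only because the toy model is simple."""
    return sum(
        math.comb(n, j) * p**j * (1 - p) ** (n - j)
        for j in range(threshold, n + 1)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    estimate, stderr = importance_sample_tail(100, 80, 0.25, 20000, rng)
    print(f"importance sampling: {estimate:.3e} +/- {stderr:.1e}")
    print(f"exact binomial tail: {exact_tail(100, 80, 0.25):.3e}")
```

In this toy setting the exact tail can be computed directly, which gives a check on the estimator. For gapped local alignment no such closed form is available, which is why a sampling approach with a reported error bar, as in the abstract's (3.4 ± 0.3) figure, is useful.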