Toward an accurate statistics of gapped alignments
- 1 January 2005
- journal article
- Published by Springer Nature in Bulletin of Mathematical Biology
- Vol. 67 (1) , 169-191
- https://doi.org/10.1016/j.bulm.2004.07.001
Abstract
Sequence alignment has been an invaluable tool for finding homologous sequences. The significance of the homology found is often quantified statistically by p-values. Theory for computing p-values exists for gapless alignments [Karlin, S., Altschul, S.F., 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268; Karlin, S., Dembo, A., 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24, 113–140], but a full generalization to alignments with gaps is not yet complete. We present a unified statistical analysis of two common sequence comparison algorithms: maximum-score (Smith-Waterman) alignments and their generalized probabilistic counterparts, including maximum-likelihood alignments and hidden Markov models. The most important statistical characteristic of these algorithms is the distribution function of the maximum score S_max or, respectively, the maximum free energy F_max, for mutually uncorrelated random sequences. This distribution is known empirically to be of the Gumbel form with an exponential tail P(S_max > x) ∼ exp(−λx) for maximum-score alignment and P(F_max > x) ∼ exp(−λx) for some classes of probabilistic alignment. We derive an exact expression for λ for particular probabilistic alignments. This result is then used to obtain accurate λ values for generic probabilistic and maximum-score alignments. Although the result demonstrated uses a simple match-mismatch scoring system, it is expected to be a good starting point for more general scoring functions.
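For orientation, the decay rate λ in the gapless case covered by the cited Karlin-Altschul theory can be computed directly. The sketch below is not the gapped-alignment result derived in the article. It assumes uniform letter frequencies and a match-mismatch scoring system, and solves the standard gapless condition p·exp(λ·match) + (1 − p)·exp(λ·mismatch) = 1 for its positive root by bisection. The function name `gapless_lambda` and its parameters are hypothetical, chosen only for this illustration.

```python
import math


def gapless_lambda(match=1.0, mismatch=-1.0, alphabet_size=4, tol=1e-12):
    """Positive root of p*exp(lam*match) + (1-p)*exp(lam*mismatch) = 1.

    Assumes uniform, independent letters, so the match probability is
    p = 1/alphabet_size. This is the gapless case only.
    """
    p = 1.0 / alphabet_size
    expected_score = p * match + (1.0 - p) * mismatch
    if match <= 0 or expected_score >= 0:
        raise ValueError("need a positive match score and a negative expected score")

    def f(lam):
        return p * math.exp(lam * match) + (1.0 - p) * math.exp(lam * mismatch) - 1.0

    # f(0) = 0 with negative slope, and f is convex, so f is negative just
    # above zero and crosses zero exactly once for lam > 0.
    lo, hi = 1e-9, 1.0
    while f(hi) <= 0:
        hi *= 2.0  # expand until the root is bracketed

    # Plain bisection on the bracket [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Four-letter alphabet, match = +1, mismatch = -1.
    print(gapless_lambda())
```

For a four-letter alphabet with scores +1 and −1, the condition reduces to x² − 4x + 3 = 0 with x = exp(λ), whose nontrivial root gives λ = ln 3. Allowing gaps changes λ, which is the problem the article addresses.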