Finite-state models in the alignment of macromolecules
- 1 July 1992
- journal article
- research article
- Published by Springer Nature in Journal of Molecular Evolution
- Vol. 35 (1), 77-89
- https://doi.org/10.1007/bf00160262
Abstract
Minimum message length encoding is a technique of inductive inference with theoretical and practical advantages. It allows the posterior odds-ratio of two theories or hypotheses to be calculated. Here it is applied to problems of aligning or relating two strings, in particular two biological macromolecules. We compare the r-theory, that the strings are related, with the null-theory, that they are not related. If they are related, the probabilities of the various alignments can be calculated. This is done for one-, three-, and five-state models of relation or mutation. These correspond to linear and piecewise linear cost functions on runs of insertions and deletions. We describe how to estimate parameters of a model. The validity of a model is itself an hypothesis and can be objectively tested. This is done on real DNA strings and on artificial data. The tests on artificial data indicate limits on what can be inferred in various situations. The tests on real DNA support either the three- or five-state models over the one-state model. Finally, a fast, approximate minimum message length string comparison algorithm is described.
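The comparison of the r-theory with the null-theory can be illustrated with a minimal Python sketch of a one-state model. It is not the paper's implementation. It assumes a uniform distribution over the four DNA bases, illustrative step probabilities (p_match, p_change, p_indel are hypothetical values, not estimates from the paper), and it ignores the cost of stating string lengths and parameter values. The r-theory message length is taken as the negative log2 of the probability summed over all alignments; the null-theory encodes each string independently at 2 bits per base. The difference approximates the log posterior odds-ratio in bits under equal priors.

```python
import math

ALPHABET_SIZE = 4  # DNA; assumes a uniform distribution over bases


def log2_add(x, y):
    """Return log2(2**x + 2**y) while staying in the log domain."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    m = max(x, y)
    return m + math.log2(2.0 ** (x - m) + 2.0 ** (y - m))


def r_theory_bits(a, b, p_match=0.8, p_change=0.1, p_indel=0.05):
    """Message length (bits) of a and b under a one-state model,
    summed over all alignments. Parameter values are assumptions."""
    if not math.isclose(p_match + p_change + 2 * p_indel, 1.0):
        raise ValueError("step probabilities must sum to 1")
    # log2 probability of one step of each kind, including the bases
    lg_match = math.log2(p_match / ALPHABET_SIZE)
    lg_change = math.log2(p_change / (ALPHABET_SIZE * (ALPHABET_SIZE - 1)))
    lg_indel = math.log2(p_indel / ALPHABET_SIZE)

    # d[i][j] = log2 probability of generating a[:i] and b[:j]
    d = [[-math.inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 and j == 0:
                continue
            total = -math.inf
            if i > 0:  # base of a with no partner in b
                total = log2_add(total, d[i - 1][j] + lg_indel)
            if j > 0:  # base of b with no partner in a
                total = log2_add(total, d[i][j - 1] + lg_indel)
            if i > 0 and j > 0:  # match or change
                step = lg_match if a[i - 1] == b[j - 1] else lg_change
                total = log2_add(total, d[i - 1][j - 1] + step)
            d[i][j] = total
    return -d[len(a)][len(b)]


def null_theory_bits(a, b):
    """Message length (bits) when the strings are encoded independently."""
    return (len(a) + len(b)) * math.log2(ALPHABET_SIZE)


def log_odds_bits(a, b):
    """Positive values favour the r-theory, negative the null-theory."""
    return null_theory_bits(a, b) - r_theory_bits(a, b)
```

For example, `log_odds_bits("ACGTACGTAC", "ACGTACGTAC")` is positive (related strings compress better together), while `log_odds_bits("AAAAAAAAAA", "CCCCCCCCCC")` is negative under these assumed parameters. Replacing the single state with separate match, insert, and delete states would give a three-state variant with linear costs on runs of insertions and deletions.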