Uncertainty in homology inferences: Assessing and improving genomic sequence alignment
Open Access
- 11 December 2007
- journal article
- research article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 18 (2), 298-309
- https://doi.org/10.1101/gr.6725608
Abstract
Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human–mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman–Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.
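For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of classic Needleman–Wunsch global alignment. It is not the MPD algorithm and not the authors' software. The function name `needleman_wunsch`, the linear gap penalty, and the score values (`match=1`, `mismatch=-1`, `gap=-2`) are illustrative assumptions, not parameters taken from the paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty.

    The scores are illustrative assumptions, not values from the paper.
    Returns (best_score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the dynamic-programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )

    # Trace back one highest-scoring path from the bottom-right corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

A call such as `needleman_wunsch("ACGT", "AGT")` returns a single highest-scoring alignment and nothing else. That single-answer output is what the abstract contrasts with a probabilistic treatment: the sketch says nothing about how confident one should be in any individual aligned column.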