Genome-Wide Identification of Human Functional DNA Using a Neutral Indel Model

Open Access

13 January 2006

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 2 (1) , e5
https://doi.org/10.1371/journal.pcbi.0020005

Abstract

It has become clear that a large proportion of functional DNA in the human genome does not code for protein. Identification of this non-coding functional sequence using comparative approaches is proving difficult and has previously been thought to require deep sequencing of multiple vertebrates. Here we introduce a new model and comparative method that, instead of nucleotide substitutions, uses the evolutionary imprint of insertions and deletions (indels) to infer the past consequences of selection. The model predicts the distribution of indels under neutrality, and shows an excellent fit to human–mouse ancestral repeat data. Across the genome, many unusually long ungapped regions are detected that are unaccounted for by the neutral model, and which we predict to be highly enriched in functional DNA that has been subject to purifying selection with respect to indels. We use the model to determine the proportion under indel-purifying selection to be between 2.56% and 3.25% of human euchromatin. Since annotated protein-coding genes comprise only 1.2% of euchromatin, these results lend further weight to the proposition that more than half the functional complement of the human genome is non-protein-coding. The method is surprisingly powerful at identifying selected sequence using only two or three mammalian genomes. Applying the method to the human, mouse, and dog genomes, we identify 90 Mb of human sequence under indel-purifying selection, at a predicted 10% false-discovery rate and 75% sensitivity. As expected, most of the identified sequence represents unannotated material, while the recovered proportions of known protein-coding and microRNA genes closely match the predicted sensitivity of the method. The method's high sensitivity to functional sequence such as microRNAs suggests that as yet unannotated microRNA genes are enriched among the sequences identified. Furthermore, its independence of substitutions allowed us to identify sequence that has been subject to heterogeneous selection, that is, sequence subject to both positive selection with respect to substitutions and purifying selection with respect to indels. The ability to identify elements under heterogeneous selection enables, for the first time, the genome-wide investigation of positive selection on functional elements other than protein-coding genes.

Despite the major impact of sequencing the human genome on our understanding of biology, a fundamental problem remains. Many of the genome's functional elements, particularly those that do not encode protein, are proving difficult to distinguish from neutrally evolving DNA. Lunter et al. introduce a method that exploits the evolutionary imprint of sequence insertions and deletions (so-called indels) to pinpoint functional DNA regions that have been subject to purifying selection. This method hinges on a simple theoretical prediction for the distribution of indels across the human genome. Despite its simplicity, the model shows an excellent fit to human and mouse alignments. This tight fit has been exploited to show that virtually all ancient transposable elements are evolving neutrally, which has long been suspected but not quantified. Indeed, the model estimates the probability that, among all alignable human sequence, a region has been purged of deleterious indels since the human–mouse split. This leads to the prediction that between 2.56% and 3.25% of the human genome sequence is functional. Importantly, the method is independent of conventional nucleotide substitution approaches, and thus immediately presents an initial opportunity to investigate the impact of positive selection on non-coding functional elements.
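The idea of flagging "unusually long ungapped regions" against a neutral expectation can be illustrated with a small simulation. The sketch below is not the authors' implementation. It assumes, for illustration only, that neutral indels fall independently at a constant per-site probability, so that ungapped segment lengths are geometrically distributed. The function names, the simulated data, and the simple false-discovery estimate are all hypothetical. The neutral rate is fitted on a reference set standing in for ancestral repeats, and a length threshold is then chosen so that the expected neutral share of called segments stays at or below 10%.

```python
import math
import random

def neutral_survival(p_indel, length):
    """P(ungapped segment length >= length) for a geometric law on 1, 2, ..."""
    return (1.0 - p_indel) ** (length - 1)

def fit_indel_rate(segment_lengths):
    """Maximum-likelihood per-site indel probability: 1 / mean segment length."""
    return len(segment_lengths) / sum(segment_lengths)

def length_threshold(segment_lengths, p_indel, max_fdr=0.10):
    """Smallest length L at which the expected neutral count of segments >= L
    is at most max_fdr times the observed count of segments >= L.
    Treating every segment as neutral overstates the neutral count slightly,
    which makes the estimate conservative."""
    n = len(segment_lengths)
    for L in range(1, max(segment_lengths) + 1):
        observed = sum(1 for s in segment_lengths if s >= L)
        expected_neutral = n * neutral_survival(p_indel, L)
        if observed > 0 and expected_neutral / observed <= max_fdr:
            return L
    return None

def sample_geometric(p, rng):
    """Draw one geometric variate on 1, 2, ... by inverse transform."""
    u = rng.random()  # u in [0, 1), so 1 - u is in (0, 1]
    return int(math.log(1.0 - u) / math.log(1.0 - p)) + 1

# Simulated data (assumed values, not taken from the paper).
rng = random.Random(0)
TRUE_P = 0.05
reference = [sample_geometric(TRUE_P, rng) for _ in range(5000)]   # neutral reference
neutral = [sample_geometric(TRUE_P, rng) for _ in range(20000)]    # neutral genome part
selected = [rng.randint(200, 400) for _ in range(300)]             # indel-free under selection
genome = neutral + selected

p_hat = fit_indel_rate(reference)
threshold = length_threshold(genome, p_hat, max_fdr=0.10)
called = [s for s in genome if s >= threshold]

print(f"fitted indel probability: {p_hat:.4f}")
print(f"length threshold at 10% FDR: {threshold}")
print(f"segments called: {len(called)}")
```

In this toy setting the selected segments are much longer than anything the neutral law produces at appreciable frequency, so nearly all of them clear the threshold. Real alignments would need the neutral rate to vary with local sequence context, which this sketch ignores.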
