Limitations of next-generation genome sequence assembly
- 21 November 2010
- journal article
- research article
- Published by Springer Nature in Nature Methods
- Vol. 8 (1), 61-65
- https://doi.org/10.1038/nmeth.1527
Abstract
High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short sequence reads. We compared the recent de novo assemblies using the short oligonucleotide analysis package (SOAP), generated from the genomes of a Han Chinese individual and a Yoruban individual, to experimentally validated genomic features. We found that de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.