Efficient de novo assembly of large genomes using compressed data structures
Open Access
- 7 December 2011
- journal article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 22 (3) , 549-556
- https://doi.org/10.1101/gr.126953.111
Abstract
De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.
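The abstract names the FM-index, built on the Burrows-Wheeler transform, as the core data structure. As a rough illustration of the idea only, the Python sketch below builds an uncompressed FM-index over a single toy string and counts pattern occurrences with backward search. It is not SGA's implementation: SGA indexes a large set of reads and keeps the transform compressed, whereas this sketch sorts full rotations and stores plain occurrence tables. The names `bwt`, `FMIndex`, and `count` are hypothetical and chosen for this example.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    assert "$" not in text
    s = text + "$"  # '$' sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


class FMIndex:
    """Uncompressed FM-index over one string (illustrative, not SGA's layout)."""

    def __init__(self, text):
        self.bwt = bwt(text)
        counts = Counter(self.bwt)
        # C[c]: number of characters in the text strictly smaller than c
        self.C = {}
        total = 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]
        # occ[c][i]: occurrences of c in bwt[:i]
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (1 if c == ch else 0)

    def count(self, pattern):
        """Count occurrences of pattern using backward search."""
        lo, hi = 0, len(self.bwt)  # half-open interval of matching suffixes
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo
```

For example, `FMIndex("GATTACAGATTACA").count("ATTA")` counts how often `ATTA` occurs in the toy sequence. Each step of the search narrows an interval of sorted suffixes by one pattern character, so query time depends on the pattern length rather than the text length; overlap detection between reads builds on the same interval updates.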