Improved gap size estimation for scaffolding algorithms
Open Access
- 20 August 2012
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 28 (17), 2215-2222
- https://doi.org/10.1093/bioinformatics/bts441
Abstract
Motivation: One of the important steps of genome assembly is scaffolding, in which contigs are linked using information from read pairs. Scaffolding provides estimates of the order, relative orientation and distance between contigs. We have found that contig distance estimates are generally strongly biased and based on false assumptions. Since erroneous distance estimates can mislead subsequent analysis, it is important to provide unbiased estimates of contig distance.
Results: In this article, we show that state-of-the-art programs for scaffolding use an incorrect model of gap size estimation. We discuss why current maximum likelihood estimators are biased and describe the different cases of bias that arise. Furthermore, we provide a model for the distribution of reads that span a gap and derive the maximum likelihood equation for the gap length. We explain why this estimate is sound and show empirically that it outperforms the gap estimators in popular scaffolding programs. Our results have consequences for scaffolding software, for structural variation detection and for library insert-size estimation as commonly performed by read aligners.
Availability: A reference implementation is provided at https://github.com/SciLifeLab/gapest
Supplementary information: Supplementary data are available at Bioinformatics online.
Contact: ksahlin@csc.kth.se
This publication has 18 references indexed in Scilit.