A Parallel Algorithm for Error Correction in High-Throughput Short-Read Data on CUDA-Enabled Graphics Hardware

1 April 2010

journal article
research article
Published by Mary Ann Liebert Inc in Journal of Computational Biology

Vol. 17 (4), 603-615
https://doi.org/10.1089/cmb.2009.0062

Abstract

Emerging DNA sequencing technologies open up exciting new opportunities for genome sequencing by generating read data with massive throughput. However, the reads they produce are significantly shorter and more error-prone than those of the traditional Sanger shotgun sequencing method. This poses challenges for de novo DNA fragment assembly algorithms in terms of both accuracy (to deal with short, error-prone reads) and scalability (to deal with very large input data sets). In this article, we present a scalable parallel algorithm for correcting sequencing errors in high-throughput short-read data, so that error-free reads are available before DNA fragment assembly, which is of high importance to many graph-based short-read assembly tools. The algorithm is based on spectral alignment and uses the Compute Unified Device Architecture (CUDA) programming model. To gain efficiency, we take advantage of CUDA texture memory, using a space-efficient Bloom filter data structure for spectrum membership queries. We have tested the runtime and accuracy of our algorithm using real and simulated Illumina data for different read lengths, error rates, input sizes, and algorithmic parameters. On a CUDA-enabled mass-produced GPU (available for less than US$400 at any local computer outlet), our implementation achieves speedups of 12–84 times for the parallelized error correction, and speedups of 3–63 times for sequential preprocessing and parallelized error correction combined, compared to the publicly available Euler-SR program. Our implementation is freely available for download from http://cuda-ec.sourceforge.net.
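
The idea of spectrum membership queries backed by a Bloom filter can be illustrated with a small CPU-only sketch. This is not the article's CUDA implementation: the class and function names (`BloomFilter`, `build_spectrum`, `is_solid_read`, `correct_read`), the SHA-256 double-hashing scheme, the filter size, and the simple single-substitution search are all assumptions chosen for illustration. The sketch treats a k-mer as part of the spectrum if it occurs at least `min_count` times in the input reads, and it accepts a corrected read only if every one of its k-mers is in the spectrum.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Bit-array set with no false negatives and a small false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive num_hashes bit positions from one digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def kmers(read, k):
    """Return all overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_spectrum(reads, k, min_count, num_bits=1 << 16, num_hashes=4):
    """Insert every k-mer seen at least min_count times into a Bloom filter."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    spectrum = BloomFilter(num_bits, num_hashes)
    for km, count in counts.items():
        if count >= min_count:
            spectrum.add(km)
    return spectrum


def is_solid_read(read, k, spectrum):
    """A read is solid if every one of its k-mers is in the spectrum."""
    return all(km in spectrum for km in kmers(read, k))


def correct_read(read, k, spectrum):
    """Try every single-base substitution; return the first solid candidate.

    Illustrative only: returns the read unchanged if no single substitution
    makes it solid.
    """
    if is_solid_read(read, k, spectrum):
        return read
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if is_solid_read(candidate, k, spectrum):
                return candidate
    return read
```

In this sketch, a spectrum would be built once with `build_spectrum(reads, k, min_count)` and each read then passed to `correct_read` independently. Because the per-read work shares only read-only filter lookups, a loop of this kind is a natural candidate for a one-thread-per-read GPU mapping, although the sketch itself makes no attempt at that.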
