A fast, lock-free approach for efficient parallel counting of occurrences of k-mers
Top Cited Papers
- 7 January 2011
- journal article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 27 (6) , 764-770
- https://doi.org/10.1093/bioinformatics/btr011
Abstract
Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm. Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution. Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish. Contact: gmarcais@umd.edu Supplementary information: Supplementary data are available at Bioinformatics online.Keywords
This publication has 19 references indexed in Scilit:
- Quake: quality-aware detection and correction of sequencing errorsGenome Biology, 2010
- Multi-Platform Next-Generation Sequencing of the Domestic Turkey (Meleagris gallopavo): Genome Assembly and AnalysisPLoS Biology, 2010
- The sequence and de novo assembly of the giant panda genomeNature, 2009
- A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomesBMC Genomics, 2008
- MapReduceCommunications of the ACM, 2008
- RAP: a new computer program for de novo identification of repeated sequences in whole genomesBioinformatics, 2004
- MUSCLE: multiple sequence alignment with high accuracy and high throughputNucleic Acids Research, 2004
- FORRepeats: detects repeats on entire chromosomes and between genomesBioinformatics, 2003
- Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2Genome Research, 2003
- Annotating Large Genomes With Exact Word MatchesGenome Research, 2003