Ultrafast clustering algorithms for metagenomic sequence analysis
Open Access
- 6 July 2012
- journal article
- research article
- Published by Oxford University Press (OUP) in Briefings in Bioinformatics
- Vol. 13 (6) , 656-668
- https://doi.org/10.1093/bib/bbs035
Abstract
The rapid advances of high-throughput sequencing technologies have dramatically promoted metagenomic studies of microbial communities that exist in various environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations and their functions and interactions. However, the massive quantity and complexity of these sequence data pose tremendous challenges in data analysis. These challenges include, but are not limited to, ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families. In addition, clustering analysis addresses the challenges in metagenomics. Thus, a large redundant data set can be represented with a small non-redundant set, where each cluster can be represented by a single entry or a consensus. Artifacts can be rapidly detected through clustering. Errors can be identified, filtered or corrected by using consensus from sequences within clusters.
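The idea of replacing a redundant data set with one representative per cluster can be illustrated with a minimal sketch. It assumes a greedy incremental scheme (process sequences longest first, and let each one either join the first representative it matches or start a new cluster), which is a common approach but is not described in the abstract itself. The `identity` function, the 0.9 threshold and the toy reads are all hypothetical; `difflib.SequenceMatcher` stands in for a real alignment-based identity and would be far too slow for actual metagenomic data.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; an assumed stand-in for alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.9):
    """Group sequences around representatives, processing the longest first."""
    clusters = []  # list of (representative, members) pairs
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)  # similar enough: join this cluster
                break
        else:
            clusters.append((seq, [seq]))  # no match: seq becomes a new representative
    return clusters


# Hypothetical toy reads: two families with near-duplicates.
reads = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",  # one mismatch against the first read
    "ACGTACGTACGTACGTACGT",  # exact duplicate of the first read
    "TTTTGGGGCCCCAAAATTTT",
    "TTTTGGGGCCCCAAAATTTA",  # one mismatch against the fourth read
]

clusters = greedy_cluster(reads, threshold=0.9)
representatives = [rep for rep, _ in clusters]  # the small non-redundant set
```

On these five toy reads the sketch produces two clusters, so the non-redundant set holds two representatives. A consensus per cluster, as the abstract mentions, would require an additional alignment step that is not shown here.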