A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-tuples

1 March 2011

journal article
research article
Published by Mary Ann Liebert Inc in Journal of Computational Biology

Vol. 18 (3) , 523-534
https://doi.org/10.1089/cmb.2010.0245

Abstract

Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify the sequences in a metagenomic dataset into different bins (i.e., species), based on DNA composition patterns (e.g., tetramer frequencies) that differ between genomes. Composition-based binning methods, however, cannot be used to classify very short fragments, because DNA composition patterns vary substantially within a single genome. We developed a novel approach (AbundanceBin) for metagenomic binning that exploits the different abundances of species living in the same environment. AbundanceBin is an application of the Lander-Waterman model to metagenomics and is based on the l-tuple content of the reads. AbundanceBin achieved accurate, unsupervised clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample. In addition, AbundanceBin gave accurate estimates of species abundances, as well as their genome sizes, two important parameters for characterizing a microbial community. We also show that AbundanceBin performed well when the reads were very short (e.g., 75 bp) or contained sequencing errors. By combining AbundanceBin with a composition-based method (MetaCluster), we can achieve even higher binning accuracy. Supplementary Material is available at www.liebertonline.com/cmb.
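To make the idea of abundance-based binning concrete, the following is a minimal sketch and not the AbundanceBin implementation. It assumes that the number of times an l-tuple occurs across all reads is roughly Poisson distributed, with a mean set by the abundance of the species it comes from, and it fits a mixture of Poissons to those counts by expectation-maximization. The function names (`count_ltuples`, `fit_poisson_mixture`, `assign_read`), the initialization scheme, and the read-assignment rule are all illustrative assumptions rather than details taken from the paper.

```python
import math
from collections import Counter


def count_ltuples(reads, l):
    """Count every length-l substring (l-tuple) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - l + 1):
            counts[read[i:i + l]] += 1
    return counts


def log_poisson(c, lam):
    """Log of the Poisson probability P(X = c) for mean lam."""
    return c * math.log(lam) - lam - math.lgamma(c + 1)


def fit_poisson_mixture(values, k, iters=50):
    """Fit a k-component Poisson mixture to l-tuple counts with EM.

    Assumption: each component stands for one abundance level (bin).
    Returns (means, weights), one entry per bin.
    """
    lo, hi = min(values), max(values)
    # Spread the initial means evenly across the observed count range.
    lams = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior probability of each bin for each count.
        resp = []
        for c in values:
            logs = [math.log(w) + log_poisson(c, lam)
                    for w, lam in zip(weights, lams)]
            top = max(logs)
            probs = [math.exp(x - top) for x in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate each bin's mean count and mixing weight.
        for j in range(k):
            rj = sum(r[j] for r in resp)
            weighted = sum(r[j] * c for r, c in zip(resp, values))
            weights[j] = max(rj / len(values), 1e-12)
            lams[j] = max(weighted / max(rj, 1e-12), 1e-6)
    return lams, weights


def assign_read(read, l, counts, lams, weights):
    """Assign a read to the bin that best explains its l-tuple counts.

    Illustrative rule: sum the per-bin log-likelihoods over the read's
    l-tuples and return the index of the highest-scoring bin.
    """
    scores = [math.log(w) for w in weights]
    for i in range(len(read) - l + 1):
        c = counts[read[i:i + l]]
        for j, lam in enumerate(lams):
            scores[j] += log_poisson(c, lam)
    return max(range(len(lams)), key=lambda j: scores[j])
```

In this sketch a caller would first build the l-tuple table with `count_ltuples(reads, l)`, fit the mixture on `list(counts.values())`, and then call `assign_read` for each read. The fitted means serve only as rough proxies for relative abundance; the genome-size estimation and the handling of sequencing errors described in the abstract are not modeled here.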
