Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets

Open Access

10 October 2008

journal article
research article
Published by Public Library of Science (PLoS) in PLOS ONE

Vol. 3 (10), e3375
https://doi.org/10.1371/journal.pone.0003375

Abstract

The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches to gene and genome annotation. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets defy conventional analysis and annotation methods, not only because of their sheer size but also because of many other features. In this study, we describe an approach for rapid analysis of the sequence diversity and internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified in the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families in a sequence similarity search and might represent novel protein families. The distributions of the large clusters are illustrated by organism composition, functional class, and sample location. Our clustering took about two orders of magnitude less computational effort than the comparable protein family analysis in the original GOS study. This approach will help in analyzing other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project.
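The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea behind greedy incremental clustering applied hierarchically, as CD-HIT-style tools are commonly described: sequences are processed from longest to shortest, each one joins the first representative it resembles closely enough, and the representatives are then re-clustered at a looser threshold. Everything specific here is an assumption made for illustration: the function names, the shared k-mer fraction used as a stand-in for alignment-based sequence identity, and the thresholds 0.9 and 0.6 are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, List, Set


def kmer_set(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Crude identity proxy (an assumption, not real alignment identity):
    fraction of the smaller k-mer set that is shared with the other one."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    smaller = ka if len(ka) <= len(kb) else kb
    if not smaller:
        return 0.0
    return len(ka & kb) / len(smaller)


def greedy_cluster(seqs: Iterable[str], threshold: float) -> Dict[str, List[str]]:
    """Greedy incremental clustering: longest sequences become representatives."""
    clusters: Dict[str, List[str]] = {}
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if similarity(rep, seq) >= threshold:
                clusters[rep].append(seq)  # join the first matching cluster
                break
        else:
            clusters[seq] = [seq]  # no match: seq starts a new cluster
    return clusters


def hierarchical_cluster(seqs: Iterable[str],
                         thresholds=(0.9, 0.6)) -> Dict[str, List[str]]:
    """Cluster at the strictest threshold first, then re-cluster only the
    representatives at each looser threshold, merging their members."""
    members: Dict[str, List[str]] = {s: [s] for s in seqs}
    for t in thresholds:
        level = greedy_cluster(list(members), t)
        members = {
            rep: [m for child in children for m in members[child]]
            for rep, children in level.items()
        }
    return members
```

Because each later round compares only representatives rather than every sequence, the number of comparisons shrinks at every level, which is the intuition behind clustering very large sets in stages. A real tool would replace `similarity` with a filtered alignment step.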
