Habitat-Lite: A GSC Case Study Based on Free Text Terms for Environmental Metadata
- 1 June 2008
- journal article
- research article
- Published by Mary Ann Liebert Inc in OMICS: A Journal of Integrative Biology
- Vol. 12 (2) , 129-136
- https://doi.org/10.1089/omi.2008.0016
Abstract
There is an urgent need to capture metadata on the rapidly growing number of genomic, metagenomic and related sequences, such as 16S ribosomal genes. This need is a major focus within the Genomic Standards Consortium (GSC), and Habitat is a key metadata descriptor in the proposed “Minimum Information about a Genome Sequence” (MIGS) specification. The goal of the work described here is to provide a light-weight, easy-to-use (small) set of terms (“Habitat-Lite”) that captures high-level information about habitat while preserving a mapping to the recently launched Environment Ontology (EnvO). Our motivation for building Habitat-Lite is to meet the needs of multiple users, such as annotators curating these data, database providers hosting the data, and biologists and bioinformaticians alike who need to search and employ such data in comparative analyses. Here, we report a case study based on semiautomated identification of terms from GenBank and GOLD. We estimate that the terms in the initial version of Habitat-Lite would provide useful labels for over 60% of the kinds of information found in the GenBank isolation_source field, and around 85% of the terms in the GOLD habitat field. We present a revised version of Habitat-Lite defined within the EnvO Environmental Ontology through a new category, EnvO-Lite-GSC. We invite the community's feedback on its further development to provide a minimum list of terms to capture high-level habitat information and to provide classification bins needed for future studies.
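The coverage estimates in the abstract come from matching free-text field values against a small controlled vocabulary. The following is a minimal sketch of that idea, not the authors' pipeline: the term list, the function names `label_habitat` and `coverage`, and the sample strings are all hypothetical placeholders, and plain substring matching is assumed in place of whatever semiautomated procedure the study actually used.

```python
# Hypothetical placeholder vocabulary: NOT the published Habitat-Lite term list.
ILLUSTRATIVE_TERMS = ("soil", "marine", "sediment", "fresh water")


def label_habitat(free_text, terms=ILLUSTRATIVE_TERMS):
    """Return the first controlled term found in a free-text value, else None."""
    text = free_text.lower()
    for term in terms:
        if term in text:
            return term
    return None


def coverage(values, terms=ILLUSTRATIVE_TERMS):
    """Fraction of free-text values that receive some label."""
    if not values:
        return 0.0
    labelled = sum(1 for v in values if label_habitat(v, terms) is not None)
    return labelled / len(values)


# Made-up inputs in the style of an isolation_source field (not real records).
samples = [
    "Forest soil, 10 cm depth",
    "marine sediment",
    "human gut",
    "Fresh water lake",
]

if __name__ == "__main__":
    for value in samples:
        print(f"{value!r} -> {label_habitat(value)}")
    print(f"coverage: {coverage(samples):.2f}")
```

Substring matching returns only the first term in list order, so a value such as "marine sediment" gets a single label here; a real curation workflow would likely need multiple labels per record and manual review of unmatched values.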