Using Text Analysis to Identify Functionally Coherent Gene Groups
Open Access
- 1 October 2002
- journal article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 12 (10) , 1582-1590
- https://doi.org/10.1101/gr.116402
Abstract
The analysis of large-scale genomic information (such as sequence data or expression patterns) frequently involves grouping genes on the basis of common experimental features. Often, as with gene expression clustering, there are too many groups to easily identify the functionally relevant ones. One valuable source of information about gene function is the published literature. We present a method, neighbor divergence, for assessing whether the genes within a group share a common biological function based on their associated scientific literature. The method uses statistical natural language processing techniques to interpret biological text. It requires only a corpus of documents relevant to the genes being studied (e.g., all genes in an organism) and an index connecting the documents to appropriate genes. Given a group of genes, neighbor divergence assigns a numerical score indicating how “functionally coherent” the gene group is from the perspective of the published literature. We evaluate our method by testing its ability to distinguish 19 known functional gene groups from 1900 randomly assembled groups. Neighbor divergence achieves 79% sensitivity at 100% specificity, comparing favorably to other tested methods. We also apply neighbor divergence to previously published gene expression clusters to assess its ability to recognize gene groups that had been manually identified as representative of a common function.
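The evaluation figure quoted in the abstract, sensitivity at 100% specificity, can be illustrated with a minimal sketch. It assumes each gene group has already been given a coherence score; it does not implement neighbor divergence itself. The function name and the toy scores below are hypothetical and are not data from the paper. With 19 functional groups, 79% sensitivity would correspond to 15 of the 19 groups scoring above every random group.

```python
def sensitivity_at_full_specificity(functional_scores, random_scores):
    """Fraction of functional groups scoring strictly above every random group.

    Placing the threshold at the highest random-group score means no random
    group is called coherent (100% specificity); sensitivity is then the
    share of functional groups that still clear that threshold.
    """
    threshold = max(random_scores)
    hits = sum(1 for score in functional_scores if score > threshold)
    return hits / len(functional_scores)


# Hypothetical toy scores, for illustration only (not results from the paper).
functional = [9.1, 7.4, 6.8, 2.0, 0.5]
random_groups = [0.1, 0.9, 1.3, 2.5, 0.4]

print(sensitivity_at_full_specificity(functional, random_groups))  # 0.6
```

In this toy case the threshold is 2.5, so three of the five functional groups are recovered without admitting any random group.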