Associating Genes with Gene Ontology Codes Using a Maximum Entropy Analysis of Biomedical Literature
Open Access
- 1 January 2002
- journal article
- research article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 12 (1) , 203-214
- https://doi.org/10.1101/gr.199701
Abstract
Functional characterizations of thousands of gene products from many species are described in the published literature. These discussions are extremely valuable for characterizing the functions not only of these gene products, but also of their homologs in other organisms. The Gene Ontology (GO) is an effort to create a controlled terminology for labeling gene functions in a more precise, reliable, computer-readable manner. Currently, the best annotations of gene function with the GO are performed by highly trained biologists who read the literature and select appropriate codes. In this study, we explored the possibility that statistical natural language processing techniques can be used to assign GO codes. We compared three document classification methods (maximum entropy modeling, naïve Bayes classification, and nearest-neighbor classification) on the problem of associating a set of GO codes (for biological process) with literature abstracts and thus with the genes associated with the abstracts. We showed that maximum entropy modeling outperforms the other methods and achieves an accuracy of 72% when ascertaining the function discussed within an abstract. The maximum entropy method provides confidence measures that correlate well with performance. We conclude that statistical methods may be used to assign GO codes and may be useful for the difficult task of reassignment as terminology standards evolve over time.
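As a rough illustration of the kind of classifier the abstract describes, the sketch below fits a maximum entropy model (equivalently, multinomial logistic regression) over bag-of-words features and reports the probability of the predicted label as a confidence measure. It is not the authors' implementation: the toy abstracts, the three process labels, the function names, and the training settings (plain stochastic gradient ascent with a small L2 penalty) are all assumptions made for this example.

```python
import math
from collections import Counter

# Invented toy abstracts with hypothetical process labels (not data from the study).
TOY_ABSTRACTS = [
    ("the kinase phosphorylates its substrate during signal transduction", "signal transduction"),
    ("receptor binding activates a kinase signaling cascade", "signal transduction"),
    ("the enzyme catalyzes glucose breakdown in glycolysis", "metabolism"),
    ("mutants fail to metabolize glucose and accumulate lipid", "metabolism"),
    ("cyclin levels control entry into mitosis", "cell cycle"),
    ("the protein is required for spindle assembly in mitosis", "cell cycle"),
]


def featurize(text):
    """Bag-of-words counts from lowercased whitespace tokens."""
    return Counter(text.lower().split())


def predict_proba(weights, feats):
    """Softmax over per-class linear scores of the word counts."""
    scores = {c: sum(wc[w] * n for w, n in feats.items()) for c, wc in weights.items()}
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def train_maxent(examples, epochs=200, lr=0.1, l2=0.01):
    """Fit class-by-word weights by stochastic gradient ascent on the log-likelihood."""
    data = [(featurize(text), label) for text, label in examples]
    weights = {label: Counter() for _, label in examples}
    for _ in range(epochs):
        for feats, gold in data:
            probs = predict_proba(weights, feats)
            for c, wc in weights.items():
                # Gradient term: observed indicator minus predicted probability.
                err = (1.0 if c == gold else 0.0) - probs[c]
                for w, n in feats.items():
                    # L2 shrinkage is applied only to words present in this example.
                    wc[w] += lr * (err * n - l2 * wc[w])
    return weights


def classify(weights, text):
    """Return the most probable label and its probability (used as confidence)."""
    probs = predict_proba(weights, featurize(text))
    best = max(probs, key=probs.get)
    return best, probs[best]


if __name__ == "__main__":
    model = train_maxent(TOY_ABSTRACTS)
    label, confidence = classify(model, "kinase signaling cascade")
    print(label, round(confidence, 3))
```

In a setting like the one the abstract describes, the labels would be GO biological process codes and the predicted label for an abstract would be carried over to the genes associated with it; the probability returned by `classify` plays the role of the confidence measure.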