Enhancing navigation in biomedical databases by community voting and database-driven text classification
Open Access
- 3 October 2009
- journal article
- research article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 10 (1) , 317
- https://doi.org/10.1186/1471-2105-10-317
Abstract
The breadth of biological databases and their information content continue to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries, and to retrieve them efficiently.
Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best; no other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly.
Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at http://pepbank.mgh.harvard.edu.
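As an illustration only, the following minimal sketch shows how a bagged ensemble can yield a class probability estimate (the fraction of ensemble members voting for a class) and how noisy community votes can be simulated by flipping labels. It is not the PepBank implementation: single-word decision stumps stand in for full decision trees, the training snippets are invented toy data, and all function names (`train_stump`, `train_bagged`, `predict_proba`, `flip_labels`) are hypothetical.

```python
import random


def tokens_of(text):
    """Lowercased bag of words for one abstract (toy tokenizer)."""
    return set(text.lower().split())


def train_stump(samples):
    """Pick (word, polarity) that classifies the most samples correctly.

    The stump predicts True exactly when (word in tokens) == polarity.
    A stump is a one-split stand-in for a full decision tree.
    """
    vocab = sorted(set().union(*(tokens for tokens, _ in samples)))
    best, best_correct = None, -1
    for word in vocab:
        for polarity in (True, False):
            correct = sum(((word in tokens) == polarity) == label
                          for tokens, label in samples)
            if correct > best_correct:
                best, best_correct = (word, polarity), correct
    return best


def train_bagged(samples, n_models=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(samples, k=len(samples)))
            for _ in range(n_models)]


def predict_proba(ensemble, text):
    """Estimate P(class is True) as the share of stumps voting True."""
    tokens = tokens_of(text)
    votes = sum((word in tokens) == polarity for word, polarity in ensemble)
    return votes / len(ensemble)


def flip_labels(samples, noise, seed=0):
    """Simulate imperfect community votes: flip each label with prob. `noise`."""
    rng = random.Random(seed)
    return [(tokens, (not label) if rng.random() < noise else label)
            for tokens, label in samples]


# Invented toy data: True = related to cancer, False = not related.
train = [
    (tokens_of("tumor metastasis peptide"), True),
    (tokens_of("tumor oncology target"), True),
    (tokens_of("tumor carcinoma binding"), True),
    (tokens_of("imaging probe fluorescence"), False),
    (tokens_of("imaging tracer uptake"), False),
    (tokens_of("imaging contrast agent"), False),
]

ensemble = train_bagged(train)
print(predict_proba(ensemble, "tumor imaging probe"))  # mixed evidence

noisy_ensemble = train_bagged(flip_labels(train, noise=0.2))
print(predict_proba(noisy_ensemble, "tumor metastasis peptide"))
```

A probability obtained this way could, for instance, be mapped to a colour intensity in a heat map of search results; in a real system, a library implementation of bagged decision trees would replace the stumps used here.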