High-precision high-coverage functional inference from integrated data sources

Open Access

25 February 2008

journal article
Published by Springer Nature in BMC Bioinformatics

Vol. 9 (1), 119
https://doi.org/10.1186/1471-2105-9-119

Abstract

Information obtained from diverse data sources can be combined in a principled manner using various machine learning methods to increase the reliability and range of knowledge about protein function. The result is a weighted functional linkage network (FLN) in which linked neighbors share at least one function with high probability. The precision of annotations drawn from such networks is, however, low. Aiming to provide precise functional annotation for as many proteins as possible, we explore and propose a two-step framework for functional annotation: (1) construction of a high-coverage and reliable FLN via machine learning techniques, and (2) development of a decision rule for the constructed FLN to optimize functional annotation.
We first apply this framework to Saccharomyces cerevisiae. In the first step, we demonstrate that four commonly used machine learning methods (linear SVM, linear discriminant analysis, naïve Bayes, and neural network) all combine heterogeneous data to produce reliable and high-coverage FLNs, in which the linkage weight estimates the functional coupling of linked proteins more accurately than individual data sources used alone. In the second step, empirical tuning of an adjustable decision rule on the constructed FLN reveals that basing annotation on maximum edge weight results in the most precise annotation at high coverage. In particular, at low coverage all rules evaluated perform comparably; above approximately 50% coverage, however, they diverge rapidly. At full coverage, the maximum weight decision rule still has a precision of approximately 70%, whereas the precision of the other rules ranges from slightly more than 30% down to 3%. In addition, a scoring scheme to estimate the precision of individual predictions is provided. Finally, tests of the robustness of the framework indicate that it can be successfully applied to less studied organisms.
We provide a general two-step function-annotation framework and show that high-coverage, high-precision annotations can be achieved by constructing a high-coverage and reliable FLN via data integration and then applying a maximum weight decision rule.
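The maximum weight decision rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy network, the protein and function names, the edge weights, and the helper names `neighbors` and `annotate_max_weight` are all hypothetical, and the sketch assumes that the rule transfers the annotation of the single annotated neighbor joined by the heaviest edge.

```python
from typing import Dict, Iterator, Optional, Set, Tuple

Edge = Tuple[str, str]

# Hypothetical toy FLN. Each weight stands in for the estimated strength of
# functional coupling between two proteins; all values are invented.
fln: Dict[Edge, float] = {
    ("P1", "Q"): 0.90,
    ("P2", "Q"): 0.40,
    ("P3", "Q"): 0.35,
    ("P3", "R"): 0.20,
}

# Hypothetical known annotations for the already-characterized proteins.
known: Dict[str, Set[str]] = {
    "P1": {"DNA repair"},
    "P2": {"transport"},
    "P3": {"transport"},
}


def neighbors(net: Dict[Edge, float], protein: str) -> Iterator[Tuple[str, float]]:
    """Yield (neighbor, weight) for every edge that touches `protein`."""
    for (a, b), weight in net.items():
        if a == protein:
            yield b, weight
        elif b == protein:
            yield a, weight


def annotate_max_weight(
    net: Dict[Edge, float], annotations: Dict[str, Set[str]], protein: str
) -> Optional[Tuple[Set[str], float]]:
    """Transfer the functions of the annotated neighbor with the heaviest edge.

    Returns (functions, edge weight), or None when the protein has no
    annotated neighbor. Ties keep the first neighbor encountered.
    """
    best: Optional[Tuple[str, float]] = None
    for nbr, weight in neighbors(net, protein):
        if nbr in annotations and (best is None or weight > best[1]):
            best = (nbr, weight)
    if best is None:
        return None
    return annotations[best[0]], best[1]


if __name__ == "__main__":
    for query in ("Q", "R", "Z"):
        print(query, annotate_max_weight(fln, known, query))
```

In this toy network, protein Q has two weakly linked "transport" neighbors and one strongly linked "DNA repair" neighbor. A majority vote over neighbors would pick "transport", whereas the maximum weight rule follows the single strongest link. The sketch returns the edge weight alongside the annotation only as a rough indicator of confidence; the scoring scheme mentioned in the abstract is not specified there and is not reproduced here.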
