Accurate prediction of protein–protein interactions from sequence alignments using a Bayesian method

Abstract
Accurate and large‐scale prediction of protein–protein interactions directly from amino‐acid sequences is one of the great challenges in computational biology. Here we present a new Bayesian network method that predicts interaction partners using only multiple alignments of amino‐acid sequences of interacting protein domains, without tunable parameters, and without the need for any training examples. We first apply the method to bacterial two‐component systems and comprehensively reconstruct two‐component signaling networks across all sequenced bacteria. Comparisons of our predictions with known interactions show that our method infers interaction partners genome‐wide with high accuracy. To demonstrate the general applicability of our method we show that it also accurately predicts interaction partners in a recent dataset of polyketide synthases. Analysis of the predicted genome‐wide two‐component signaling networks shows that cognates (interacting kinase/regulator pairs, which lie adjacent on the genome) and orphans (which lie isolated) form two relatively independent components of the signaling network in each genome. In addition, while most genes are predicted to have only a small number of interaction partners, we find that 10% of orphans form a separate class of ‘hub’ nodes that distribute and integrate signals to and from up to tens of different interaction partners. ### Synopsis A method that comprehensively and accurately predicts protein–protein interactions using only the amino‐acid sequences of proteins would essentially allow the reconstruction of genome‐wide protein interaction networks directly from genome sequences. Automated prediction of protein–protein interactions from their amino‐acid sequences is therefore one of the great outstanding challenges in computational biology. Numerous approaches have already been proposed which, however, all suffer from serious drawbacks. Some methods only infer functional relationships and not direct physical interactions, while others are difficult to scale up to large data sets and most of them suffer from high false positive rates (see [Valencia and Pazos, 2002][1]; [Bork et al , 2004][2]; [Shoemaker and Panchenko, 2007][3] for reviews). The new method presented here operates on sets of protein families for which it is known that members from one family interact with members from one or more of the other families. Multiple alignments of the sequences in each family are constructed and the algorithm searches over all possible ways in which the proteins from different families can be paired up to form interacting pairs (see [Figure 1][4]). The best assignment of interacting pairs is roughly speaking the one that maximizes the statistical dependencies that are observed between amino acids of the interacting protein pairs. Although ‘correlated mutations’ have been used previously to infer protein–protein interactions ([Pazos and Valencia, 2002][5]), the current method presents several important methodological advances. A crucial ingredient of the method is that a rigorous Bayesian network framework is used to derive the probability of each possible assignment from first principles, that is, without any tunable parameters. Also, instead of attempting to infer which residues interact, the method sums over all possible ways in which a tree of dependencies can be assigned to pairs of residues both within and between the interacting proteins. Additionally, the model does not require any training sets, but predicts interactions ab initio by searching the space of all possible ways in which proteins can be paired up to form interaction partners. Finally, our model assigns interaction partners for all proteins from multiple genomes in parallel, thereby maximizing the algorithm's ability to detect subtle sequence dependencies, and uses a Markov chain Monte‐Carlo sampling technique to automatically obtain a measure of the reliability of each prediction. Application to two sets of interacting bacterial protein families are presented in the paper: two component systems (TCSs) and polyketide synthases (PKSs). TCSs are responsible for most of the signal transduction underlying complex bacterial behaviors ([Stock et al , 2000][6]). They typically consist of a membrane‐bound sensory protein with an intracellular histidine kinase domain that, upon activation, specifically phosphorylates a receiver domain on a regulator protein, which in turn is typically activated by this phosphorylation. The method was applied to comprehensively reconstruct two‐component signaling networks across all sequenced bacteria involving thousands of kinase and regulator proteins. Comparisons of the predictions with known interactions show that the method infers interaction partners genome‐wide with high accuracy. The second application, to a data‐set of PKSs ([Thattai et al , 2007][7]), demonstrates the generality of the method, and shows that it can also accurately predict interaction partners in much smaller data sets. Analysis of the predicted genome‐wide two‐component signaling networks reveals several interesting features. Two‐component system genes can be divided into two classes. Cognates are interacting kinase/regulator pairs that lie adjacent on the genome and are co‐transcribed, and orphans are kinases and regulators that lie isolated on the genome. The analysis shows that cognates and orphans form two relatively independent groups, with cognates interacting predominantly with cognates and orphans predominantly with orphans. Second, we find a difference in interaction propensity of kinases and regulators of the two groups. Whereas most kinases and regulators interact with only a few partners, about 10% interact with a large number of partners. Most of these ‘hub’ kinases and regulators...