A Semi-Supervised Method for Predicting Transcription Factor–Gene Interactions in Escherichia coli

Abstract
While Escherichia coli has one of the most comprehensive datasets of experimentally verified transcriptional regulatory interactions of any organism, it is still far from complete. This presents a problem when trying to combine gene expression and regulatory interactions to model transcriptional regulatory networks. Using the available regulatory interactions to predict new interactions may lead to better coverage and more accurate models. Here, we develop SEREND (SEmi-supervised REgulatory Network Discoverer), a semi-supervised learning method that uses a curated database of verified transcription factor–gene interactions, DNA sequence binding motifs, and a compendium of gene expression data to make thousands of new predictions about transcription factor–gene interactions, including whether the transcription factor activates or represses the gene. Using genome-wide binding datasets for several transcription factors, we demonstrate that our semi-supervised classification strategy improves the prediction of targets for a given transcription factor. To further demonstrate the utility of our inferred interactions, we generated a new microarray gene expression dataset for the aerobic-to-anaerobic shift response in E. coli. We used our inferred interactions together with the verified interactions to reconstruct a dynamic regulatory network for this response. The network reconstructed using our inferred interactions was better able to identify known regulators and suggested additional activators and repressors as having important roles during the aerobic–anaerobic shift interface.

The proper functioning of transcriptional gene regulation is essential for all living organisms. Several diseases are associated with loss of appropriate transcriptional regulation. Even in relatively simple organisms, such as the bacterium E. coli, response to environmental stress is a complex and highly regulated process. This process is controlled by a set of transcription factors that increase or decrease the expression levels of their target genes. However, identifying the set of targets regulated by each of these factors remains a challenge. Even after decades of experimental research on E. coli, only a quarter of all gene products have a known regulator. Here, we develop a method that extends the known set of regulator–target relationships with additional predictions. Our method uses the DNA sequence control code together with expression levels measured across a variety of conditions, both for known targets and for genes whose status as targets of a specific regulator is unknown. We show that our method identifies true targets of known regulators more accurately than previous methods suggested for this task. We then applied our predictions to identify active regulators involved in the dynamic response that occurs in E. coli when it is deprived of oxygen.
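
The text above does not spell out the learning algorithm, so the following is only a minimal sketch of the general idea of semi-supervised target prediction for a single transcription factor. It assumes a self-training logistic regression, which is a swapped-in illustration and not necessarily the procedure SEREND uses. The two features (`motif_score`, `expr_corr`), the gene names, the toy values, and the function names are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, epochs=500, lr=0.5):
    """Fit logistic regression to soft labels y in [0, 1] by batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, X):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def self_train(X, known, rounds=5):
    """known[i] is True for verified targets; every other gene is unlabeled.

    Unlabeled genes start with label 0. After each fit, their labels are
    replaced by the model's predicted probabilities, while verified targets
    stay fixed at 1. This is an assumed scheme for illustration only.
    """
    y = [1.0 if k else 0.0 for k in known]
    scores = list(y)
    for _ in range(rounds):
        w, b = fit_logistic(X, y)
        scores = predict(w, b, X)
        y = [1.0 if k else s for k, s in zip(known, scores)]
    return scores

# Hypothetical toy data: (gene, motif_score, expr_corr, verified_target)
genes = [
    ("geneA", 0.90, 0.80, True),
    ("geneB", 0.80, 0.90, True),
    ("geneC", 0.85, 0.75, False),  # unlabeled, resembles the verified targets
    ("geneD", 0.10, 0.20, False),
    ("geneE", 0.20, 0.10, False),
    ("geneF", 0.15, 0.05, False),
    ("geneG", 0.05, 0.10, False),
]
X = [[m, e] for _, m, e, _ in genes]
known = [k for _, _, _, k in genes]
scores = self_train(X, known)

for (name, _, _, k), s in zip(genes, scores):
    print(f"{name}\tverified={k}\tscore={s:.3f}")
```

In this toy setting the unlabeled gene that resembles the verified targets ends up ranked above the other unlabeled genes, which is the kind of behaviour a semi-supervised classifier is meant to exploit. A real application would need many more genes, properly derived motif and expression features, and evaluation against independent binding data.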