Predicting Gene Expression from Sequence: A Reexamination

Abstract
Although much of the information regarding genes' expressions is encoded in the genome, deciphering such information has been very challenging. We reexamined Beer and Tavazoie's (BT) approach to predict mRNA expression patterns of 2,587 genes in Saccharomyces cerevisiae from the information in their respective promoter sequences. Instead of fitting complex Bayesian network models, we trained naïve Bayes classifiers using only the sequence-motif matching scores provided by BT. Our simple models correctly predict expression patterns for 79% of the genes, based on the same criterion and the same cross-validation (CV) procedure as BT, which compares favorably to the 73% accuracy of BT. The fact that our approach did not use position and orientation information of the predicted binding sites but achieved a higher prediction accuracy, motivated us to investigate a few biological predictions made by BT. We found that some of their predictions, especially those related to motif orientations and positions, are at best circumstantial. For example, the combinatorial rules suggested by BT for the PAC and RRPE motifs are not unique to the cluster of genes from which the predictive model was inferred, and there are simpler rules that are statistically more significant than BT's ones. We also show that CV procedure used by BT to estimate their method's prediction accuracy is inappropriate and may have overestimated the prediction accuracy by about 10%. Through binding to certain sequence-specific sites upstream of the target genes, a special class of proteins called transcription factors (TFs) control transcription activities, i.e., expression amounts, of the downstream genes. The DNA sequence patterns bound by TFs are called motifs. It has been shown in an article by Beer and Tavazoie (BT) published in Cell in 2004 that a gene's expression pattern can be well-predicted based only on its upstream sequence information in the form of matching scores of a set of sequence motifs and the location and orientation of corresponding predicted binding sites. Here we report a new naïve Bayes method for such a prediction task. Compared to BT's work, our model is simpler, more robust, and achieves a higher prediction accuracy using only the motif matching score. In our method, the location and orientation information do not further help the prediction in a global way. Our result also casts doubt on several biological hypotheses generated by BT based on their model. Finally, we show that the cross-validation procedure used by BT to estimate their method's prediction accuracy is inappropriate and may have overestimated the accuracy by about 10%.