A memetic co-clustering algorithm for gene expression profiles and biological annotation

Abstract
With the invention of microarrays, researchers are capable of measuring thousands of gene expression levels in parallel at various time points of the biological process. To investigate general regulatory mechanisms, biologists cluster genes based on their expression patterns. In this paper, we propose a new memetic co-clustering algorithm for expression profiles, which incorporates a priori knowledge in the form of Gene Ontology information. Ontologies offer a mechanism to capture knowledge in a shareable form that is also processable by computers. The use of this additional annotation information promises to improve biological data analysis and simplifies the identification of processes that are relevant under the measured conditions.

I. INTRODUCTION
In the past few years, DNA microarrays have become one of the major tools in the field of gene expression analysis. In contrast to traditional methods, this technology enables the monitoring of expression levels of thousands of genes in parallel (36). Thus, microarrays are a powerful tool helping to understand the underlying regulatory mechanisms of a cell. A problem inherent in the use of DNA arrays is the tremendous amount of data produced, whose analysis itself constitutes a challenge. Several approaches have been applied to analyze microarray data including principal component analysis (35) as well as supervised (12) and unsupervised learning (10), (32), (33). In unsupervised learning, clustering techniques are utilized to extract the gene expression patterns inherent in the data and thus find potentially co-regulated genes. Various methods have been applied, such as self-organizing maps (SOMs) (32), k-means (33) and hierarchical clustering (10). Evolutionary approaches have also been applied to gene expression data and were shown to be superior to classical clustering algorithms (23), (30).
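As a concrete illustration of the unsupervised clustering mentioned above, the following minimal sketch applies plain k-means (Lloyd's algorithm) to a handful of expression profiles. It is not the memetic algorithm proposed in this paper. The function name `kmeans`, the deterministic initialization from the first k profiles, the Euclidean distance, and the toy data are all assumptions made for illustration only.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(profiles, k, iterations=20):
    """Cluster expression profiles with plain k-means (Lloyd's algorithm).

    profiles: list of equal-length lists, one expression time course per gene.
    Returns a list holding one cluster index per gene.
    """
    # Assumption: the first k profiles serve as initial centroids, which keeps
    # the sketch deterministic; real analyses usually use random restarts.
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each gene joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: four genes measured at three time points.
toy_profiles = [
    [0.1, 1.0, 2.1],  # rising expression
    [2.0, 1.1, 0.0],  # falling expression
    [0.0, 0.9, 2.0],  # rising expression
    [2.2, 1.0, 0.1],  # falling expression
]
toy_labels = kmeans(toy_profiles, k=2)
```

In this toy case the two rising profiles end up in one cluster and the two falling profiles in the other. The grouping uses the expression values alone, with no annotation information.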
Although the results of all these approaches are useful, one basic problem remains: none of these methods incorporates known biological information. Biologists are therefore still forced to analyze their data sequentially, first clustering the expression data alone and afterwards annotating the genes of each cluster by hand, only then incorporating biological information into their models. Such an approach is slow and exhausting, and it may also result in a suboptimal clustering, since information from other resources could often help to resolve ambiguities or to avoid errors caused by linkages based on noisy data or spurious similarities. One major problem of pure clustering methods is that cluster boundaries are often close and may also be arbitrary to some degree. Our work is based on the expectation that the use of the available biological knowledge is essential for the development of powerful automatic methods for the analysis of gene expression data. To our knowledge, there are only a few published attempts that make use of additional biological information for the interpretation of gene expression profiles.