Large-Scale Clustering of cDNA-Fingerprinting Data
Open Access
- 1 November 1999
- journal article
- research article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 9 (11) , 1093-1105
- https://doi.org/10.1101/gr.9.11.1093
Abstract
Clustering is one of the main mathematical challenges in large-scale gene expression analysis. We describe a clustering procedure based on a sequential k-means algorithm with additional refinements that is able to handle high-throughput data on the order of hundreds of thousands of data items measured on hundreds of variables. The practical motivation for our algorithm is oligonucleotide fingerprinting, a method for simultaneous determination of expression level for every active gene of a specific tissue, although the algorithm can also be applied to other large-scale projects such as EST clustering and qualitative clustering of DNA-chip data. As a pairwise similarity measure between two p-dimensional data points, x and y, we introduce mutual information, which can be interpreted as the amount of information about x in y, and vice versa. We show that for our purposes this measure is superior to commonly used metric distances, for example, Euclidean distance. We also introduce a modified version of mutual information as a novel method for validating clustering results when the true clustering is known. The performance of our algorithm with respect to experimental noise is shown by extensive simulation studies. The algorithm is tested on a subset of 2029 cDNA clones coming from 15 different genes from a cDNA library derived from human dendritic cells. Furthermore, the clustering of these 2029 cDNA clones is demonstrated when the entire set of 76,032 cDNA clones is processed.
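To illustrate the similarity measure named in the abstract, the following is a minimal sketch of empirical mutual information between two fingerprints. It assumes each fingerprint has already been discretized (for example, hybridization signals binarized to 0/1), and it estimates probabilities by counting value pairs across the p positions. The function name `mutual_information` and the discretization are assumptions made for illustration; the abstract does not give the paper's exact definition, so this may differ from the authors' formulation.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Empirical mutual information (in bits) between two equal-length
    discrete vectors. Hypothetical helper, not the paper's exact measure."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    p = len(x)
    joint = Counter(zip(x, y))  # counts of each (x_i, y_i) pair
    count_x = Counter(x)        # marginal counts for x
    count_y = Counter(y)        # marginal counts for y
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / p) * math.log2(n_ab * p / (count_x[a] * count_y[b]))
    return mi


# Two identical binary fingerprints share all of their information;
# two fingerprints whose values vary independently share none.
same = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Unlike Euclidean distance, this quantity depends only on how consistently the values of one vector predict the values of the other, not on their magnitudes, and it is symmetric in x and y.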