Comparison of the current RefSeq, Ensembl and EST databases for counting genes and gene discovery

Abstract
Large amounts of refined sequence material in the form of predicted, curated and annotated genes and expressed sequence tags (ESTs) have recently been added to the NCBI databases. We matched the transcript sequences of RefSeq, Ensembl and dbEST in an attempt to provide an updated overview of how many unique human genes can be found. The results indicate that there are about 25 000 unique genes in the union of RefSeq and Ensembl, with 12–18% and 8–13% of the genes unique to each set, respectively. About 20% of all genes had splice variants. A considerable number of ESTs (2 200 000) do not match the identified genes, and we used an in-house pipeline to identify 22 novel genes from Genscan predictions that have considerable EST coverage. The study provides an insight into the current status of human gene catalogues and shows that considerable refinement of methods and datasets is needed to come to a conclusive gene count.

Keywords
Expressed sequence tag; RefSeq; Ensembl; Databases; Genscan

1 Introduction
One of the most intriguing questions in human biology is the number and identity of the human genes. Three years after the release of the genomic sequence [1,2], there are still large uncertainties about the exact number of human genes. The number of known and hypothetical genes in current databases such as RefSeq and Ensembl is continuously growing but is still less than the predicted number of genes [1–8], indicating that these datasets are incomplete. Moreover, there are significant differences between these datasets. Identification and annotation of human genes is likely to remain an important issue within the field of genetic research for several years to come.

Historically, the highest estimates for the human gene count have been based on clustering of expressed sequence tags (ESTs), and many uncertainties with such methods have been pointed out.
Fields and colleagues used a method based on similarity between known cDNA sequences and ESTs to calculate an approximate number of human genes [3]. They estimated that if clustered at high stringency, the resulting EST clusters would represent about 50% of all genes, which from their ~35 000 clusters would correspond to about 60 000–70 000 genes. Davidson and Burke [4] used a similar method based on clustering human EST sequences into transcription units to reach a number of about 70 000 genes, with between 1.2 and 1.5 different transcripts per gene. Both of these methods suffered from the fact that normalization of cDNA libraries may enrich for contaminants and aberrant clones. The problem of genomic contamination has been estimated to affect 5–8% of all ESTs [9,10]. Moreover, the former method included singleton clusters, which may also cause overestimation of the total number of genes.

Another high number was reached by extrapolating from the correlation between human genes and genomic CpG islands. It was estimated that half of the human genes were accompanied by a neighboring region of above-average CpG dinucleotide content. The number of CpG islands was calculated to be around 40 000, which led to an estimate of 80 000 human genes [5]. This early estimate assumed that each CpG island corresponds to a unique gene, while a later analysis of chromosome 22 showed that this is only true for about 60% of the islands. The public sequencing project reported the existence of only about 29 000 islands in non-repeatmasked areas in the analysis of the finished genomic sequence. With these data, the estimate of this method would be lowered to about 35 000 human genes, a number not far from the currently popular figure derived from both sequencing projects [1,2].

Predictions made by ab initio programs such as Genscan [11] have proven to be a rich source of novel genes. In 2001, the false positive rate for Genscan was estimated on chromosome 22 using microarray analysis.
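As a rough check on the EST-based and CpG-based estimates discussed above, the arithmetic can be sketched as follows. The function names are hypothetical, and the formula used for the revised CpG figure (islands × fraction of islands that mark a unique gene ÷ fraction of genes with an island) is an assumption about how the value of about 35 000 arises, not a calculation stated in the cited studies.

```python
# Back-of-envelope gene-count estimates using the figures quoted in the text.
# Function names and the exact form of the CpG calculation are illustrative
# assumptions, not taken from the cited studies.

def genes_from_est_clusters(n_clusters, fraction_of_genes_represented):
    """Scale the cluster count up by the fraction of genes the clusters cover."""
    return round(n_clusters / fraction_of_genes_represented)


def genes_from_cpg_islands(n_islands, fraction_genes_with_island,
                           fraction_islands_unique=1.0):
    """Islands that mark a unique gene, scaled up to genes without an island."""
    genes_with_island = n_islands * fraction_islands_unique
    return round(genes_with_island / fraction_genes_with_island)


# ~35 000 EST clusters assumed to represent about 50% of all genes
est_estimate = genes_from_est_clusters(35_000, 0.5)

# Early CpG estimate: ~40 000 islands, one gene per island, half of genes marked
early_cpg_estimate = genes_from_cpg_islands(40_000, 0.5)

# Revised CpG estimate: ~29 000 islands, about 60% correspond to a unique gene
revised_cpg_estimate = genes_from_cpg_islands(29_000, 0.5, 0.6)

print(est_estimate, early_cpg_estimate, revised_cpg_estimate)
```

With the quoted inputs, the EST calculation gives the upper end of the 60 000–70 000 range, and the revised CpG calculation gives 34 800, in line with the figure of about 35 000.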
This study concluded a false positive fraction of 17%, compared to the original estimate by the chromosome 22 sequencing group of 27.5% [12,13]. Under this estimate, and assuming that the rate is constant across the other chromosomes, 7368 genes (17% of the whole Genscan set of about 43 000 genes) would be false predictions. The remaining set of about 35 600 human predictions is significantly larger than any known gene set (RefSeq, Ensembl, SWISS-PROT, TrEMBL), of which the Ensembl set is the largest, currently containing 29 802 human transcripts.

Evidence of actual transcription of predicted genes can be obtained by identifying mRNA and EST sequences matching the predictions. The growth of dbEST from 2 million sequences in 2001 to over 5 million today means that the likelihood of finding transcript evidence for novel predictions is better than ever. Since 2002, the University of California Santa Cruz (UCSC) has provided data on genomic alignments for all human ESTs, providing a means to match ESTs with gene predictions more specifically, based on genomic location rather than sequence similarity. Cross-matching ab initio predictions such as Genscan predictions with ESTs thus remains an important method to verify gene predictions and identify new genes.

Creation of a complete set of known and hypothetical genes requires merging of existing datasets into a non-redundant set of genes. Such merging is often based on a sequence similarity threshold [14,15] that is prone to misjudgements, since it is difficult to find a threshold that always gives the correct result. Another method of merging sets of transcripts is to align all sequences to the genome and determine which sequences from each set are not overlapped by any sequence from the other set. However, for transcript sequences that are not defined directly from the genomic sequence, it can be a problem to find good quality genomic alignments.
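The genome-based merging approach described above can be illustrated with a minimal sketch. It assumes that each transcript has already been reduced to a single genomic span with 0-based, half-open coordinates, and that two transcripts "overlap" when their spans intersect on the same chromosome and strand. The `Transcript` record, the function names and the example coordinates are all hypothetical; an actual pipeline might instead require exon-level overlap or a minimum overlap length.

```python
from bisect import bisect_left
from collections import defaultdict, namedtuple

# Hypothetical record: one genomic span per transcript, half-open [start, end)
Transcript = namedtuple("Transcript", "name chrom strand start end")


def build_index(transcripts):
    """Per (chrom, strand): sorted start positions and running maximum of ends."""
    grouped = defaultdict(list)
    for t in transcripts:
        grouped[(t.chrom, t.strand)].append((t.start, t.end))
    index = {}
    for key, spans in grouped.items():
        spans.sort()
        starts = [start for start, _ in spans]
        prefix_max_end = []
        running = -1
        for _, end in spans:
            running = max(running, end)
            prefix_max_end.append(running)
        index[key] = (starts, prefix_max_end)
    return index


def is_overlapped(t, index):
    """True if any indexed span on the same chrom/strand intersects t."""
    entry = index.get((t.chrom, t.strand))
    if entry is None:
        return False
    starts, prefix_max_end = entry
    n = bisect_left(starts, t.end)  # number of spans that start before t ends
    return n > 0 and prefix_max_end[n - 1] > t.start


def unique_to_first(set_a, set_b):
    """Names of transcripts in set_a not overlapped by any transcript in set_b."""
    index = build_index(set_b)
    return [t.name for t in set_a if not is_overlapped(t, index)]


# Made-up coordinates for illustration only
refseq = [
    Transcript("R1", "chr1", "+", 100, 500),
    Transcript("R2", "chr1", "+", 1000, 1500),
    Transcript("R3", "chr2", "-", 200, 400),
]
ensembl = [
    Transcript("E1", "chr1", "+", 450, 900),
    Transcript("E2", "chr1", "-", 1000, 1500),
    Transcript("E3", "chr2", "-", 400, 800),
]

print(unique_to_first(refseq, ensembl))  # transcripts found only in the first set
print(unique_to_first(ensembl, refseq))  # transcripts found only in the second set
```

Running the comparison in both directions gives the transcripts unique to each set. In this toy data, R1 and E1 overlap and would be merged, whereas spans that merely touch (R3 and E3) or lie on opposite strands (R2 and E2) are treated as distinct.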
For example, the UCSC table refGene, which contains positions for RefSeq transcripts as...