A clean data set of EST-confirmed splice sites from Homo sapiens and standards for clean-up procedures
Open Access
- 1 July 1999
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 27 (13), 2627-2637
- https://doi.org/10.1093/nar/27.13.2627
Abstract
A clean data set of verified splice sites from Homo sapiens is reported, together with the standards used for the clean-up procedure. The sites were validated by: (i) standard cleaning procedures such as requiring consistency in the annotation of the gene structural elements, completeness of the coding regions and elimination of redundant sequences; (ii) clustering by decision trees coupled with analysis of ClustalW alignments of the translated protein sequence with homologous proteins from SWISS-PROT; (iii) matching against human EST sequences. The sites are categorised as: (i) donor sites, a set of 619 EST-confirmed donor sites, of which 138 are sites, or have surrounding regions, involved in alternative splice events; (ii) acceptor sites, a set of 623 EST-confirmed acceptor sites, of which 144 are sites, or have surrounding regions, involved in alternative splice events; (iii) genuine splice sites, a set of 392 splice sites wherein both the donor and acceptor sites had EST confirmation and were not involved in any alternative splicing; (iv) alternative splice sites, a set of 209 splice sites wherein both the donor and acceptor sites had EST confirmation and the sites or the regions around them were involved in alternative splicing. A set of nucleotide regions that can be used to generate a control set of false splice sites with a high confidence of being nonfunctional is also reported.
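The distinction between categories (iii) and (iv) can be illustrated with a minimal Python sketch. It is not the authors' procedure: the `SpliceSitePair` record, its field names and the `categorise` function are hypothetical, and the sketch assumes that involvement of either the donor or the acceptor (or the region around it) in alternative splicing is enough to place a pair in the alternative set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpliceSitePair:
    """Hypothetical record for one donor/acceptor pair (names are assumptions)."""
    donor_est_confirmed: bool
    acceptor_est_confirmed: bool
    donor_alternative: bool = False      # donor site or its region in an alternative splice event
    acceptor_alternative: bool = False   # acceptor site or its region in an alternative splice event


def categorise(pair: SpliceSitePair) -> Optional[str]:
    """Return 'genuine', 'alternative', or None when either site lacks EST confirmation."""
    # Both categories require EST confirmation of the donor AND the acceptor.
    if not (pair.donor_est_confirmed and pair.acceptor_est_confirmed):
        return None
    # Assumption: involvement of either site is sufficient for 'alternative'.
    if pair.donor_alternative or pair.acceptor_alternative:
        return "alternative"
    return "genuine"
```

Under these assumptions, a pair with only one EST-confirmed site falls outside both sets, although the confirmed site could still count towards the separate donor or acceptor sets.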