A semiautomated approach to gene discovery through expressed sequence tag data mining: Discovery of new human transporter genes
- 1 March 2003
- journal article
- research article
- Published by Springer Nature in AAPS PharmSci
- Vol. 5 (1), 1-18
- https://doi.org/10.1208/ps050101
Abstract
Identification and functional characterization of the genes in the human genome remain a major challenge. A principal source of publicly available information used for this purpose is the National Center for Biotechnology Information database of expressed sequence tags (dbEST), which contains over 4 million human ESTs. To extract the information buried in this data more effectively, we have developed a semiautomated method to mine dbEST for uncharacterized human genes. Starting with a single protein input sequence, a family of related proteins from all species is compiled. This entire family is then used to mine the human EST database for new gene candidates. Evaluation of putative new gene candidates in the context of a family of characterized proteins provides a framework for inference of the structure and function of the new genes. When applied to a test data set of 28 families within the major facilitator superfamily (MFS) of membrane transporters, our protocol found 73 previously characterized human MFS genes and 43 new MFS gene candidates. Development of this approach provided insights into the problems and pitfalls of automated data mining using public databases.
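The three steps named in the abstract (compile a protein family from a single seed, search the human ESTs with the whole family, then separate already characterized genes from new candidates) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: shared k-mer overlap stands in for the sequence similarity searches a real pipeline would run, only the three forward reading frames are translated, and every function name, parameter, and threshold here is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame):
    """Translate one forward reading frame (0, 1, or 2); unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON.get(c, "X") for c in codons)


def kmers(seq, k=3):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Toy stand-in for an alignment score: shared k-mers over the smaller k-mer set."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def build_family(seed_name, seed_seq, proteins, threshold=0.5):
    """Step 1: grow a family from one seed by repeatedly adding similar proteins."""
    family = {seed_name: seed_seq}
    changed = True
    while changed:
        changed = False
        for name, seq in proteins.items():
            if name in family:
                continue
            if any(similarity(seq, member) >= threshold for member in family.values()):
                family[name] = seq
                changed = True
    return family


def mine_ests(family, ests, known_human, hit_threshold=0.5, known_threshold=0.9):
    """Steps 2 and 3: find ESTs matching any family member, then split the hits
    into those that match a characterized human protein and new candidates."""
    known, candidates = [], []
    for est_id, dna in ests.items():
        frames = [translate(dna, f) for f in range(3)]
        is_hit = any(similarity(p, m) >= hit_threshold
                     for p in frames for m in family.values())
        if not is_hit:
            continue
        is_known = any(similarity(p, h) >= known_threshold
                       for p in frames for h in known_human.values())
        (known if is_known else candidates).append(est_id)
    return known, candidates
```

Searching with every family member rather than the seed alone is the point of the design: an EST that resembles only a distant relative of the seed can still be recovered. In practice the candidate list would still need manual review, which is what makes the approach semiautomated.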