Locating Protein Coding Regions in Human DNA Using a Decision Tree Algorithm

1 January 1995

journal article
research article
Published by Mary Ann Liebert Inc in Journal of Computational Biology

Vol. 2 (3), 473-485
https://doi.org/10.1089/cmb.1995.2.473

Abstract

Genes in eukaryotic DNA cover hundreds or thousands of base pairs, while the regions of those genes that code for proteins may occupy only a small percentage of the sequence. Identifying the coding regions is of vital importance in understanding these genes. Many recent research efforts have studied computational methods for distinguishing between coding and noncoding regions, and several promising results have been reported. We describe here a new approach, using a machine learning system that builds decision trees from the data. This approach combines several coding measures to produce classifiers with consistently higher accuracies than previous methods, on DNA sequences ranging from 54 to 162 base pairs in length. The algorithm is very efficient, and it can easily be adapted to different sequence lengths. Our conclusion is that decision trees are a highly effective tool for identifying protein coding regions.
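
The idea of combining several coding measures in a decision tree can be illustrated with a minimal sketch. It is not the system described in the abstract: the two measures (`min_frame_stops` and a simplified `position_asymmetry` score), the Gini-based axis-parallel tree in `build_tree`, and the hand-made 54-base toy sequences are all assumptions chosen for illustration only.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def min_frame_stops(seq):
    """Fewest in-frame stop codons over the three forward reading frames."""
    return min(
        sum(1 for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS)
        for frame in range(3)
    )

def position_asymmetry(seq):
    """Simplified asymmetry score (an assumption, not the paper's definition):
    for each base, the squared spread of its frequency across the three
    codon positions, summed over bases. Requires len(seq) >= 3."""
    n_codons = len(seq) // 3
    score = 0.0
    for base in "ACGT":
        freqs = [
            sum(1 for i in range(pos, 3 * n_codons, 3) if seq[i] == base) / n_codons
            for pos in range(3)
        ]
        mean = sum(freqs) / 3
        score += sum((f - mean) ** 2 for f in freqs)
    return score

def features(seq):
    """Feature vector that combines both measures for one window."""
    return [float(min_frame_stops(seq)), position_asymmetry(seq)]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow an axis-parallel tree; leaves are class labels, internal nodes
    are (feature_index, threshold, left_subtree, right_subtree)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    best = None
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # skip degenerate splits
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return majority
    _, f, t = best
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    )

def classify(tree, row):
    """Follow the splits until a leaf label (1 = coding, 0 = noncoding)."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical toy windows of 54 bases (not real genomic data).
coding = ["GCTGAAAAG" * 6, "ATGGCCGAC" * 6, "GGCAAGGAT" * 6]
noncoding = [(u * 5)[:54] for u in ("TAGCTAACTGAC", "TGACTAACTAGC", "TAACTGACTAGC")]

rows = [features(s) for s in coding + noncoding]
labels = [1] * len(coding) + [0] * len(noncoding)
tree = build_tree(rows, labels)

print(tree)                                               # (0, 2.0, 1, 0)
print(classify(tree, features("ATGAAAGCC" * 6)))          # 1
print(classify(tree, features(("TAGCTGACTAAC" * 5)[:54])))  # 0
```

On this toy data a single split on the stop-codon measure already separates the two classes, so the asymmetry score goes unused. With real sequences the tree would be expected to draw on several measures at different depths, and the window length can be changed simply by passing longer or shorter sequences to `features`.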
