Distinguishing Protein-Coding from Non-Coding RNAs through Support Vector Machines

Open Access

28 April 2006

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Genetics

Vol. 2 (4), e29-536
https://doi.org/10.1371/journal.pgen.0020029

Abstract

RIKEN's FANTOM project has revealed many previously unknown coding sequences, as well as an unexpected degree of variation in transcripts resulting from alternative promoter usage and splicing. More generally, transcriptome studies have identified ever more transcripts that do not code for proteins. Increasing evidence points to the important cellular roles of such non-coding RNAs (ncRNAs). The distinction of protein-coding RNA transcripts from ncRNA transcripts is therefore an important problem in understanding the transcriptome and carrying out its annotation. Very few in silico methods have specifically addressed this problem. Here, we introduce CONC (for “coding or non-coding”), a novel method based on support vector machines that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologs from database searches, and alignment entropy. Nucleotide frequencies are also incorporated into the method. Confirmed coding cDNAs for eukaryotic proteins from the Swiss-Prot database constituted the set of true positives, ncRNAs from RNAdb and NONCODE the true negatives. Ten-fold cross-validation suggested that CONC distinguished coding RNAs from ncRNAs at about 97% specificity and 98% sensitivity. Applied to 102,801 mouse cDNAs from the FANTOM3 dataset, our method reliably identified over 14,000 ncRNAs and estimated the total number of ncRNAs to be about 28,000.

There are two types of RNA: messenger RNAs (mRNAs), which are translated into proteins, and non-coding RNAs (ncRNAs), which function as RNA molecules. Besides textbook examples such as tRNAs and rRNAs, non-coding RNAs have been found to carry out very diverse functions, from mRNA splicing and RNA modification to translational regulation. It has been estimated that non-coding RNAs make up the vast majority of the transcription output of higher eukaryotes. Discriminating mRNA from ncRNA has therefore become an important biological and computational problem. The authors describe a computational method based on a machine learning algorithm known as a support vector machine (SVM) that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, secondary structure content, and protein alignment information. The method is applied to the dataset from the FANTOM3 large-scale mouse cDNA sequencing project; it identifies over 14,000 ncRNAs in mouse and estimates the total number of ncRNAs in the FANTOM3 data to be about 28,000.
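
Three of the features named above (peptide length, amino acid composition, and compositional entropy) can be computed directly from a putative peptide sequence. The following is a minimal sketch under stated assumptions, not the CONC implementation: the function name `peptide_features` is hypothetical, compositional entropy is assumed here to mean the Shannon entropy (in bits) of the amino acid frequencies, and the remaining features (predicted structure, exposed residues, homolog counts, alignment entropy) need external tools and are left out.

```python
import math
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def peptide_features(peptide: str) -> dict:
    """Compute simple sequence-derived features for a putative peptide.

    Hypothetical helper for illustration only; symbols outside the
    20 standard amino acids (e.g. 'X' or '*') are ignored.
    """
    counts = Counter(aa for aa in peptide.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("peptide contains no standard amino acids")

    # Fraction of the peptide made up by each amino acid.
    composition = {aa: counts[aa] / total for aa in AMINO_ACIDS}

    # Shannon entropy of the composition (assumed definition, in bits).
    entropy = -sum(p * math.log2(p) for p in composition.values() if p > 0)

    return {"length": total, "composition": composition, "entropy": entropy}
```

A feature vector for a classifier could then be assembled as the length, the 20 composition values, and the entropy; how the original method scales or combines these values is not stated here.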
