SPA: A Probabilistic Algorithm for Spliced Alignment

Open Access

28 April 2006

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Genetics

Vol. 2 (4) , e24
https://doi.org/10.1371/journal.pgen.0020024

Abstract

Recent large-scale cDNA sequencing efforts show that elaborate patterns of splice variation are responsible for much of the proteome diversity in higher eukaryotes. To obtain an accurate account of the repertoire of splice variants, and to gain insight into the mechanisms of alternative splicing, it is essential that cDNAs are very accurately mapped to their respective genomes. Currently available algorithms for cDNA-to-genome alignment do not reach the necessary level of accuracy because they use ad hoc scoring models that cannot correctly trade off the likelihoods of various sequencing errors against the probabilities of different gene structures. Here we develop a Bayesian probabilistic approach to cDNA-to-genome alignment. Gene structures are assigned prior probabilities based on the lengths of their introns and exons, and based on the sequences at their splice boundaries. A likelihood model for sequencing errors takes into account the rates at which misincorporation, as well as insertions and deletions of different lengths, occurs during sequencing. The parameters of both the prior and likelihood model can be automatically estimated from a set of cDNAs, thus enabling our method to adapt itself to different organisms and experimental procedures. We implemented our method in a fast cDNA-to-genome alignment program, SPA, and applied it to the FANTOM3 dataset of over 100,000 full-length mouse cDNAs and a dataset of over 20,000 full-length human cDNAs. Comparison with the results of four other mapping programs shows that SPA produces alignments of significantly higher quality. In particular, the quality of the SPA alignments near splice boundaries and SPA's mapping of the 5′ and 3′ ends of the cDNAs are highly improved, allowing for more accurate identification of transcript starts and ends, and accurate identification of subtle splice variations. Finally, our splice boundary analysis on the human dataset suggests the existence of a novel non-canonical splice site that we also find in the mouse dataset. The SPA software package is available at http://www.biozentrum.unibas.ch/personal/nimwegen/cgi-bin/spa.cgi.

A prerequisite for the identification and analysis of splice variation in the transcriptomes of higher eukaryotes is the very accurate mapping of cDNAs to their genomes. However, current algorithms use ad hoc scoring schemes that cannot correctly trade off the likelihoods of different sequencing errors against the likelihoods of different gene structures. In this paper the authors develop a Bayesian probabilistic approach to cDNA-to-genome mapping that combines explicit models for the prior probabilities of different gene structures with the likelihoods of different sequencing errors. The parameters of these probabilistic models can be estimated automatically from the input such that the mapping procedure is automatically adapted to the organism and sequencing technology of the data under study. The authors implement their approach in a fast mapping algorithm called SPA and apply it to a dataset of human full-length cDNAs and the FANTOM3 dataset of mouse full-length cDNAs. Comparisons with four other mapping algorithms show that SPA produces mappings that are significantly more accurate, with the largest improvements in the mappings of the 5′ and 3′ ends of the cDNAs, and the mappings around splice boundaries. The authors also identify a novel set of putative splice sites in the human dataset.
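The trade-off described above, between the prior probability of a gene structure and the likelihood of the sequencing errors it implies, can be illustrated with a small sketch. This is not the SPA implementation. The `Candidate` class, the geometric intron-length prior, the affine-style indel model, and every numeric parameter below are assumptions chosen only for illustration. The abstract states that the real parameters are estimated from the cDNA data. The sketch scores each candidate alignment by log prior plus log likelihood and keeps the highest-scoring one.

```python
import math
from dataclasses import dataclass, field

# All parameter values are illustrative assumptions, not values from the paper.
SPLICE_PRIOR = {"GT-AG": 0.98, "GC-AG": 0.015, "AT-AC": 0.003}
OTHER_SPLICE_PRIOR = 0.002   # prior mass for any other splice boundary
MEAN_INTRON_LENGTH = 3000.0  # mean of the assumed geometric length prior
P_MISMATCH = 0.01            # per-base misincorporation rate
P_INDEL_OPEN = 0.0005        # probability of starting an insertion/deletion
P_INDEL_EXTEND = 0.3         # probability of extending it by one base


@dataclass
class Candidate:
    """Hypothetical summary of one candidate cDNA-to-genome alignment."""
    name: str
    intron_lengths: list = field(default_factory=list)  # bp, each >= 1
    splice_sites: list = field(default_factory=list)    # e.g. "GT-AG"
    mismatches: int = 0
    indel_lengths: list = field(default_factory=list)   # bp, each >= 1
    aligned_bases: int = 0


def log_prior(c: Candidate) -> float:
    """Log prior of the gene structure: splice boundaries and intron lengths."""
    q = 1.0 / MEAN_INTRON_LENGTH
    total = 0.0
    for length, site in zip(c.intron_lengths, c.splice_sites):
        total += math.log(SPLICE_PRIOR.get(site, OTHER_SPLICE_PRIOR))
        # Geometric length prior: P(length) = (1 - q)^(length - 1) * q
        total += (length - 1) * math.log(1.0 - q) + math.log(q)
    return total


def log_likelihood(c: Candidate) -> float:
    """Log likelihood of the sequencing errors implied by the alignment."""
    matches = c.aligned_bases - c.mismatches
    total = matches * math.log(1.0 - P_MISMATCH)
    total += c.mismatches * math.log(P_MISMATCH)
    for n in c.indel_lengths:
        total += math.log(P_INDEL_OPEN) + (n - 1) * math.log(P_INDEL_EXTEND)
    return total


def log_posterior(c: Candidate) -> float:
    """Unnormalized log posterior: log prior + log likelihood."""
    return log_prior(c) + log_likelihood(c)


def best_alignment(candidates):
    """Return the candidate with the highest unnormalized log posterior."""
    return max(candidates, key=log_posterior)


# Two ways to place the same intron near a splice boundary.
canonical = Candidate("canonical_one_mismatch", [1000], ["GT-AG"],
                      mismatches=1, aligned_bases=500)
noncanonical = Candidate("noncanonical_perfect", [1000], ["CT-AC"],
                         mismatches=0, aligned_bases=500)
print(best_alignment([canonical, noncanonical]).name)
```

Under these assumed parameters, the candidate with a canonical GT-AG boundary and one mismatch outscores the perfectly matching candidate with a non-canonical boundary. The prior penalty for the unusual splice site is larger than the likelihood penalty for a single sequencing error. With a lower assumed mismatch rate the ranking can flip, which is why such parameters need to be estimated from the data rather than fixed ad hoc.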
