Probabilistic base calling of Solexa sequencing data
Open Access
- 13 October 2008
- journal article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 9 (1), 431
- https://doi.org/10.1186/1471-2105-9-431
Abstract
Background: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data pose new challenges; currently a fair proportion of the tags are routinely discarded because they cannot be matched to a reference sequence, which reduces the effective throughput of the technology.
Results: We propose a novel base calling algorithm that uses model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content, which removes uncertain bases towards the ends of the reads.
Conclusion: We show that the method improves genome coverage and the number of usable tags by an average of 15% compared with Solexa's data processing pipeline. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
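The two ideas named in the Results (coding ambiguous bases with IUPAC symbols, and trimming a read to its most informative sub-tag) can be illustrated with a minimal Python sketch. It is not the authors' algorithm or the API of their R package. It assumes that per-cycle probabilities for A, C, G and T are already available (the model-based clustering step is not shown), and the function names, the `ratio` threshold and the `penalty` parameter are hypothetical choices made only for illustration.

```python
import math

BASES = "ACGT"

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases they cover.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}


def call_base(probs, ratio=0.5):
    """Return the IUPAC symbol covering every base whose probability is at
    least `ratio` times the largest probability (hypothetical rule)."""
    top = max(probs)
    kept = "".join(b for b, p in zip(BASES, probs) if p >= ratio * top)
    return IUPAC[kept]


def information(probs):
    """Information content in bits: 2 minus the Shannon entropy of the call."""
    return 2.0 + sum(p * math.log2(p) for p in probs if p > 0)


def best_subtag(prob_rows, penalty=1.0):
    """Return (start, end) of the contiguous window maximising the sum of
    (information - penalty); `penalty` is an assumed per-base cost that
    makes uncertain bases lower the score. Brute force, O(n^2)."""
    scores = [information(row) - penalty for row in prob_rows]
    best, best_span = -math.inf, (0, 0)
    for start in range(len(scores)):
        total = 0.0
        for end in range(start, len(scores)):
            total += scores[end]
            if total > best:
                best, best_span = total, (start, end + 1)
    return best_span


# Toy read: three confident cycles, one A/G ambiguity, two uninformative cycles.
rows = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.48, 0.02, 0.48, 0.02],
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
]
read = "".join(call_base(row) for row in rows)
start, end = best_subtag(rows)
subtag = read[start:end]
```

In this toy example the uncertain cycles at the end of the read lower the window score, so the selected sub-tag stops before them. The score actually used in the paper may differ from this one.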