Combinatorial analysis and algorithms for quasispecies reconstruction using next-generation sequencing
Open Access
- 5 January 2011
- journal article
- research article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 12 (1), 5
- https://doi.org/10.1186/1471-2105-12-5
Abstract
Next-generation sequencing (NGS) offers a unique opportunity for high-throughput genomics and has the potential to replace Sanger sequencing in many fields, including de novo sequencing, re-sequencing, metagenomics, and the characterisation of infectious pathogens such as viral quasispecies. Although methodologies and software for whole-genome assembly and genome variation analysis have been developed and refined for NGS data, reconstructing a viral quasispecies from NGS data remains a challenge. Such a reconstruction would be useful for analysing intra-host evolutionary pathways in relation to immune responses and antiretroviral therapy exposures. Here we introduce a set of formulae for the combinatorial analysis of a quasispecies, given an NGS re-sequencing experiment, and an algorithm for quasispecies reconstruction. We require that sequenced fragments are aligned against a reference genome, and that the reference genome is partitioned into a set of sliding windows (amplicons). The reconstruction algorithm is based on combinations of multinomial distributions and is designed to minimise the reconstruction of false variants, called in-silico recombinants. Applied to error-free simulated data, the algorithm reconstructed a high percentage of true variants, even at low genetic diversity, where the chance of obtaining in-silico recombinants is high. Results on empirical NGS data from patients infected with hepatitis B virus confirmed its ability to characterise different viral variants from distinct patients. The combinatorial analysis provided a description of the difficulty of reconstructing a quasispecies for a specified amplicon partition and measure of population diversity. The reconstruction algorithm performed well on both simulated and real data, even in the presence of sequencing errors.
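The set-up described in the abstract (a reference genome partitioned into sliding windows, with aligned reads supplying the variants seen in each window) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `sliding_windows` and `window_variant_frequencies`, the `(start, sequence)` read representation, the gap-free alignments, and the choice to anchor the last window at the end of the reference are all assumptions made for illustration. The per-window relative frequencies are simply the empirical estimates of a multinomial distribution over the variants observed in that window; the step that combines windows into full-length variants is not shown.

```python
from collections import Counter


def sliding_windows(genome_length, window_size, step):
    """Partition a reference into overlapping half-open windows (start, end).

    Hypothetical helper: the final window is anchored at the end of the
    reference so that every position is covered.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    if step > window_size:
        raise ValueError("step larger than window_size would leave gaps")
    windows = []
    start = 0
    while start + window_size < genome_length:
        windows.append((start, start + window_size))
        start += step
    windows.append((max(0, genome_length - window_size), genome_length))
    return windows


def window_variant_frequencies(reads, windows):
    """Estimate per-window variant frequencies from aligned reads.

    reads: iterable of (start, sequence) pairs, assumed to be aligned to
    the reference without gaps. Only reads that span a whole window
    contribute to that window's counts.
    """
    reads = list(reads)
    frequencies = []
    for w_start, w_end in windows:
        counts = Counter()
        for r_start, seq in reads:
            r_end = r_start + len(seq)
            if r_start <= w_start and r_end >= w_end:
                counts[seq[w_start - r_start:w_end - r_start]] += 1
        total = sum(counts.values())
        frequencies.append(
            {variant: n / total for variant, n in counts.items()} if total else {}
        )
    return frequencies


# Toy example: a 6-base reference, windows of length 4 with step 2.
toy_windows = sliding_windows(6, 4, 2)
toy_reads = [(0, "ACGTAC"), (0, "ACGTAC"), (0, "ACTTAC"), (2, "GTAC")]
toy_freqs = window_variant_frequencies(toy_reads, toy_windows)
```

In this sketch a read that only partly overlaps a window is ignored for that window, which keeps the counts interpretable as draws from a single distribution per window; a real pipeline would also need to handle gaps, indels, and sequencing errors.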