An MCMC algorithm for haplotype assembly from whole-genome sequence data
- 1 August 2008
- journal article
- research article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 18 (8), 1336-1346
- https://doi.org/10.1101/gr.077065.108
Abstract
In comparison to genotypes, knowledge about haplotypes (the combination of alleles present on a single chromosome) is much more useful for whole-genome association studies and for making inferences about human evolutionary history. Haplotypes are typically inferred from population genotype data using computational methods. Whole-genome sequence data represent a promising resource for constructing haplotypes spanning hundreds of kilobases for an individual. In this article, we propose a Markov chain Monte Carlo (MCMC) algorithm, HASH (haplotype assembly for single human), for assembling haplotypes from sequenced DNA fragments that have been mapped to a reference genome assembly. The transitions of the Markov chain are generated using min-cut computations on graphs derived from the sequenced fragments. We have applied our method to infer haplotypes using whole-genome shotgun sequence data from a recently sequenced human individual. The high sequence coverage and presence of mate pairs result in fairly long haplotypes (N50 length ∼ 350 kb). Based on comparison of the sequenced fragments against the individual haplotypes, we demonstrate that the haplotypes for this individual inferred using HASH are significantly more accurate than the haplotypes estimated using a previously proposed greedy heuristic and a simple MCMC method. Using haplotypes from the HapMap project, we estimate the switch error rate of the haplotypes inferred using HASH to be quite low, ∼1.1%. Our Markov chain Monte Carlo algorithm represents a general framework for haplotype assembly that can be applied to sequence data generated by other sequencing technologies. The code implementing the methods and the phased individual haplotypes can be downloaded from http://www.cse.ucsd.edu/users/vibansal/HASH/.
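The two summary statistics quoted in the abstract, N50 haplotype length and switch error rate, can be illustrated with a minimal Python sketch. This is not the HASH code. The function names `n50` and `switch_error_rate` are hypothetical, and the definitions used are the commonly assumed ones: N50 is the largest length L such that blocks of length at least L cover at least half of the total length, and the switch error rate is the fraction of adjacent heterozygous site pairs whose relative phase differs from a reference phasing. Whether the article computes them in exactly this way is an assumption.

```python
def n50(lengths):
    """Return the N50 of a collection of haplotype block lengths.

    Assumed definition: the largest length L such that blocks of
    length >= L together cover at least half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def switch_error_rate(inferred, truth):
    """Fraction of adjacent site pairs whose relative phase is wrong.

    Both arguments are equal-length strings over '0'/'1' giving one
    haplotype of a block at its heterozygous sites (hypothetical input
    format). The complementary haplotype is implied, so a fully
    flipped block counts as zero switches.
    """
    if len(inferred) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    if len(inferred) < 2:
        return 0.0
    # agree[i] records whether the inferred allele matches the truth at site i.
    agree = [a == b for a, b in zip(inferred, truth)]
    # A switch occurs wherever agreement changes between neighbouring sites.
    switches = sum(1 for i in range(1, len(agree)) if agree[i] != agree[i - 1])
    return switches / (len(agree) - 1)
```

For example, `n50([100, 200, 300, 400])` considers blocks in decreasing order until half of the total length is covered, and `switch_error_rate("00110", "00000")` counts the two places where the inferred phase flips relative to the reference.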