An MCMC algorithm for haplotype assembly from whole-genome sequence data

1 August 2008

journal article
research article
Published by Cold Spring Harbor Laboratory in Genome Research

Vol. 18 (8) , 1336-1346
https://doi.org/10.1101/gr.077065.108

Abstract

In comparison to genotypes, knowledge about haplotypes (the combination of alleles present on a single chromosome) is much more useful for whole-genome association studies and for making inferences about human evolutionary history. Haplotypes are typically inferred from population genotype data using computational methods. Whole-genome sequence data represent a promising resource for constructing haplotypes spanning hundreds of kilobases for an individual. In this article, we propose a Markov chain Monte Carlo (MCMC) algorithm, HASH (haplotype assembly for single human), for assembling haplotypes from sequenced DNA fragments that have been mapped to a reference genome assembly. The transitions of the Markov chain are generated using min-cut computations on graphs derived from the sequenced fragments. We have applied our method to infer haplotypes using whole-genome shotgun sequence data from a recently sequenced human individual. The high sequence coverage and presence of mate pairs result in fairly long haplotypes (N50 length ∼ 350 kb). Based on comparison of the sequenced fragments against the individual haplotypes, we demonstrate that the haplotypes for this individual inferred using HASH are significantly more accurate than the haplotypes estimated using a previously proposed greedy heuristic and a simple MCMC method. Using haplotypes from the HapMap project, we estimate the switch error rate of the haplotypes inferred using HASH to be quite low, ∼1.1%. Our Markov chain Monte Carlo algorithm represents a general framework for haplotype assembly that can be applied to sequence data generated by other sequencing technologies. The code implementing the methods and the phased individual haplotypes can be downloaded from http://www.cse.ucsd.edu/users/vibansal/HASH/.
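The abstract reports two summary metrics, an N50 haplotype length and a switch error rate, without defining them. The following is a minimal sketch of how such metrics are commonly computed, not the HASH implementation. The function names `n50` and `switch_error_rate`, the 0/1 string encoding of alleles at heterozygous sites, and the choice of block length units are all assumptions made for illustration; the article itself may define both quantities differently.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths.

    Assumed definition: the largest length L such that blocks of
    length >= L together cover at least half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def switch_error_rate(inferred, truth):
    """Fraction of adjacent site pairs whose relative phase disagrees.

    `inferred` and `truth` are equal-length sequences of alleles
    (for example "0101") for one haplotype over the same heterozygous
    sites. A switch is counted whenever agreement with the truth flips
    between consecutive sites, so a globally complemented haplotype
    scores zero errors.
    """
    if len(inferred) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    if len(inferred) < 2:
        return 0.0
    agree = [a == b for a, b in zip(inferred, truth)]
    switches = sum(1 for prev, cur in zip(agree, agree[1:]) if prev != cur)
    return switches / (len(agree) - 1)
```

Because only flips in agreement between consecutive sites are counted, the measure does not depend on which of the two haplotypes of the individual is labeled first.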
