SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples
- 27 October 2010
- journal article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 21 (6), 952-960
- https://doi.org/10.1101/gr.113084.110
Abstract
Reductions in the cost of sequencing have enabled whole-genome sequencing to identify sequence variants segregating in a population. An efficient approach is to sequence many samples at low coverage, then to combine data across samples to detect shared variants. Here, we present methods to discover and genotype single-nucleotide polymorphism (SNP) sites from low-coverage sequencing data, making use of shared haplotype (linkage disequilibrium) information. For each population, we first collect SNP candidates based on independent sequence calls per site. We then use MARGARITA with genotype or phased haplotype data from the same samples to collect 20 ancestral recombination graphs (ARGs). We refine the posterior probability of SNP candidates by considering possible mutations at internal branches of the 40 marginal ancestral trees inferred from the 20 ARGs at the left and right flanking genotype sites. Using a population genetic prior distribution on tree-branch length and Bayesian inference, we determine a posterior probability of the SNP being real and also the most probable phased genotype call for each individual. We present experiments on both simulation data and real data from the 1000 Genomes Project to prove the applicability of the methods. We also explore the relative tradeoff between sequencing depth and the number of sequenced samples.
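To make the idea of a per-site, per-sample Bayesian genotype call concrete, the following is a minimal sketch of the simplest version of that step. It is not the ARG-based method described in the abstract: it assumes a binomial read-error model, a fixed per-read error rate `err`, and a Hardy-Weinberg prior from an assumed alternate-allele frequency `alt_freq`. All function names and parameter values are hypothetical and chosen only for illustration.

```python
from math import comb


def genotype_likelihoods(n_ref, n_alt, err=0.01):
    """Return P(reads | g) for g = 0, 1, 2 copies of the alternate allele.

    Assumption: each read independently shows the alternate allele with
    probability err (g=0), 0.5 (g=1), or 1 - err (g=2).
    """
    n = n_ref + n_alt
    p_alt_by_genotype = (err, 0.5, 1.0 - err)
    return [
        comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref
        for p in p_alt_by_genotype
    ]


def genotype_posterior(n_ref, n_alt, alt_freq, err=0.01):
    """Posterior over genotypes 0/1/2 under a Hardy-Weinberg prior."""
    q = alt_freq
    prior = [(1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2]
    likelihoods = genotype_likelihoods(n_ref, n_alt, err)
    joint = [lik * pri for lik, pri in zip(likelihoods, prior)]
    total = sum(joint)
    return [j / total for j in joint]


def call_genotype(n_ref, n_alt, alt_freq, err=0.01):
    """Return (most probable genotype, its posterior probability)."""
    post = genotype_posterior(n_ref, n_alt, alt_freq, err)
    best = max(range(3), key=post.__getitem__)
    return best, post[best]


if __name__ == "__main__":
    # Low coverage: a single alternate read leaves the call prior-dependent.
    for freq in (0.01, 0.3):
        print(freq, call_genotype(n_ref=0, n_alt=1, alt_freq=freq))
```

With only one or two reads per sample, the posterior in this sketch is dominated by the prior, which is why low-coverage designs rely on information shared across samples (here a single allele frequency; in the abstract, haplotype structure inferred from ARGs).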