On the Optimal Design of Genetic Variant Discovery Studies

27 January 2010

journal article
Published by Walter de Gruyter GmbH in Statistical Applications in Genetics and Molecular Biology

Vol. 9 (1), Article 33
https://doi.org/10.2202/1544-6115.1581

Abstract

The recent emergence of massively parallel sequencing technologies has enabled an increasing number of human genome re-sequencing studies, notably the 1000 Genomes Project. The main aim of these studies is to identify as yet unknown genetic variants in a genomic region, mostly low-frequency variants (frequency below 5%). We propose here a set of statistical tools that address how to optimally design such studies in order to increase the number of genetic variants we expect to discover. Within this framework, the tradeoff between lower coverage for more individuals and higher coverage for fewer individuals can be resolved naturally. The methods here are also useful for estimating the number of genetic variants missed in a discovery study performed at low coverage. We show applications to simulated data based on coalescent models and to sequence data from the ENCODE project. In particular, we show the extent to which combining data from multiple populations in a discovery study may increase the number of genetic variants identified relative to studies on single populations.
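
The coverage-versus-sample-size tradeoff described in the abstract can be illustrated with a toy calculation. The sketch below is not the method proposed in the article; it is a minimal illustration under assumptions introduced here: a fixed total budget (number of individuals times mean coverage per individual), Poisson-distributed read depth, Hardy-Weinberg genotype proportions, and a hypothetical calling rule that a variant counts as discovered when at least `min_reads` reads carry it in at least one individual. All function and parameter names are illustrative.

```python
import math


def poisson_tail(lam, k):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k))
    # Clamp to guard against tiny negative values from floating-point rounding.
    return max(0.0, 1.0 - below)


def detection_probability(freq, n_individuals, coverage, min_reads=2):
    """Probability that a variant of allele frequency `freq` is seen in a study.

    Assumptions (not from the article): diploid individuals in Hardy-Weinberg
    proportions, Poisson read depth with mean `coverage`, and a variant is
    called in an individual when at least `min_reads` reads carry it.
    """
    het = 2.0 * freq * (1.0 - freq)   # heterozygous carriers
    hom = freq * freq                 # homozygous carriers
    # Heterozygotes contribute the variant allele on about half of the reads.
    per_individual = (het * poisson_tail(coverage / 2.0, min_reads)
                      + hom * poisson_tail(coverage, min_reads))
    return 1.0 - (1.0 - per_individual) ** n_individuals


def best_design(freq, budget, sample_sizes, min_reads=2):
    """Among designs with n * coverage == budget, pick the most powerful one."""
    designs = [(n, budget / n) for n in sample_sizes]
    return max(
        designs,
        key=lambda d: detection_probability(freq, d[0], d[1], min_reads),
    )


if __name__ == "__main__":
    budget = 400.0  # hypothetical total coverage to spread across individuals
    for n in (10, 25, 50, 100, 200, 400):
        c = budget / n
        p = detection_probability(0.01, n, c)
        print(f"n={n:4d}  coverage={c:6.1f}x  P(discover)={p:.3f}")
    print("best design (n, coverage):", best_design(0.01, budget, (10, 25, 50, 100, 200, 400)))
```

Under this toy model, very few individuals at high coverage rarely include a carrier of a rare variant, while very many individuals at very low coverage rarely yield enough supporting reads, so an intermediate design tends to maximize the discovery probability. The specific optimum depends entirely on the assumed calling rule and budget.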
