Abstract
There is strong evidence that rare variants are involved in complex disease etiology. The first step in implicating rare variants in disease etiology is their identification through sequencing in both randomly ascertained samples (e.g., the 1000 Genomes Project) and samples ascertained according to disease status. We investigated to what extent rare variants will be observed across the genome and in candidate genes in randomly ascertained samples, the magnitude of variant enrichment in diseased individuals, and the biases that can arise from how variants are discovered. Although sequencing cases can enrich for causal variants, when a gene or genes are not involved in disease etiology, limiting variant discovery to cases can lead to association studies with dramatically inflated false positive rates.
One focus of human genetics is localizing genes that are involved in the etiology of complex diseases. Although emphasis has been placed on mapping common variants, recent studies have demonstrated that rare variants also play an important role in complex trait etiology, and their identification should have a greater impact on risk assessment, disease prevention, and treatment due to their large genetic effects. Genome-wide association studies are used to identify common variants by genotyping tagSNPs that are proxies for common causal variants. This study design is not adequately powered for association studies of rare variants; instead, causal variants must be identified and then analyzed. With the development of sequencing technologies, it is feasible to sequence candidate genes and, soon, entire genomes to obtain data on rare variants for complex disease association studies. We investigated several questions that are germane to the discovery of rare variants within a sample; for example, the proportion of variants discovered within a random sample and the enrichment of causal variants within samples of cases compared with a random sample.
We also demonstrate that sequencing an excess of cases to discover variants and then genotyping the remaining samples is a design strategy that can lead to inflated false positive rates.
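To make the question of how many variants a random sample will reveal more concrete, the following is a minimal sketch and not the method used in this study. It assumes that the 2n chromosomes of n diploid individuals are independent draws from the population, so that a variant with minor allele frequency q is observed at least once with probability 1 - (1 - q)^(2n). The function names are hypothetical and chosen only for illustration.

```python
import math


def detection_probability(maf: float, n_individuals: int) -> float:
    """Probability that a variant with minor allele frequency `maf` appears
    at least once among the 2 * n_individuals chromosomes of a random sample.

    Assumption (not from the source): chromosomes are independent draws.
    """
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must lie in [0, 1]")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    return 1.0 - (1.0 - maf) ** (2 * n_individuals)


def sample_size_for_detection(maf: float, target: float) -> int:
    """Smallest number of diploid individuals for which the detection
    probability reaches `target`, under the same independence assumption."""
    if not 0.0 < maf < 1.0:
        raise ValueError("maf must lie strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    # Solve 1 - (1 - maf)^(2n) >= target for n and round up.
    return math.ceil(math.log(1.0 - target) / (2.0 * math.log(1.0 - maf)))


if __name__ == "__main__":
    for q in (0.001, 0.005, 0.01):
        p = detection_probability(q, 100)
        n95 = sample_size_for_detection(q, 0.95)
        print(f"MAF={q}: P(observed in 100 individuals)={p:.3f}, "
              f"n for 95% detection={n95}")
```

Under these assumptions, a variant with a frequency of 0.5% would be observed in roughly 63% of random samples of 100 individuals, which illustrates why rare variants are easily missed in small discovery samples.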