Tracing Sub-Structure in the European American Population with PCA-Informative Markers
Open Access
- 4 July 2008
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Genetics
- Vol. 4 (7), e1000114
- https://doi.org/10.1371/journal.pgen.1000114
Abstract
Genetic structure in the European American population reflects waves of migration and recent gene flow among different populations. This complex structure can introduce bias in genetic association studies. Using Principal Components Analysis (PCA), we analyze the structure of two independent European American datasets (1,521 individuals, 307,315 autosomal SNPs). Individual variation lies along a continuum, with some individuals showing high degrees of admixture with non-European populations, as demonstrated through joint analysis with HapMap data. The CEPH Europeans represent only a small fraction of the variation encountered in the larger European American datasets we studied. We interpret the first eigenvector of these data as correlated with ancestry, and we apply an algorithm that we have previously described to select PCA-informative markers (PCAIMs) that can reproduce this structure. Importantly, we develop a novel method that removes redundancy from the selected SNP panels by discarding correlated markers, thus increasing genotyping savings. Only 150–200 PCAIMs suffice to accurately predict fine structure in European American datasets, as identified by PCA. Simulating association studies, we couple our method with a PCA-based stratification correction tool and demonstrate that a small number of PCAIMs can efficiently remove false correlations with almost no loss in power. The structure-informative SNPs that we propose are an important resource for genetic association studies of European Americans. Furthermore, our redundancy removal algorithm can be applied to sets of ancestry informative markers selected with any method in order to select the most uncorrelated SNPs, significantly decreasing genotyping costs.
Author Summary
Genetic association studies seek to identify disease susceptibility genes through the analysis of genetic markers such as single nucleotide polymorphisms (SNPs) in large numbers of cases and controls. In such settings, the existence of sub-structure in the population under study (i.e., differences in ancestry among cases and controls) may lead to spurious results. It is therefore imperative to control for this possible bias. Such biases may arise, for example, when studying the European American population, which consists of individuals of diverse ancestry proportions from different European countries and, to some degree, also from African and Native American populations. Here, we study the genetic sub-structure of the European American population, analyzing 1,521 individuals for over 300,000 SNPs across the entire genome. Applying a powerful method that is based on dimensionality reduction (Principal Components Analysis), we are able to identify 200 SNPs that successfully represent the complete dataset. Importantly, we introduce a novel method that effectively removes redundancy from any set of genetic markers and may prove extremely useful in a variety of research scenarios for significantly reducing the cost of a study.
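The pipeline summarized above (PCA on a genotype matrix, selection of a small marker panel that reproduces the leading component, and removal of redundant markers) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the data are simulated, the function names are hypothetical, markers are scored by their squared loadings on the top `k` principal components, and redundancy is handled with a simple greedy correlation filter. The abstract does not describe the authors' selection or redundancy removal algorithms in detail, so this should not be read as their implementation.

```python
import numpy as np


def simulate_genotypes(n_per_group=100, n_snps=500, n_informative=50, seed=0):
    """Toy data (assumption): two groups whose allele frequencies differ
    only at the first n_informative SNPs. Genotypes are coded 0/1/2."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.2, 0.8, size=n_snps)
    freq_a, freq_b = base.copy(), base.copy()
    freq_a[:n_informative] += 0.15
    freq_b[:n_informative] -= 0.15
    geno_a = rng.binomial(2, freq_a, size=(n_per_group, n_snps))
    geno_b = rng.binomial(2, freq_b, size=(n_per_group, n_snps))
    labels = np.repeat([0, 1], n_per_group)
    return np.vstack([geno_a, geno_b]).astype(float), labels


def standardize(genotypes):
    """Center each SNP column and scale it to unit variance."""
    centered = genotypes - genotypes.mean(axis=0)
    std = centered.std(axis=0)
    std[std == 0] = 1.0  # leave monomorphic SNPs as all-zero columns
    return centered / std


def snp_scores(genotypes, k=1):
    """Score each SNP by its summed squared loading on the top k PCs."""
    _, _, vt = np.linalg.svd(standardize(genotypes), full_matrices=False)
    return (vt[:k] ** 2).sum(axis=0)


def select_markers(genotypes, n_markers, k=1, max_abs_corr=0.8):
    """Take SNPs in decreasing score order, skipping any SNP whose absolute
    correlation with an already chosen SNP exceeds max_abs_corr."""
    z = standardize(genotypes)
    order = np.argsort(snp_scores(genotypes, k))[::-1]
    chosen = []
    for j in order:
        if len(chosen) == n_markers:
            break
        if chosen:
            # Columns have unit variance, so this is the Pearson correlation.
            corr = np.abs(z[:, chosen].T @ z[:, j]) / z.shape[0]
            if corr.max() > max_abs_corr:
                continue
        chosen.append(int(j))
    return chosen


def first_pc(genotypes):
    """Coordinates of each individual along the first principal component."""
    u, s, _ = np.linalg.svd(standardize(genotypes), full_matrices=False)
    return u[:, 0] * s[0]


if __name__ == "__main__":
    genotypes, labels = simulate_genotypes()
    panel = select_markers(genotypes, n_markers=30)
    r = np.corrcoef(first_pc(genotypes), first_pc(genotypes[:, panel]))[0, 1]
    print(f"panel size: {len(panel)}, |corr| of PC1 (full vs panel): {abs(r):.3f}")
```

The sign of a principal component is arbitrary, so the comparison uses the absolute correlation between the first component of the full data and that of the reduced panel. On real data, the threshold `max_abs_corr` and the number of components `k` would need to be chosen empirically.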