Tracing Sub-Structure in the European American Population with PCA-Informative Markers

Abstract
Genetic structure in the European American population reflects waves of migration and recent gene flow among different populations. This complex structure can introduce bias in genetic association studies. Using Principal Components Analysis (PCA), we analyze the structure of two independent European American datasets (1,521 individuals, 307,315 autosomal SNPs). Individual variation lies along a continuum, with some individuals showing high degrees of admixture with non-European populations, as demonstrated through joint analysis with HapMap data. The CEPH Europeans represent only a small fraction of the variation encountered in the larger European American datasets we studied. We interpret the first eigenvector of these data as correlated with ancestry, and we apply an algorithm that we have previously described to select PCA-informative markers (PCAIMs) that can reproduce this structure. Importantly, we develop a novel method that removes redundancy from the selected SNP panels and show that it effectively removes correlated markers, thus increasing genotyping savings. Only 150–200 PCAIMs suffice to accurately predict fine structure in European American datasets, as identified by PCA. Simulating association studies, we couple our method with a PCA-based stratification correction tool and demonstrate that a small number of PCAIMs can efficiently remove false correlations with almost no loss in power. The structure-informative SNPs that we propose are an important resource for genetic association studies of European Americans. Furthermore, our redundancy removal algorithm can be applied to sets of ancestry-informative markers selected with any method in order to select the most uncorrelated SNPs, thereby significantly decreasing genotyping costs.

Genetic association studies seek to identify disease susceptibility genes through the analysis of genetic markers such as single nucleotide polymorphisms (SNPs) in large numbers of cases and controls.
In such settings, sub-structure in the population under study (i.e., differences in ancestry between cases and controls) may lead to spurious results. It is therefore imperative to control for this possible bias. Such bias may arise, for example, when studying the European American population, which consists of individuals with diverse proportions of ancestry from different European countries and, to some degree, from African and Native American populations. Here, we study the genetic sub-structure of the European American population, analyzing 1,521 individuals for over 300,000 SNPs across the entire genome. Applying a powerful dimensionality reduction method (Principal Components Analysis), we identify 200 SNPs that successfully reproduce the structure of the complete dataset. Importantly, we introduce a novel method that effectively removes redundancy from any set of genetic markers and may prove useful in a variety of research scenarios for significantly reducing the cost of a study.
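
To make the two steps summarized above concrete (selecting structure-informative SNPs with PCA, then pruning redundant ones), the following is a minimal sketch on simulated genotypes. It rests on assumptions not stated in the text: SNPs are scored by their squared loadings on the top principal axes, and a simple greedy pairwise-correlation filter stands in for the redundancy removal method, whose details are not given here. All function names, the threshold `r2_max`, and the simulated allele frequencies are hypothetical and chosen for illustration only.

```python
import numpy as np


def pca_marker_scores(G, k=1):
    """Score each SNP (column of G) by its squared loadings on the top-k
    principal axes. This scoring rule is an assumption made for illustration."""
    X = G - G.mean(axis=0)  # center each SNP
    # Rows of Vt are the principal axes expressed in SNP space.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return (Vt[:k] ** 2).sum(axis=0)


def select_markers(G, k=1, m=200):
    """Return the indices of the m highest-scoring SNPs, best first."""
    scores = pca_marker_scores(G, k)
    return [int(j) for j in np.argsort(scores)[::-1][:m]]


def remove_redundant(G, ranked, r2_max=0.9):
    """Greedy filter: walk the ranked SNPs and keep one only if its squared
    correlation with every SNP already kept is at most r2_max.
    Assumes the candidate SNPs are polymorphic (non-constant columns)."""
    kept = []
    for j in ranked:
        redundant = False
        for i in kept:
            r = np.corrcoef(G[:, i], G[:, j])[0, 1]
            if r * r > r2_max:
                redundant = True
                break
        if not redundant:
            kept.append(j)
    return kept


# Simulated data: two subpopulations that differ in allele frequency at the
# first 10 SNPs only; genotypes are minor-allele counts in {0, 1, 2}.
rng = np.random.default_rng(0)
n_per_pop, n_snps, n_informative = 100, 100, 10
freq_a = np.full(n_snps, 0.5)
freq_b = freq_a.copy()
freq_a[:n_informative] = 0.9
freq_b[:n_informative] = 0.1
G = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, n_snps)),
    rng.binomial(2, freq_b, size=(n_per_pop, n_snps)),
]).astype(float)

# Append an exact copy of SNP 0 (as column index 100) to mimic a redundant marker.
G = np.hstack([G, G[:, [0]]])

ranked = select_markers(G, k=1, m=11)
panel = remove_redundant(G, ranked, r2_max=0.9)
print("selected:", sorted(ranked))
print("after redundancy removal:", sorted(panel))
```

Note that structure-informative SNPs are correlated with one another by construction, since each tracks the same ancestry axis. A pairwise threshold therefore has to be set high enough to drop only near-duplicates, which is one reason a dedicated redundancy removal method is preferable to this simple filter.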