Needles in the Haystack: Identifying Individuals Present in Pooled Genomic Data

  • 9 February 2009
Abstract
Recent publications have described and used a novel metric that quantifies the genetic distance of an individual with respect to two population pools, and have suggested that the metric makes it possible to infer the presence of an individual of known genotype in a population for which only the marginal allele frequencies are known. However, the assumptions, limitations, and utility of this metric remain incompletely characterized. Here we present an exploration of the power and limitations of that method. Using real and simulated genotypes, we test the method's efficacy and its sensitivity to violations of the underlying assumptions. The results reveal that, when the metric is used to identify individuals as members of one of the two population groups, its specificity is low in several circumstances. We find that the misclassifications stem from violations of assumptions that are crucial to the technique yet hard to control in practice, and additionally that the specificity may still be low even in ideal circumstances if the individual in question strongly resembles a true positive. However, despite the metric's inadequacies for identifying the presence of an individual, we show that it may have utility for revealing the genetic similarity of an unseen individual to known groups, and may thus have some potential for inferring ancestry or predicting an individual's propensity to disease. By revealing both the power and limitations of the proposed method, we hope to elucidate situations in which this distance metric may be used in an appropriate manner. We also discuss the implications of the false-positive rate for the method's use in forensics and for the privacy of GWAS participants.
