Identifying Personal Genomes by Surname Inference

Abstract
Anonymity Compromised: The balance between maintaining individual privacy and sharing genomic information for research purposes has been a topic of considerable controversy. Gymrek et al. (p. 321; see the Policy Forum by Rodriguez et al.) demonstrate that the anonymity of participants (and their families) can be compromised by analyzing Y-chromosome sequences from public genetic genealogy Web sites that contain (sometimes distant) relatives with the same surname. Short tandem repeats (STRs) on the Y chromosome of a target individual (whose sequence was freely available and identified in GenBank) were compared with information in public genealogy Web sites to determine the shortest time to the most recent common ancestor and find the most likely surname, which, when combined with age and state of residency, identified the individual. When STRs from 911 individuals were used as the starting points, the analysis projected a success rate of 12% within the U.S. male population with Caucasian ancestry. Further analysis of detailed pedigrees from one collection revealed that families of individuals whose genomes are in public repositories could be identified with high probability.
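The matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it substitutes a simple count of mismatched markers for a real estimate of time to the most recent common ancestor, and the function names, thresholds (`min_shared`, `max_mismatch`), surnames, and repeat counts are all hypothetical toy values chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# A Y-STR profile maps a marker name to its repeat count.
Profile = Dict[str, int]


def mismatches(a: Profile, b: Profile) -> Tuple[int, int]:
    """Return (shared marker count, shared markers with differing repeat counts)."""
    shared = set(a) & set(b)
    diff = sum(1 for marker in shared if a[marker] != b[marker])
    return len(shared), diff


def infer_surname(
    target: Profile,
    records: List[Tuple[str, Profile]],
    min_shared: int = 3,    # hypothetical: minimum markers in common
    max_mismatch: int = 1,  # hypothetical: crude stand-in for a TMRCA cutoff
) -> Optional[str]:
    """Return the surname of the closest record, or None if nothing is close enough.

    Fewer mismatches is used here as a rough proxy for a more recent
    common ancestor; ties are broken by the number of shared markers.
    """
    best = None
    for surname, profile in records:
        shared, diff = mismatches(target, profile)
        if shared < min_shared or diff > max_mismatch:
            continue
        key = (diff, -shared)
        if best is None or key < best[0]:
            best = (key, surname)
    return None if best is None else best[1]


# Toy genealogy records (made-up surnames and repeat counts).
records = [
    ("Alder", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13}),
    ("Birch", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS392": 13}),
    ("Cedar", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS392": 13}),
]
target = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13}

print(infer_surname(target, records))
```

Returning `None` when no record passes the thresholds reflects the point that a surname is only worth reporting when the match is close; a candidate surname would then still need to be combined with other metadata, such as age and state of residency, to narrow down an individual.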