Sharing genetic risk scores can unwittingly reveal secrets

Genetic data can be analyzed to estimate the risk of certain conditions

Science Photo Library / Alamy

Genetic risk scores, which summarize a person’s likelihood of developing certain health conditions, can be reverse-engineered with mathematical tricks to reveal hidden details about their DNA.

In theory, health insurance companies could use the method to reconstruct genetic data from a summary genomic report, revealing health risks that the patient did not disclose. Alternatively, people who share their scores anonymously could be identified by extracting their genetic data and querying public genealogical databases.

A polygenic risk score measures the combined impact of tens to thousands of single-letter variations in the genome, known as single nucleotide polymorphisms (SNPs). The scores are used by researchers and DNA testing companies like 23andMe to summarize potential health risks, and are sometimes shared publicly, for example by people asking for advice on how to interpret them.

Unraveling a polygenic risk score is like trying to figure out a phone number knowing only that the sum of its digits is 52. It is an example of what mathematicians call a knapsack problem, which is known to be computationally difficult. For this reason, the scores have been considered a low privacy risk.
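To get a feel for how ambiguous the phone-number puzzle is, the short sketch below counts how many digit strings share a given digit sum. The 10-digit length is an assumption (the analogy does not specify one), and `count_digit_strings` is a name made up for this illustration.

```python
def count_digit_strings(length, target_sum):
    """Count strings of `length` decimal digits whose digits add up to `target_sum`."""
    # ways[s] = number of digit strings built so far whose digits sum to s
    ways = [1] + [0] * target_sum
    for _ in range(length):
        new_ways = [0] * (target_sum + 1)
        for partial_sum, count in enumerate(ways):
            if count == 0:
                continue
            for digit in range(10):
                if partial_sum + digit <= target_sum:
                    new_ways[partial_sum + digit] += count
        ways = new_ways
    return ways[target_sum]


# Assumption: a 10-digit phone number; the analogy gives only the digit sum.
phone_candidates = count_digit_strings(10, 52)
print(phone_candidates)
```

Every one of those candidates is consistent with the single number an attacker knows, which is why a bare sum gives little away.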

However, each SNP value used in the risk score is multiplied by an extremely precise weight, with up to 16 digits, that reflects its contribution to the overall risk of the disease. This makes small risk models vulnerable to attack.
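The effect of precise weights can be illustrated with a toy model. This is a minimal sketch, not the researchers' code: it assumes each SNP is coded as 0, 1 or 2 copies of the risk allele, that the score is a plain weighted sum, and that the weights are random numbers standing in for real model weights. All function names are hypothetical.

```python
import itertools
import random


def polygenic_score(weights, genotype):
    """Weighted sum of allele counts (assumed scoring rule for this sketch)."""
    return sum(w * g for w, g in zip(weights, genotype))


def matching_genotypes(weights, score, tol=1e-12):
    """Brute-force every genotype (0, 1 or 2 per SNP) and keep those hitting the score."""
    return [
        genotype
        for genotype in itertools.product((0, 1, 2), repeat=len(weights))
        if abs(polygenic_score(weights, genotype) - score) <= tol
    ]


rng = random.Random(0)

# Toy 8-SNP model with made-up high-precision weights.
precise_weights = [rng.uniform(-1.0, 1.0) for _ in range(8)]
secret_genotype = tuple(rng.choice((0, 1, 2)) for _ in range(8))
shared_score = polygenic_score(precise_weights, secret_genotype)

# With precise weights, very few genotypes (typically just one) reproduce the score.
precise_matches = matching_genotypes(precise_weights, shared_score)

# With coarse, identical weights, many genotypes collide on the same score.
coarse_matches = matching_genotypes([1.0] * 4, 4.0)

print(len(precise_matches), len(coarse_matches))
```

Brute force needs 3 to the power of the number of SNPs evaluations, so it is only practical for small models, which fits the point that small models are the exposed ones.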

“Because the final polygenic risk score is limited by the finite number of ways to arrive at that number and the statistically likely arrangement of the underlying SNPs, it can be derived with a high degree of precision,” says Gamze Gürsoy at Columbia University in New York.

Gürsoy and Kirill Nikitin, also at Columbia, ran 298 polygenic risk models that use 50 or fewer SNPs on genetic data from 2353 individuals. They systematically enumerated all the possible genomes that could have produced each given score, and filtered out those with many unusual mutations.
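The filtering step might look something like the sketch below. It is an illustration under stated assumptions rather than the published method: the allele frequencies, the 5 percent rarity cut-off and the `max_rare` limit are all invented for the example, and the function names are hypothetical.

```python
def rare_allele_count(genotype, allele_freqs, rare_below=0.05):
    """Count copies of alleles whose population frequency is under `rare_below`."""
    return sum(
        copies for copies, freq in zip(genotype, allele_freqs) if freq < rare_below
    )


def filter_plausible(candidate_genotypes, allele_freqs, max_rare=1):
    """Drop candidate genotypes that carry more rare alleles than `max_rare`."""
    return [
        genotype
        for genotype in candidate_genotypes
        if rare_allele_count(genotype, allele_freqs) <= max_rare
    ]


# Made-up frequencies for a 4-SNP model; the second and fourth alleles are rare.
toy_freqs = [0.40, 0.01, 0.30, 0.02]

# Made-up genotypes that are assumed to reproduce the same score.
toy_candidates = [(1, 0, 2, 0), (0, 2, 1, 1), (2, 1, 0, 0)]

plausible = filter_plausible(toy_candidates, toy_freqs)
print(plausible)
```

The middle candidate carries three copies of rare alleles and is discarded, leaving the two statistically likelier genotypes.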

Since a single SNP can be used for multiple polygenic risk models, Gürsoy and Nikitin were able to chain their attack using SNPs revealed by the smaller models to help solve the larger ones.

They were able to reconstruct donors’ genotypes with 94.6 percent accuracy, correctly predicting 2450 SNPs per individual. Tests showed that 27 SNPs were enough to identify an individual in a pool of half a million samples, and family members could be predicted with up to 90 percent accuracy. Individuals of African and East Asian descent were more easily identified because they are less well represented in genetic databases.
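A rough back-of-envelope calculation suggests why so few SNPs can single someone out. The sketch below makes the idealized assumption that each SNP takes one of three values independently, which real genomes do not satisfy, and then shows a lookup against a tiny invented pool. The SNP labels, donor IDs and `find_matches` helper are hypothetical.

```python
# Idealized count of distinct 27-SNP patterns (3 possible values per SNP).
possible_patterns = 3 ** 27
pool_size = 500_000
print(possible_patterns > pool_size)


def find_matches(pool, known_snps):
    """Return IDs of pool members whose genotype agrees with every known SNP."""
    return [
        person_id
        for person_id, genotype in pool.items()
        if all(genotype.get(snp) == value for snp, value in known_snps.items())
    ]


# Invented three-person pool with made-up SNP labels.
toy_pool = {
    "donor_1": {"snp_a": 0, "snp_b": 2, "snp_c": 1},
    "donor_2": {"snp_a": 1, "snp_b": 2, "snp_c": 1},
    "donor_3": {"snp_a": 0, "snp_b": 1, "snp_c": 1},
}

# SNP values assumed to have been recovered from shared risk scores.
recovered_snps = {"snp_a": 0, "snp_b": 2}

print(find_matches(toy_pool, recovered_snps))
```

In this toy pool, two recovered SNPs already narrow the search to a single donor; with far more possible patterns than people, the same logic scales to much larger databases.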

According to Gürsoy, 447 small, high-precision models in a public database of polygenic scores are vulnerable to this attack.

“We wanted to point out that the risk is low, but under [some conditions] there can still be some leakage,” says Gürsoy.

Ying Wang at Massachusetts General Hospital says that existing data protections and computational bottlenecks limit the risk of polygenic risk scores being misused in this way. “The results may serve as a warning that small models should be treated as potentially sensitive data in clinical reports and informed consent discussions,” Wang says.
