Questioning the limits of genomic privacy.


To the Editor: Recently, Im et al.1 presented a method that can infer an individual’s participation in a study when regression coefficients from quantitative phenotypes are available. They demonstrated that in an era of increasing use of high-throughput technologies to integrate multiple-omics data sets, the “problem of identifiability” necessitates the creation of robust methods (e.g., an annual certification process) that facilitate broad dissemination of study results without compromising a participant’s privacy. In this letter, we would like to qualify the conclusions of Im et al., and several other commentators,2–5 by illustrating that (1) despite the perceived ease of reidentification, anonymity (and genomic privacy in general, which subsumes anonymity and identifiability as critical elements of informational control) remains a valid and vital concept and (2) technologies and models currently exist that facilitate dissemination of useful health data without compromising privacy. We think that the topic addressed by Im et al. is all the more critical given that the European Union (EU), the United States (US), and other jurisdictions are presently reforming their privacy, data, and human subjects research protection frameworks.

As policymakers, scientists, and the public grapple with the growing data deluge and concerns about privacy, a key issue will be to examine the legal definition of “personal data.” The EU’s newly proposed data protection regulation defines personal data as “any information relating to a data subject.” A data subject is an “identified natural person” (i.e., a person whose identity data, such as name, address, or birth date, are known) or a “natural person who can be identified, directly or indirectly, by means reasonably likely to be used by… [a]… person.”6 A recent revision to the proposed regulation’s definition of “personal data” adds that “[i]f identification requires a disproportionate amount of time, effort, or material resources, the natural living person shall not be considered identifiable.”7 In the US, according to the Health Insurance Portability and Accountability Act of 1996 (HIPAA), “individually identifiable health information” is information that identifies the individual or for which “there is a reasonable basis to believe it can be used to identify the individual.”

Neither the EU’s proposed data protection regulation nor HIPAA provides a definition of “anonymous” or “anonymization,” terms that have distinct technical meanings.9 Nationally and internationally recognized definitions of “anonymous” do exist, though they unfortunately continue to lack terminological and technical standardization.3 For example, the EU’s Article 29 Working Party defines anonymous data as “any information relating to a… person where the person cannot be identified… taking into account all means reasonably likely to be used.”10 To us, this is a clear recognition of the concept and utility of anonymous data. Yet, when it comes to biological data, like DNA parameters, many believe that anonymity simply no longer exists: the legal term “identifiable” seemingly now applies to everyone, because every “anonymous” or “anonymized” person can sooner or later be identified by some technology and method.

This argument overlooks many critical points. First, a biospecimen in itself does not contain identity data. Even if it can be determined with a certain probability that a biospecimen originates from a specific individual by matching DNA data, such matching is different from assessing the identity of an individual.11,12 Furthermore, the more uncertainty there is in determining data for reidentification, the more anonymous the data become; absent true data authenticity, reidentification risks are minimal.13 Even when reidentification on the basis of deidentified or anonymized biomedical data is possible in principle because databases with voter registration data, hospital discharge data, and court proceedings are accessible, a survey showed that reidentification on the basis of properly “deidentified” (to say nothing of anonymous) data is extremely difficult to achieve in practice.14,15 In sum, lending unreasonable credibility to remote risks of reidentification confuses multiple, justifiably separate legal definitions of “personal data,” “data subject,” “anonymous,” and “anonymized” and leads to a burdensome “gross overexpansion of the [privacy] legal framework.”16 This in turn threatens the advancement of anonymity as a practical concept, curtails beneficial uses of data, and reduces the incentive to anonymize data or collect anonymous data.17 In both science and law, then, data anonymity remains a vital, ongoing concern. Remote exceptions cannot form the basis for a common rule. Data are not “personal” if they are “anonymous” or “anonymized.”

Second, similar to our objections to those who treat all data as “personal,” we think that there is a widespread failure to accept the rapid technological progress being made, particularly in genomics research and population biobanking, to simultaneously protect an individual’s privacy interests and promote scientific and biomedical breakthroughs.18–20 Current practices such as data access agreements already incorporate the annual certification process that Im et al. propose.21 There are ample reasons to move past the stale dichotomy and false choice of privacy or data utility and to embrace the possibilities of emerging technologies, processes, and projects. Far from potentially harming participants and researchers, methods and emerging technologies that work within a regulatory framework or legislation demonstrate how anonymity may facilitate innumerable benefits.

Certainly, privacy protection remains the most pressing concern at the interface of medical research and public participation. Indeed, there are areas that warrant greater focus by the scientific community, such as group-based privacy issues where, for example, “nontransparent allocation of individuals to groups based on known or inferred traits or some combination thereof can raise issues related to the ability to protect one’s own interest and avoid discrimination.”22 We share the concern of Im et al. and others that as science and technology advance, the use of additional human characteristics such as data will pose challenges to privacy interests, which may need to be reconceptualized to remain relevant in 21st-century science and medicine. Yet, concerns regarding the “problem of identifiability” as a veritable limit to genomic privacy must be tempered with nuance. It is only through recognition and acceptance of the ongoing practical utility of data anonymity, use of evidence to conclude that the risk of reidentification is remote, and adoption of successful emerging practices and technologies that we can achieve a “win-win” situation. Anonymous and useful data can be legally and ethically bridged while respecting t