One of the central issues of the coming years is whether the vast amounts of data being collected on human behavior will be used for good or evil.
Uses of social media, web-scraping, and consumer transaction data for the common good must confront the question of how to prevent those uses from sliding into abuses. How can people be assured that data collected about them will not be used to harm them?
It’s common to observe that digital technologies are evolving faster than any regulation governing them can match. Laws protecting personal data look naïve when reviewed against what we know about current technological capabilities. Those who wrote the HIPAA regulations were well intentioned in setting up rules to prevent reidentification of personal medical data by stripping identifiers from the records. However, it’s fairly easy to demonstrate that removing those fields fails to assure that outcome. In reidentification research, a researcher assumes the role of an “intruder” and attempts to identify a person in a data set from which the obvious personal identifiers have been removed. The typical finding undermines any belief that truly “anonymized” data are achievable.
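To make the intruder’s approach concrete, here is a toy sketch in Python. The records and the auxiliary “voter roll” are hypothetical, but the quasi-identifiers (ZIP code, birth date, and sex) are the ones used in the classic demonstrations of this attack:

```python
import pandas as pd

# "De-identified" medical records: names stripped, quasi-identifiers kept.
medical = pd.DataFrame({
    "zip": ["20007", "20008"],
    "birth_date": ["1965-03-12", "1971-09-30"],
    "sex": ["F", "M"],
    "diagnosis": ["hypertension", "diabetes"],
})

# Public auxiliary data (e.g., a voter roll) with names attached.
voters = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],
    "zip": ["20007", "20008"],
    "birth_date": ["1965-03-12", "1971-09-30"],
    "sex": ["F", "M"],
})

# Joining on the quasi-identifiers reattaches names to diagnoses,
# even though no explicit identifier was ever shared.
reidentified = medical.merge(voters, on=["zip", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```

When the combination of quasi-identifiers is unique, as it often is for a large fraction of a population, the join succeeds and the “anonymized” record is anonymous no longer.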
This area seems ripe for interdisciplinary work. There are multiple, albeit loosely connected, efforts under way in different fields.
Computer scientists have been developing “differential privacy” as an antidote to the unwanted disclosure of the identity of someone in a data set when statistical analyses are run against it. The approach deliberately increases the statistical uncertainty of information extracted from the data, typically by injecting calibrated noise. Its most valuable property is that, once certain parameters are set, the holder of a data set can quantify the risk incurred by revealing attributes of individual records. It also formally acknowledges that repeated queries against the same data increase the risk of identification. But computer science alone offers no guidance on what level of risk is appropriate to incur.
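A minimal sketch of one standard construction, the Laplace mechanism, illustrates the idea. The income figures, bounds, and privacy parameter epsilon below are invented for the example:

```python
import numpy as np

def laplace_mean(values, lower, upper, epsilon, rng=None):
    """Release a differentially private mean of `values`.

    After clipping to [lower, upper], one person can change the mean of
    n values by at most (upper - lower) / n, so Laplace noise with scale
    sensitivity / epsilon gives epsilon-differential privacy for this
    single query.
    """
    if rng is None:
        rng = np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Each query spends privacy budget: k queries at epsilon each compose
# to roughly (k * epsilon)-differential privacy, which is why repeated
# queries against the same data increase disclosure risk.
incomes = np.array([32_000, 45_000, 51_000, 78_000, 120_000])
print(laplace_mean(incomes, lower=0, upper=200_000, epsilon=0.5))
```

The parameter epsilon is exactly the knob the paragraph above describes: it lets the data holder quantify risk, but nothing in the mathematics says what value of epsilon a society should accept.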
Other computer scientists are building working production systems in which multiple data sets are combined for joint analysis but no linked product can ever be extracted. This is especially attractive when two data holders have no rights to access each other’s data yet share a desire for statistical products from the combination of the two data sets.
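One building block behind such systems is additive secret sharing, in which each holder splits its value into random shares so that only the combined statistic is ever reconstructed. A toy sketch follows; the counts are invented, and real systems involve far more machinery than this:

```python
import secrets

MODULUS = 2**61 - 1  # arithmetic over a large field keeps shares uniform

def make_shares(value):
    """Split `value` into two additive shares, each individually random."""
    r = secrets.randbelow(MODULUS)
    return r, (value - r) % MODULUS

# Each holder shares its private total; neither ever sees the other's data.
a_share1, a_share2 = make_shares(1_250)   # holder A's private count
b_share1, b_share2 = make_shares(3_970)   # holder B's private count

# One party sums the first shares, the other the second; only the
# combined total is reconstructed, never either input alone.
partial1 = (a_share1 + b_share1) % MODULUS
partial2 = (a_share2 + b_share2) % MODULUS
joint_total = (partial1 + partial2) % MODULUS
print(joint_total)  # 5220
```

Each share alone is statistically useless, so neither holder learns the other’s count, yet the joint statistic both parties want is computed exactly.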
Statisticians have taken a different approach. Instead of altering the analysis output to protect privacy, they use models to create “synthetic” data. A synthetic data set is built to mimic the statistical properties of the real data set: the averages of the variables in the real data are reproduced in the synthetic data, relationships between pairs of variables are maintained, and so on. The synthetic data, although derived from real data describing real people, contain no records from those individuals. To improve the efficiency of analysis and to measure the extra uncertainty due to the synthetic nature of the data, many synthetic data sets are often created from the same real data set. The technique still faces the question of how similar a synthetic record can be to a real record before the method has, in effect, revealed an individual.
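A minimal sketch of the idea, assuming a simple multivariate normal model in place of the far richer models statisticians actually use (the “real” data here are themselves simulated):

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in "real" data: two correlated variables (say, age and income).
real = rng.multivariate_normal(
    [40, 60_000], [[100, 30_000], [30_000, 4e8]], size=500
)

# Fit a simple model to the real data...
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...then draw several synthetic data sets from the fitted model.
# Means and the covariance between variables are preserved on average,
# but no synthetic row corresponds to a real person.
synthetic_sets = [
    rng.multivariate_normal(mu, cov, size=len(real)) for _ in range(5)
]

# Analyses are run on each set; the spread of estimates across sets
# measures the extra uncertainty due to using synthetic data.
estimates = [s[:, 1].mean() for s in synthetic_sets]
print(np.mean(estimates), np.std(estimates))
```

The multiple draws are what let analysts quantify the added uncertainty, but they do nothing to answer the harder question of when a synthetic record sits too close to a real one.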
Legal scholars and practicing lawyers are inventing regulatory frameworks that protect the privacy of individuals whose records lie in data sets but also permit the extraction of information from analysis of those records.
Finally, some philosophers are taking on the task of articulating principles of data ethics. How should individuals whose records are held by others weigh the risks of disclosure against the benefits of data access? How should researchers who wish to behave ethically approach the privacy of data they hold? Who possesses the right to control which analyses are conducted on data? Which promises of privacy can be kept, and which cannot?
Assembling a group of computer scientists, statisticians, legal scholars, and philosophers, all thinking about the same issues but each from a different perspective, could be a productive way to make progress on the intertwined issues of privacy, information, and policy formation.