
Protecting Privacy While Serving the Common Good

One key challenge facing the new world of high-dimensional data is the protection of individual privacy.

An interesting fact about statistical uses of data is that there is no value in inspecting individual records. Statistics relies on aggregations of records; a sample size of one is generally uninformative. Statistical users are not interested in particular individuals, so all of their operations can be done without any identifiers on the case records.

Statisticians collecting data commonly promise those from whom data are sought that the data will never be revealed in a way that can be associated with their identity. If data providers completely trusted the statistician, there would be no concern that information about them might be revealed to anyone else.

Over the decades, statisticians have developed methods of protecting data from abuse. Some organizations have contractual arrangements with data analysts under which, if data were revealed, the analyst (or his/her employing organization) would pay large financial penalties. All US Federal statistical agencies (e.g., the Census Bureau, the Bureau of Labor Statistics) operate under laws covering access to confidential data that provide for criminal fines and imprisonment if breaches occur.

Most statistical organizations separate identifying information from other data as soon as possible after collection. Many organizations sponsoring statistical analysis construct computer environments detached from the wider Internet, with access limited to those with pre-vetted rights to the data. When merging multiple data sets on the same individuals, some organizations create personal key variables that uniquely identify each person but strip off all other identifying data. The file linking the unique key to the other identifying information is kept offline, to protect it from any hacking. The final merged files may not even be permanently assembled; separate component files are stored in different locations.
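Here is a minimal sketch of that key-splitting idea, assuming a simple CSV layout; the column names, file paths, and key format are illustrative, not any agency's actual practice.

```python
# Sketch: replace direct identifiers with a random key, and store the
# key-to-identity linkage in a separate file that can be kept offline.
# Column names ("name", "address", "income", "age") are invented here.
import csv
import secrets

def split_identifiers(in_path, data_path, linkage_path):
    with open(in_path, newline="") as f_in, \
         open(data_path, "w", newline="") as f_data, \
         open(linkage_path, "w", newline="") as f_link:
        reader = csv.DictReader(f_in)
        data_out = csv.DictWriter(f_data, fieldnames=["key", "income", "age"])
        link_out = csv.DictWriter(f_link, fieldnames=["key", "name", "address"])
        data_out.writeheader()
        link_out.writeheader()
        for row in reader:
            # A random token carries no information about the person;
            # only the offline linkage file can tie it back to an identity.
            key = secrets.token_hex(16)
            data_out.writerow({"key": key,
                               "income": row["income"],
                               "age": row["age"]})
            link_out.writerow({"key": key,
                               "name": row["name"],
                               "address": row["address"]})
```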

When data are made available for public statistical uses, most organizations conduct rigorous “disclosure risk” assessments. After the obvious personal identifiers (e.g., name, address, administrative identifiers) are stripped off, individual variables that might permit indirect identification are “perturbed.” That is, quantitative variables, like income, might be collapsed into categories, with a top-end category (e.g., $250,000 or more) that assigns the same code to many records. Alternatively, random statistical “noise” is added to the data in a manner that does not affect the arithmetic means of the individual variables. Variables not amenable to such coding are excluded from the anonymized public file. Finally, techniques of “differential privacy” are applied, so that the statistical analyses conducted on a data file are altered to a degree that gives individual identities known levels of protection.
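A rough sketch of those three ideas follows: top-coding, zero-mean noise, and a Laplace-mechanism mean in the spirit of differential privacy. The $250,000 threshold comes from the example above; the epsilon, income bounds, and sample values are illustrative assumptions.

```python
import math
import random

TOP_CODE = 250_000  # top-end category: "$250,000 or more"

def top_code(income):
    # Collapse all high incomes into a single top category.
    return min(income, TOP_CODE)

def laplace_noise(scale):
    # Zero-mean Laplace draw via inverse-CDF sampling; adding it to a
    # value leaves the expected (arithmetic) mean unchanged.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_mean(incomes, epsilon=0.5):
    # Laplace mechanism for a mean of values bounded in [0, TOP_CODE]:
    # the sensitivity of the mean over n such values is TOP_CODE / n.
    clipped = [max(0, top_code(v)) for v in incomes]
    true_mean = sum(clipped) / len(clipped)
    scale = TOP_CODE / (len(clipped) * epsilon)
    return true_mean + laplace_noise(scale)

print(dp_mean([42_000, 67_500, 310_000, 88_000, 125_000]))
```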

It is true that all privacy-protecting procedures depend, in one way or another, on the integrity of the people handling the data. If I, as a data provider, did not trust that the procedures reviewed above were implemented as advertised, I could easily feel that my privacy was at risk. But I also suspect that few members of the public know about these processes.

My own personal records have been hacked from multiple organizations in widely publicized events. Each of these organizations was using my records not for statistical purposes but for administrative ones (e.g., personnel, credit). Such record systems need personally identifying information to provide their value to the administrative processes they serve, so they cannot use many of the data protection techniques above.

Looking forward, what can statistical organizations do to generate trust? Many of the procedures above protect the identities of individuals. In this regard, statistical uses of data offer a much safer environment for individual data than administrative environments, whose uses of the data demand identifiers. As I think about the future, I wonder whether data providers’ trust in statistical uses might improve if they knew this.

One thought on “Protecting Privacy While Serving the Common Good”

  1. This is a great start and it is fantastic that population statisticians, the original “big data” scientists, are working to educate people about the potential abuses and issues around collecting personal information.

    Some things are bleaker than the post hints at. For example, having a system physically disconnected from the Internet is no panacea for data protection. Two spectacular, public examples of this are Stuxnet, in which Iran’s physically disconnected uranium-enrichment systems were destroyed, and SIPRNet, where the U.S. government’s systems were infiltrated and data exfiltrated, even though they were physically disconnected from the Internet. No amount of air gaps or magic crypto boxes protected either of these systems from attack and, in the case of SIPRNet, data loss. As such, physical disconnection from the Internet is a good best practice, as it deters amateurs, but it is by no means sufficient to protect individuals’ data.

    Likewise, obfuscating data elements may seem like a sensible way of protecting personal information. If your name and address are replaced with random tokens (the “unique key” mentioned above), your identity is protected, no? Well, the answer is ‘No.’ Even with seemingly random identifiers, given enough data points (IBM has a patent where the magic number is seven), one can reconstruct with very high accuracy who the individual is (see the sketch after this comment).

    Is all hope lost? The good news is that, again, the answer is ‘No.’ Work by folks like Profs. O’Neill and Newport on distributed, collaborative computation, and by Profs. Sherr and Zhou on privacy-preserving protocols, can provide cryptographic protections over the data, such that no one, not even the people collecting the data, can figure out who is whom, yet still obtain statistically meaningful results from analyzing the data.
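A toy sketch of the re-identification risk the comment describes; the records, fields, and the attacker’s “known facts” are all invented for the example.

```python
# Toy illustration: even with names replaced by random tokens, a few
# "quasi-identifiers" can single a person out. All records are invented.
from collections import Counter

anonymized = [
    {"key": "a91f", "zip": "20057", "birth_year": 1961, "sex": "F"},
    {"key": "7c22", "zip": "20057", "birth_year": 1984, "sex": "M"},
    {"key": "03bd", "zip": "20010", "birth_year": 1961, "sex": "F"},
]

# Count how many records share each (zip, birth_year, sex) combination;
# a count of 1 means that combination pins down a single record.
combos = Counter((r["zip"], r["birth_year"], r["sex"]) for r in anonymized)

# An attacker who knows these three facts about a neighbor can now pick
# out the neighbor's "anonymous" record, random token and all.
known = ("20057", 1961, "F")
if combos[known] == 1:
    match = next(r for r in anonymized
                 if (r["zip"], r["birth_year"], r["sex"]) == known)
    print("Unique match; record re-identified:", match["key"])
```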
