Each day, we seem to be inundated with two types of media stories simultaneously — 1) how “big data” will usher in a world of heightened convenience and efficiency for all and 2) how relentless tracking of our personal information threatens our autonomy as human beings.
In prior decades, much of our collective understanding of how people felt about issues, what activities they pursued, and what knowledge they possessed about key issues facing their lives came from directly questioning them. The questions were components of sample surveys, through which a scientific sample of the full population was systematically measured and their answers statistically aggregated to describe the full population. The selected respondents to these surveys were given pledges that their answers would remain confidential to the survey organization, and only statistical aggregations would be constructed by combining their answers with many others.
One property of this prior world was that respondents were aware of which of their attributes would be known through the survey (i.e., only the questions answered by them). A second property was that their participation was voluntary, and the proposed uses of the data could be a factor in whether they chose to respond. A third property was that most institutions collecting survey data earned the trust of respondents that the pledges of confidentiality would be honored.
Over the decades, this protocol worked fairly well. There were very few violations of the confidentiality pledges. There was effective dissemination of information to the public to describe key features of their world — how well the government was perceived to be fulfilling their needs; how well-off the public was on basic attributes of income, educational attainment, and health status; how safe from crime different populations found themselves; and how well businesses were performing. That is, by sample persons giving up their privacy to provide data held confidential and used only for statistical purposes, the full society was informed about how well it was doing. Indeed, the data were designed to achieve this common good outcome.
Enter the Internet and unobtrusive data collection on persons, users, and members of services.
This new world produces data as auxiliary to other processes (traffic management, search algorithms, mobile phone location identification, social media communication, and credit card use). We, as individuals, use these services and, in return for the personal benefits of the services, provide personal information to the service (this is generally authorized in the fine print of use agreements that most of us don’t read before quickly hitting the “Agree” button).
These data are attractive to social scientists because they are fine-grained temporally (some almost real-time), they are plentiful (trillions of observations versus thousands of survey respondents), and they track some behaviors that seem important to understanding how society is functioning. Will they become the equivalent of the ubiquitous survey data of the 20th century?
What’s new about this world is that the data weren’t designed to answer any particular economic or social question. Further, they are lean in number of attributes measured on each observation (i.e., we don’t know a lot about whoever initiated the data burst). Finally, they are not held by institutions whose mission is to extract information for common good purposes. Instead, they largely come from businesses that use the data to provide their services.
Most social scientists feel that this new data world has promise to unlock new insights about human thought and behavior. But it’s a different world — there is no defined infrastructure to coordinate the access to diverse data sources. It seems clear (to me, at least) that the winning society in the future will create a way to address privacy concerns of data access, private sector data holder concerns, and needs of researchers to combine diverse data to create more insights. This will require a new set of structures to assure privacy rights of individuals and verification that the data usage does indeed serve common good purposes. If the new world does not combine these new data sources for common good purposes, we will all take a step backward.