For those who believe that better decisions are made using data relevant to the issue, not merely human judgment based on individual experience, we live in a glorious age. Times were not always so. During the 1930s, in the midst of a depression, there was no reliable method of knowing what portion of the population was unemployed. The invention of statistical sampling, with its power to describe the attributes of large populations, changed all of that. For the latter half of the 20th century, large-scale surveys were the chief tool used to describe our economic status, our satisfaction with the government and basic institutions, the distribution of educational attainment, health care access, and crime victimization; basically all we know about ourselves.
Most of these data were collected by government or academic organizations, and the results were freely shared with the public. In a democracy, freely shared objective statistical information is a tool of an informed citizenry to determine its future. In a way, the time that the sample households spent in providing their answers to survey questions was recompensed by the benefit to the common good of better information about how the country was doing. Over time, public use data files with anonymized records were made available for research uses, catalyzing the quantitative social sciences. The common good was further supported by deep analysis of key questions facing all societies — which factors affect the likelihood of poverty, which government programs work, and which don’t.
Our world today is different. The sample survey data designed for common-good purposes form a much smaller portion of the data that exist on human behavior. Many Internet-linked processes generate real-time data. Sensors track utility use; digital transaction records track each health care encounter; cell phones record physical movements; social media document relationships among people; credit card records document purchases; search terms track information requests; and on and on. It is estimated that the size of these record bases tracking the population’s behavior increases at a rate of 40% a year. Their sizes overwhelm all known social science and government statistical system data.
Much of the commentary about so-called “big data” has been focused on privacy concerns. Most of these data are collected by private sector entities. Those organizations “own” the data; they use the data to further the goals of the organization. For competitive reasons, they keep their uses of the data shielded. At the same time, commercial data-assembling firms are linking all possible data sources they can acquire to build large data sets used for marketing.
Concerns are raised about whether the uses of individual data may lead to indirect harm to some. Behavior that might be viewed as private might be revealed. The risk of embarrassment or more serious harm may be real.
There is ambiguity about who “owns” data about me. In providing data to another party, I am “loaning” them the data for specific uses; am I “gifting” them all control over my data?
There is a massive contrast between the current data world and the world built up by pre-designed surveys. In the survey world, large numbers of attributes were recorded for small numbers of statistically selected persons; in the current world, very small numbers of attributes are collected on large, amorphously defined populations. The new data world consists of very lean data, sometimes recording only one attribute (e.g., kilowatt-hours of electricity used). Many of the questions facing society cannot be answered solely by such data sources, because many of them demand contrasts between subgroups (do large households consume more energy? what is the impact on energy use of the age of the dwelling, of work patterns outside the home, and of the age of residents? what is the effect of energy use on the health status of the household? do higher energy usage patterns affect health differently for poor and rich households? does education affect the relationship between energy use and health?).
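The limitation described above can be made concrete with a small sketch. All household IDs, readings, and attribute names below are hypothetical; the point is only that a lean one-attribute file (meter readings in kilowatt-hours) supports no subgroup contrasts until it is linked, on a shared identifier, to survey records that carry the missing attributes:

```python
# Hypothetical lean "big data" records: household_id -> kWh of electricity used.
# On its own, this file can describe total or average usage, nothing more.
meter_readings = {101: 850, 102: 430, 103: 1200, 104: 390, 105: 980}

# Hypothetical survey records carrying the attributes the meter data lack.
# Household 104 was not in the survey sample, so it drops out of the linkage.
survey = {
    101: {"household_size": 4},
    102: {"household_size": 1},
    103: {"household_size": 5},
    105: {"household_size": 3},
}

# Link the two sources on the shared household identifier.
linked = [
    {"kwh": meter_readings[hid], **attrs}
    for hid, attrs in survey.items()
    if hid in meter_readings
]

# Only the linked file supports a subgroup contrast:
# do larger households use more energy?
def mean_kwh(records, predicate):
    matching = [r["kwh"] for r in records if predicate(r)]
    return sum(matching) / len(matching)

large = mean_kwh(linked, lambda r: r["household_size"] >= 3)
small = mean_kwh(linked, lambda r: r["household_size"] < 3)
print(f"large households: {large:.0f} kWh, small households: {small:.0f} kWh")
```

In practice such linkage is far harder than this toy join suggests (identifiers rarely align cleanly, and disclosure risks must be managed), which is why the statistical modeling of multiple data sets, rather than one merged file, is the more realistic path.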
Using “big data” resources to answer such questions requires a conjoining of different big data resources, often in combination with statistical sample survey data with known inferential properties. This will likely not be the construction of one big data set, but the statistical modeling of multiple data sets simultaneously, borrowing strength from each of them to understand key social and economic issues more fully. We at Georgetown are seeking to partner with diverse groups to build such a research environment through the McCourt School of Public Policy’s Massive Data Institute.
We will succeed only if we bring the privacy concerns about big data into the open, discuss them with those having these concerns, and find ways to ensure common good benefits are achieved while full respect is given to those concerns. “Big data in the service of others” should be the watchword of the McCourt School’s Massive Data Institute.
As an economics student at Georgetown, I have been really excited by the recent developments at the McCourt School of Public Policy, especially the announcement that it will include a Massive Data Institute. At the same time, however, I have not been able to find any information anywhere about what the Massive Data Institute would be like. What is its goal? Would the data be strictly for students of the public policy school, or would it be accessible to the broader Georgetown community? Would the institute provide room for interdisciplinary research? And most importantly, what kind of data would it have? What exactly are the types of massive data sets the McCourt School or the University plans to acquire? Ever since the announcement of the establishment of the McCourt School of Public Policy, I have been trying to find information about the vision for the school, and especially for the Massive Data Institute, and I have not found anything yet.