With Georgetown’s McCourt School of Public Policy building the Massive Data Institute, we find ourselves in conversations about “big data” and common-good uses of data every day of the week.
I sometimes think of this moment in history as similar to the period when machine-powered automobiles were being invented. Many of the first cars looked like converted buggies originally designed to be pulled by horses. Horsepower was the dominant paradigm of travel, and it took time to move from the old paradigm to a new one, in which cars began to resemble the form they have kept for many decades.
I think we’re at a similar point with social science data. For some decades, at least since the 1940s, the dominant paradigm for measuring human thought and behavior in large populations has been the sample survey. Sample surveys are the tool most countries use to monitor unemployment rates, industrial production, retail sales, household incomes, educational achievement, voting behavior, and customer satisfaction. In the academy they have been used to study the transmission of wealth across generations, the antecedents of poverty, and even the prevalence of religious beliefs.
The necessary ingredients for a successful survey are a universal listing of all members of the population, a sampling design that gives each member a known, positive chance of selection, successful measurement of the sampled members, and statistical analysis that reflects the sampling process. With those ingredients the researcher is assured of unbiased estimates for the full universe of members, with calculable margins of error.
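The design-based logic behind those ingredients can be sketched with a toy Horvitz-Thompson estimator: each sampled member's value is weighted by the inverse of its known selection probability, which is what delivers the unbiasedness. The population values and selection probabilities below are invented purely for illustration, not drawn from any real survey.

```python
# Toy illustration of design-based estimation from a probability sample.
# Population values and selection probabilities are made up for this example.
import random

random.seed(42)

# A universal listing (frame) of a small population, with a known value each.
population = {i: float(i % 10) for i in range(1000)}  # true mean is 4.5

# Unequal, but known and positive, selection probabilities.
probs = {i: 0.05 if i < 500 else 0.10 for i in population}

# Draw a Poisson sample: each member included independently with its probability.
sample = [i for i in population if random.random() < probs[i]]

# Horvitz-Thompson estimate of the population total: weight each observation
# by the inverse of its selection probability, then divide by population size.
ht_total = sum(population[i] / probs[i] for i in sample)
ht_mean = ht_total / len(population)

print(f"True mean: {sum(population.values()) / len(population):.2f}")
print(f"HT estimate of mean: {ht_mean:.2f}")
```

Because the selection probabilities are known, the weighted estimate is unbiased over repeated sampling even though the two halves of the population were sampled at different rates.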
Surveys offer simultaneous, uniform measurement of many attributes of the sample members, designed by the researchers themselves, but they are slow to conduct and, because of high costs, weak at reliable spatial or subgroup description. Further, with the growing difficulty of obtaining high participation rates, survey costs have risen much faster than inflation.
Then “big data” comes along. These are data that are sometimes auxiliary to other service activities (like search terms, credit card transactions, and webpage content) or generated by social media (Twitter, Facebook, etc.). Increasingly, they are data from sensors (the internet of things), such as fitness-band data.
How do we get from the old paradigm of probability sample surveys describing large populations to a new paradigm? Some now believe that the advantages of big data will kill off the sample survey. They note that the new data resources solve the timeliness weakness of the slow sample survey. They note that spatial granularity is nearly limitless because the data can be tagged with GPS coordinates.
But the big data bring weaknesses also. First, they don’t cover well-defined populations. For example, we don’t know in what ways Twitter, Facebook, and other platform subscribers differ from the full human population. Are they systematically younger/older, more/less educated, richer/poorer, more/less socially active, etc.? Since not everyone is covered by these data systems, the data themselves don’t tell us what part of the picture is missing.
Second, the new data systems tend to be lean in attributes: from any single data set we can’t build a rich description of a population member on multiple characteristics at the same moment. Hence some call them the “exhaust” of real life, not real life itself.
So, what’s a social scientist to do? One important step, it seems clear, is to exploit both the spatial and temporal granularity of big data and the universal coverage, standardized measurement, and multivariate nature of sample survey data. This will require combining the two types of data sources.
This won’t be easy. We’ll encounter mismatches on all dimensions. Measurement quality will differ between data sets; time of measurement will differ; spatial granularity will differ. Statistical models will have to be constructed to bridge those differences.
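One simple form such a bridging model can take, sketched here under strong assumptions (error variances treated as known, the big-data proxy's bias treated as known and constant), is inverse-variance blending of a noisy survey estimate with a more precise but biased big-data signal. All the numbers below are illustrative; real applications estimate these quantities from data.

```python
# Sketch: blend an unbiased but noisy survey estimate with a precise but
# possibly biased big-data proxy for the same quantity. All numbers are
# illustrative; real applications (e.g., small-area estimation models)
# estimate the variances and the proxy's bias from the data themselves.

def blend(survey_est, survey_var, proxy_est, proxy_var, proxy_bias):
    """Inverse-variance weighted combination after correcting the proxy's
    assumed bias. The weights favor whichever source is more precise."""
    corrected = proxy_est - proxy_bias
    w_survey = 1.0 / survey_var
    w_proxy = 1.0 / proxy_var
    est = (w_survey * survey_est + w_proxy * corrected) / (w_survey + w_proxy)
    var = 1.0 / (w_survey + w_proxy)
    return est, var

# Hypothetical county-level unemployment rate: the survey says 6.0% with a
# large variance; a big-data signal says 5.8% but carries a known upward
# coverage bias of 0.6 points.
est, var = blend(survey_est=6.0, survey_var=0.64,
                 proxy_est=5.8, proxy_var=0.16, proxy_bias=0.6)
print(f"Blended estimate: {est:.2f}% (variance {var:.3f})")
```

The blended variance is smaller than either source's alone, which is the payoff of combining them; the hard part in practice is exactly what the paragraph above describes, namely modeling the mismatches well enough to justify the bias correction and the variance terms.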
I foresee a world in which we’ll blend surveys covering full populations (not just members of a platform) with continuous-time, sensor, and other data on subsets of the population. That may become our standard paradigm. To evolve to this paradigm, we need careful and sustained work of social scientists, computer scientists, and mathematical statisticians. It won’t be successfully accomplished by one discipline alone.
Hence, the Massive Data Institute must, of necessity, be multidisciplinary and, if successful, will become interdisciplinary.