There has been a dominant paradigm in much of empirical social science for most of my life: conclusions based on statistical analyses of samples from large populations fully covered by some sort of listing of their members. This paradigm has supplied national statistical monitoring (e.g., the monthly unemployment rate from the Current Population Survey), causal inference about electoral behavior (e.g., the National Election Studies), insights into the mechanisms leading to poverty (e.g., the Panel Study of Income Dynamics), and a host of other important knowledge sets.
A parallel branch of research exists, but it consists of studies that collect rich data on local areas (e.g., the Framingham Heart Study), longitudinal studies of special populations (e.g., the Bennington College Study), and a multitude of qualitative ethnographic studies of urban communities.
What are the attributes of social science data that are valued and how do survey data match up to them?
1. Surveys tend to be weak on spatial granularity.
While the sample survey serves the needs of national description, it often fails to be useful to policy makers at the local level. For example, while the National Crime Victimization Survey provides consistent estimates of crimes experienced by people in the US, it is not large enough to provide estimates for most individual urban areas, let alone neighborhoods. But actions and policies on crime are often implemented at the local level.
2. Surveys are weak in temporal granularity.
The sample survey is slow, consisting of complicated design, collection, and processing steps. Because of these steps, most of the information we have on the US population is based on measures repeated annually, with only a few repeated monthly. Yet, as our world filled with social media input has taught us, events affecting almost every social and economic phenomenon are happening at breakneck speed. Many policy makers thirst for information about the “now,” not the “then.”
3. Surveys are weak on subpopulation granularity.
As developed countries become more diverse through immigration, it’s more important for them to have up-to-date information on individual groups. Sometimes these groups form very small subsets of the full population. National sample surveys have trouble producing strong estimates for small groups, because only a handful of their members fall into any given sample.
4. Surveys are strong on measurement capacity.
Surveys captured the attention of empirical social scientists because they permit researchers to design the data that they analyze. In doing so, the researcher often builds into the measurement multiple indicators of different attributes of interest on the persons studied. These multiple measures feed the multivariate statistical models of the scientists as they seek to gain insights into the phenomena of interest.
5. Surveys are weak on measuring networks.
Sample surveys have usually been built on the selection of individual units (i.e., persons, families, and organizations). Measures are taken on the sampled individuals, not on the people connected to them. But increasingly social scientists have become interested in how people are influenced by those around them: their families, their neighbors, people in their church/synagogue/mosque, and co-workers. Explaining human behavior depends in part on knowing what “connected others” are doing.
6. Surveys are strong on inferential frameworks.
Finally, samples of large populations selected with known probabilities offer strong conclusions about the full population. This statistical property has led to confidence in the findings of surveys to describe basic attributes of national populations, permitting confident decisions based on them.
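The tension between points 3 and 6 can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration under stated assumptions: it treats the survey as a simple random sample (real national surveys use complex designs, which usually make matters worse), and the sample size of 60,000, the subgroup share of 0.5 percent, and the estimated rate of 5 percent are all hypothetical numbers chosen for the example. It uses the textbook 95 percent margin of error for a proportion, 1.96 * sqrt(p(1 - p) / n).

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated
    from a simple random sample of size n (assumption: no design effect)."""
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical numbers, chosen only for illustration.
national_n = 60_000        # persons in the national sample
subgroup_share = 0.005     # subgroup is 0.5% of the population
p = 0.05                   # rate being estimated (e.g., 5%)

# Expected number of subgroup members who land in the sample.
subgroup_n = round(national_n * subgroup_share)

moe_national = margin_of_error(p, national_n)
moe_subgroup = margin_of_error(p, subgroup_n)

print(f"national sample: n={national_n}, margin of error = {moe_national:.4f}")
print(f"subgroup sample: n={subgroup_n}, margin of error = {moe_subgroup:.4f}")
```

Under these assumptions the national estimate of a 5 percent rate carries a margin of error of roughly 0.2 percentage points, while the same estimate for the subgroup carries roughly 2.5 percentage points, about half the size of the rate itself. The inferential framework still holds for the subgroup; there are simply too few sampled members for it to say much.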
In sum, most social science is built on comparisons over space, time, subpopulations, measurements, and networks, and confidence in its conclusions is enhanced with proper statistical samples of the population being studied. The above arguments show that surveys are relatively weak on four of these six dimensions: spatial, temporal, and subpopulation granularity, and the measurement of networks.
As the apparent speed of modern life increases, timely information seems more valuable. As the diversity of societies increases, local and subpopulation statistics seem more valuable. As we gain more insight into social networks, we increasingly seek to measure the effects of people's social context.
The data world that we need to construct to understand human behavior more fully needs to be sensitive to these six needs. The “big data” world offers near-real-time data from some sources, coverage of wide spaces and populations (but without any documentation of what’s being missed), data sets that are much less multivariate than survey data, data that are not always geospatially identified, and data that often have network structures. The big data world therefore tends to be weak on three of the six dimensions.
Our job is to put the old data world of surveys together with the new data world of “big data.”