As the observational sciences increasingly turn toward organic “big data” sources, much attention is being paid to how to analyze combinations of multiple data sets. A common goal is combining survey data with one or more other data sets. Many approaches to this problem attempt to identify common measurements in the different data sets. For example, if data set 1 contains measurements of the spatial location of the units measured, the analyst asks whether data set 2 contains similar measures. When it does, statistical models that take advantage of this “shared covariate” can be estimated.
In thinking about this effort, it is useful to recall the common uses of observational data on large populations (e.g., people, businesses). Typical analytic goals are comparisons over five different dimensions: 1) time, 2) space (e.g., political subdivisions), 3) population groups (e.g., gender, race, age groups), 4) levels of aggregation (e.g., persons, families, households, neighborhoods), and 5) measurement complexity (e.g., individual attributes, indexes). Statistical agencies routinely present their estimates organized by these different dimensions.
One characteristic of our new data world is that many organic “big data” sources are not designed to contain a large set of consistent measures. The content of tweets is entirely under the control of the subscriber; search terms are unstructured. Retail scanner data are consistent (time, retail outlet, product description, quantity, and price) but quite limited in the number of variables or attributes.
These data attributes have implications for the future of combining multiple data sets to produce statistical estimates.
Perhaps the most common measurement in the “big data” world is the time at which the data were created (e.g., when the tweet was sent, when the search term was entered, or when the product was purchased). The spatial location of the measurement might be the next most frequently occurring measurement.
This implies that mixing survey data and these new organic sources will most likely enrich statistical estimates of small geographical areas (exploiting shared spatial identifiers) and estimates of greater temporal granularity (exploiting the time stamps on observations). It also implies that the new data world is less likely to offer enriched estimates of small subpopulations because the organic data tend not to have such measures. (The exception might be social networks.)
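The linkage described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that organic records carry only an area code and a date, and that survey estimates are published per area and week; the function names (`week_start`, `aggregate_organic`, `link_on_shared_keys`) and the field names are hypothetical, not taken from any particular data source.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def aggregate_organic(records):
    """Count organic records per (area, week) cell.

    `records` is an iterable of (area, date) pairs: the only two
    attributes this sketch assumes the organic source provides.
    """
    counts = defaultdict(int)
    for area, day in records:
        counts[(area, week_start(day))] += 1
    return dict(counts)


def link_on_shared_keys(survey_rows, organic_counts):
    """Attach the organic count to each survey estimate that shares
    the same (area, week) key; cells with no organic records get 0."""
    linked = []
    for row in survey_rows:
        key = (row["area"], row["week"])
        linked.append({**row, "organic_count": organic_counts.get(key, 0)})
    return linked


# Hypothetical data, for illustration only.
organic = [
    ("A", date(2024, 1, 1)),
    ("A", date(2024, 1, 3)),
    ("B", date(2024, 1, 2)),
    ("A", date(2024, 1, 8)),
]
survey = [
    {"area": "A", "week": date(2024, 1, 1), "estimate": 0.42},
    {"area": "C", "week": date(2024, 1, 1), "estimate": 0.30},
]
linked = link_on_shared_keys(survey, aggregate_organic(organic))
```

Note that the join key contains only space and time. Because the organic records in this sketch carry no demographic attributes, nothing in the linked result can sharpen a contrast between population subgroups; any such variable would have to come from the survey side.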
In studies of human populations, surveys have excelled in providing insights into demographic subgroups. They provide statistical contrasts among racial and ethnic subgroups, groups differing on socioeconomic status, immigration status, educational attainment, and a host of other important social attributes. It seems unlikely that the new data world will offer enhanced contrasts on such groups, since the organic data tend not to have measures on such attributes.
The implication seems clear. Estimates in the new data world might offer enhanced temporal and spatial granularity, but little improvement for population subgroup contrasts. The survey data will carry the burden of measuring such contrasts, and added organic data will improve them only marginally.