It’s a great time for data lovers. Each day we see a new set of findings from naturally-occurring data sources. Some of these are from social media; some from digitized financial transaction records; some from web-scraping.
We all know what we don’t know about these sources. We don’t know what portion of all persons/units/events studied is covered by the source. We don’t know whether the intentions of the data producer are aligned with the analyst’s use of the data. We don’t know the causes of missing data among those the source does cover.
For example, using the trend over time in particular phrases in web searches (e.g., “support Trump”) as an indicator of the salience of issues to the population suffers from such problems. Who is “off the grid?” Who is interested in the issue but did not search for related phrases? Who is uninterested in the issue but did search for the phrases? Who entered the phrases multiple times?
As we move into the era with more uses of these “naturally occurring” or organic data, we owe it to ourselves to build up a repertoire of quality checks. Since there are no real theories that allow measurement of the accuracy of such statistics, we have to rely on evidence outside the data source itself. No examination of the organic data themselves (e.g., variation over regions) will provide quantitative measures of the threat of bias or other inaccuracies of the statistics.
Instead, we owe it to ourselves to appeal to other data sources that could produce similar statistics (e.g., trends over time in contributions to the candidate, polling data). Confidence in the results from the organic source of data is earned when they are not in violent disagreement with other sources of information.
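One simple form such a cross-source check can take is a correlation between the organic-data trend and an independent benchmark series. The sketch below is purely illustrative (the numbers are invented, not real search or polling data); a weak or negative correlation would flag the kind of disagreement that should temper our confidence.

```python
# Toy consistency check: compare a trend from an organic source (e.g.,
# weekly search volume) against an independent benchmark (e.g., weekly
# poll support). All numbers below are made up for illustration.

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

search_trend = [10, 12, 15, 14, 18, 22]   # hypothetical search volumes
poll_trend = [31, 33, 34, 33, 36, 40]     # hypothetical poll percentages
r = pearson_r(search_trend, poll_trend)
```

A high correlation here would not prove the organic source is unbiased, but a low one would be exactly the "violent disagreement" that demands investigation.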
Making such macro-level comparisons over time will occasionally reveal the underlying weaknesses of statistics based on naturally-occurring data. The breakdown of Google Flu Trends as a predictor of the official statistics was such an event. We should embrace these moments. They permit studies of how multiple sources of data differ and when they are likely to send different signals about the underlying phenomenon.
There are many who now work toward a new paradigm of social and economic statistics, one in which statistical blends of multiple data sources are the basis of common indicators. That paradigm requires more thorough knowledge of differences across data sources in population coverage and in the nature of the measurement processes. Assuming equivalence across data sources will be dangerous. Blending them properly requires knowledge of both where they are strong and where they are weak.
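To see why assumed equivalence is dangerous, consider the simplest textbook blend: inverse-variance weighting of independent estimates. This is a generic sketch, not a method proposed in the text, and the estimates and variances are invented; note that the formula silently assumes each source is unbiased and its stated variance trustworthy, precisely the assumptions that demand scrutiny.

```python
# Toy sketch: inverse-variance weighting of independent estimates of
# the same quantity (e.g., a survey estimate and an organic-data
# estimate). Estimates and variances below are illustrative inventions.

def blend(estimates, variances):
    """Combine independent estimates by inverse-variance weights.

    Assumes every source is unbiased and its variance is known --
    exactly what cannot be taken for granted with organic data, since
    an unmodeled coverage bias propagates straight into the blend.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    blended = sum(w * e for w, e in zip(weights, estimates)) / total
    blended_variance = 1.0 / total
    return blended, blended_variance

# A precise-looking organic estimate (variance 1.0) dominates a noisier
# survey estimate (variance 4.0), even if the organic source is biased.
est, var = blend([52.0, 48.0], [4.0, 1.0])
```

The point of the sketch is the failure mode: the weighting rewards low variance, so a biased-but-precise organic source can pull the blend away from the truth unless its error properties are modeled explicitly.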
The days when a single data source deserves our complete confidence are numbered in much social and economic measurement. The future will be won by those deeply attuned to the error properties of the data being blended. This can start now, with careful comparisons of conclusions reached from independent sources of data.