Address

ICC 650
Box 571014

37th & O St, N.W.
Washington, D.C. 20057

Contact

Phone: (202) 687.6400

Email: provost@georgetown.edu

 

During the Incubation of a Blended Data World

It’s a great time for data lovers. Each day we see a new set of findings from naturally occurring data sources. Some of these are from social media; some from digitized financial transaction records; some from web-scraping.

We all know what we don’t know about these sources. We don’t know what portion of all persons/units/events studied are covered by the source. We don’t know whether the intentions of the data producer are aligned with the use of the data by the analyst. We don’t know the causes of missing data among those covered by the source.

For example, using the trend over time in particular phrases in web searches (e.g., “support Trump”) as an indicator of the salience of issues to the population suffers from such problems. Who is “off the grid”? Who is interested in the issue but did not search for related phrases? Who is uninterested in the issue but did search for the phrases? Who entered the phrases multiple times?

As we move into the era with more uses of these “naturally occurring” or organic data, we owe it to ourselves to build up a repertoire of quality checks. Since there are no real theories that allow measurement of the accuracy of such statistics, we have to rely on evidence outside the data source itself. No examination of the organic data themselves (e.g., variation over regions) will provide quantitative measures of the threat of bias or other inaccuracies of the statistics.

Instead, we owe it to ourselves to appeal to other data sources that could produce similar statistics (e.g., trends over time in contributions to the candidate, polling data). Confidence in the results from the organic source of data is earned when they are not in violent disagreement with other sources of information.
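One rough way to make such a comparison concrete is sketched below in Python. It assumes two series measured over the same periods, one from the organic source and one from an independent benchmark such as polling. The function names and the numbers in the usage example are hypothetical placeholders, not real data. The sketch compares period-to-period changes rather than levels, since the two sources are usually on different scales. Agreement here only shows the absence of violent disagreement; it does not measure the accuracy of either source.

```python
import math


def changes(series):
    """Period-to-period differences of a series."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def pearson(xs, ys):
    """Pearson correlation; returns None if either input has no variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def compare_trends(organic, benchmark):
    """Compare the changes in two series covering the same periods.

    Returns the correlation of the changes and the share of periods
    in which both series moved in the same direction.
    """
    if len(organic) != len(benchmark):
        raise ValueError("series must cover the same periods")
    if len(organic) < 3:
        raise ValueError("need at least three periods")
    d_org = changes(organic)
    d_ben = changes(benchmark)
    same_direction = sum(1 for x, y in zip(d_org, d_ben) if x * y > 0)
    return {
        "change_correlation": pearson(d_org, d_ben),
        "direction_agreement": same_direction / len(d_org),
    }


if __name__ == "__main__":
    # Hypothetical placeholder values, for illustration only.
    search_index = [10, 12, 15, 14, 18]
    poll_share = [100, 104, 110, 108, 116]
    print(compare_trends(search_index, poll_share))
```

A low correlation or low direction agreement is a prompt to investigate coverage and measurement differences between the sources, not a verdict on which one is wrong.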

Dealing with such macro-level comparisons over time will from time to time reveal the underlying weaknesses of statistics based on naturally occurring data. For example, the breakdown of Google Flu Trends as a predictor of the official statistics was such an event. We should embrace these moments. They permit studies of how multiple sources of data differ and when they are likely to send different signals about the underlying phenomenon.

There are many who now work toward a new paradigm of social and economic statistics, one in which statistical blends of multiple data sources are the basis of common indicators. That paradigm requires more thorough knowledge of the differences in the coverage of the population and the nature of measurement processes across data sources. Assuming that coverage and measurement are equivalent across data sources will be dangerous. Blending them properly requires knowledge of both how they are strong and how they are weak.

The days when a single data source deserves our complete confidence are limited in much social and economic measurement. The future will be won by those deeply attuned to the error properties of the data being blended. This can start now, by careful comparisons of conclusions reached from independent sources of data.

2 thoughts on “During the Incubation of a Blended Data World”

  1. Very interesting and important comment on “naturally occurring data”. We do need to use other means to check that data out. Your discussion reminded me of what I would tell parents of my child patients when they pushed for “naturally occurring substances” which might be safer than prescription medicines even though there are not good studies to show the efficacy and safety of those natural substances. My thought for them to ponder was that not all that is natural is necessarily good. I would tell them that cyanide is a “naturally occurring substance”. But you wouldn’t necessarily give your child that over a well-studied medicine, would you? I know my analogy is not so good but I thought of this in your caution that we need to somehow study in other ways to look deeper into those “naturally occurring” studies. Just a somewhat loose thought but maybe not a totally weak analogy!

  2. I agree we should not just rely on a single source of data but we should use different sources to measure the accuracy of the statistics we get.

    Therefore the so-called “repertoire of quality checks”, as you have put it, will be more necessary. Relying on evidence outside the data source itself is always a good way of determining the quality, accuracy and authenticity of the data.
    For example, a Twitter trend should be measured against a Google trend. If the two correlate, then we can say the statistics are accurate.

