Address

ICC 650
Box 571014

37th & O St, N.W.
Washington, D.C. 20057

Contact

Phone: (202) 687.6400

Email: provost@georgetown.edu

 

Measuring versus Harvesting Data: Implications for Quality

One of the wonderful features of modern life is that we are surrounded by data. Data come at us from various internet-related sources, from transaction files in service industries, and from devices increasingly serving our homes.

This wealth of data has become an increasing focus of social and economic scientists. “Big data” is a phrase that seems to be everywhere.

Researchers who invent measurements in order to advance understanding in their fields have developed sophisticated frameworks that describe the quality or error properties of measures. Although the terminology varies across disciplines, all measurements are seen to have properties of undesired variability (“noise,” “unreliability,” or “imprecision”) and properties of systematic bias.

As studies of error properties of data have matured, most fields have become attracted to a notion of “fitness for use.” This phrase means that the impact of data error sources on the analytic conclusions is a function of the use itself. You and I could use the same data for two different purposes; your use could be unaffected by various error properties; and mine could be devastated by them. For example, simple arithmetic means may be sensitive to some errors in data that do not affect correlation coefficients. Economists speak of the “concept-measurement” gap. Depending on what concept you wish a given datum to reflect, different mistakes can be made.
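To make the mean-versus-correlation point concrete, here is a minimal sketch in Python. The numbers, the variable names, and the size of the bias are all hypothetical, chosen only for illustration. The sketch assumes one particular kind of error, a constant overstatement added to every reported value, and shows that it moves the mean while leaving the correlation with another variable unchanged.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data (thousands of dollars); not taken from any real source.
true_income = [30, 42, 55, 61, 78]
spending = [22, 30, 41, 40, 60]

# Assumed error: every respondent overstates income by the same amount.
BIAS = 5
reported_income = [x + BIAS for x in true_income]

# The mean moves by the full amount of the bias...
mean_shift = mean(reported_income) - mean(true_income)

# ...but the correlation with spending does not change, because
# correlation depends only on deviations from each variable's mean.
r_true = pearson(true_income, spending)
r_reported = pearson(reported_income, spending)

print(f"shift in mean: {mean_shift:.2f}")
print(f"correlation, true income:     {r_true:.4f}")
print(f"correlation, reported income: {r_reported:.4f}")
```

An analyst estimating average income would be misled by this error; an analyst studying how income and spending move together would not. Other kinds of error, such as random noise, behave differently, which is exactly why quality has to be judged against the intended use.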

How does all this relate to our new world of ubiquitous data?

When your job is to invent the measurement or to design the instrument, you cannot avoid thinking about how the design of the measurement may be fallible. When you’re harvesting existing data, you’re often unaware of the processes that produced the measurements. There is nothing to force you to attend to these data properties when you analyze the data.

These days, data are commonly used without sensitivity to their error properties. Google Flu Trends depends on the relationship between search terms (e.g., “achy shoulders” or “runny nose”) and diagnosed influenza. When the relationship between the two changes, for example because of heavy media coverage of influenza, Google Flu Trends’ ability to predict changes in real flu cases is itself altered. Insensitivity to these error properties of the indicator can lead to mistakes.

For “big data” to describe the attributes of large human populations well, we all have to ask whether all members of the population are covered by the data system being analyzed. For example, if Twitter data are used, what kinds of people are and are not active on Twitter? Further, what types of Twitter subscribers choose to use tweets to record a given attribute (e.g., job loss), and which don’t?
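The coverage question can be illustrated with simple arithmetic. The sketch below uses two made-up groups; the group labels, population shares, activity rates, and job-loss rates are all assumptions invented for the example, not estimates of any real platform. It also assumes, generously, that everyone active on the platform reports a job loss accurately.

```python
# Hypothetical groups:
# (label, share of population, fraction active on platform, true job-loss rate)
groups = [
    ("younger adults", 0.40, 0.50, 0.06),
    ("older adults", 0.60, 0.10, 0.02),
]

# Job-loss rate in the whole population.
population_rate = sum(share * rate for _, share, _, rate in groups)

# Fraction of the population the platform covers at all.
covered = sum(share * active for _, share, active, _ in groups)

# Job-loss rate among covered people only (perfect reporting assumed).
platform_rate = (
    sum(share * active * rate for _, share, active, rate in groups) / covered
)

print(f"population job-loss rate: {population_rate:.3f}")
print(f"rate seen on platform:    {platform_rate:.3f}")
print(f"share of population covered: {covered:.2f}")
```

Under these assumed numbers the platform overstates the population rate even though every individual report is accurate, simply because the group with the higher rate is more heavily represented among users.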

In addition, we have to ask the question of whether the data harvested from the data system are exactly equivalent to the phenomenon we wish to measure. For example, if my tweet has the words “fired” and “job,” I could have said “I just got fired from my job” or “I’m fired up about my job,” or “I fired her from the job,” or “I never want to get fired from my job.” Transforming words into quantitative indicators always entails some slippage of measurement. Whether the slippage hurts one’s analysis depends on the purpose of the analysis.
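The slippage in the tweet example can be shown with a few lines of Python. This is a deliberately naive sketch: the function name `mentions_job_loss` and the keyword rule are hypothetical stand-ins for whatever text-coding rule an analyst might apply, not a description of any real system.

```python
import re

# The four example tweets from the paragraph above.
tweets = [
    "I just got fired from my job",
    "I'm fired up about my job",
    "I fired her from the job",
    "I never want to get fired from my job",
]

def mentions_job_loss(text):
    """Naive indicator: True if the text contains both 'fired' and 'job'."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {"fired", "job"} <= words

flags = [mentions_job_loss(t) for t in tweets]
flagged = sum(flags)

for tweet, flag in zip(tweets, flags):
    print(f"{flag!s:5}  {tweet}")
print(f"{flagged} of {len(tweets)} tweets flagged")
```

The rule flags all four tweets, yet only the first reports the writer's own job loss. Whether three false positives out of four matter depends on the use: they would badly distort a count of job losses, but might matter less if the goal were only to find tweets that talk about employment at all.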

An extreme statement of “fitness for use” implies that there is no such thing as data quality without a specific use. Data without a user have no quality attributes. I’m happy with that formulation as long as each data user accepts the burden of critical review of the mismatch between the data and his/her analytic use of the data.

“Big data” without careful attention to properties of the data can produce big mistakes. As more and more researchers analyze data that they had no role in producing, we need more care, not less.

4 thoughts on “Measuring versus Harvesting Data: Implications for Quality”

  1. I enjoy this post because it raises all sorts of questions beyond those explicitly addressed. To have the use in mind when even thinking about “harvesting” data reminds me of the role of the observer in quantum physics. Of course we cannot continue to think as if we were living in a universe governed by classical physics. — You raise a methodological question. — What is sorely missing, however, is the ethical question that relates to the availability, (ab)use, and ubiquity of data gathering (harvesting — hmm, that might be a word that means more than it wants to say.) I would hope that at Georgetown, the ethical questions would take a significant position in the discernment on these issues and not a back seat. Does the Kennedy Center have much to say on this? I would hope that we would become leaders in that questioning as much as in the appropriate methodological utilization of big data.

  2. Another excellent post with several key points, thank you. I was reminded of one of my favorite quotes related to the topic:

    “Big Data’s strength is in finding associations, not in showing whether these associations have meaning.”

    which comes from an excellent recent article on big data challenges:

    Khoury MJ & Ioannidis JPA. Science 28 November 2014:
    Vol. 346 no. 6213 pp. 1054-1055

    Best
    Paul

  3. Very interesting, especially the Google Flu analysis. I believe my son had a classmate, Gu Computers ’04, who I think was later developing a system to make Google searches more accurate and meaningful. It seemed to be trying to address exactly the error problem with Google that you are discussing. So, this is very interesting. I don’t know where he went with that, but there was a young Hoya dealing with this issue back in the late 2000s.


