As social scientists increasingly encounter new data resources, especially those in the so-called “big data” realm, they face a new challenge: identifying the proper quality framework to use.
For some years, much of empirical social science was guided by a framework of inference from a sample-based data set to a large, well-defined population. The data, and the statistics derived from them, were evaluated through the lens of a “total survey error” framework, often presented in a chart of linked error components.
Some of this framework focused on quality properties that cause biases (consistent, systematic errors) between the population from which the sample was drawn and the sample generating the data. For example, if the data came from a web-based survey, the researcher had to estimate the impact of omitting persons without web access (the “coverage error” component of the framework). Would those without web access have given different answers to the survey questions than those with web access?
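The coverage-error reasoning above has a simple arithmetic form: the bias of an estimate computed only on the covered (web-access) population is the noncovered share times the difference in means between the two groups. A minimal sketch, with entirely hypothetical numbers:

```python
def coverage_bias(noncovered_share, covered_mean, noncovered_mean):
    """Standard coverage-bias identity:
    bias = W_noncovered * (Ybar_covered - Ybar_noncovered)."""
    return noncovered_share * (covered_mean - noncovered_mean)

# Hypothetical example: 15% of the population lacks web access, and they
# would answer a yes/no survey question "yes" at a 0.30 rate, versus 0.50
# for those with access.
bias = coverage_bias(0.15, 0.50, 0.30)
print(round(bias, 3))  # 0.03 -- the web-only estimate overstates the full-population mean
```

Note that the bias disappears only if the noncovered share is zero or the two groups would answer identically, which is exactly the question posed above.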
We’re now inundated with statistics from Twitter, Facebook, and other social media platforms, but few studies using such data ask whether nonsubscribers to those platforms would differ on the published statistics.
Worrying about biases in statistics due to missing observations, however, does not require a major renovation of the quality framework used in surveys.
A more important difference between so-called organic data (e.g., from social media) and designed data (e.g., from surveys) is that the researcher controls the observations in designed data but does not in organic data. Much of the survey error framework acknowledges possible mismatches between the desired target of measurement (e.g., the status of being employed) and the survey questions asked in the questionnaire (e.g., “Last week, did you do any work for pay?”).
With the new data resources available to researchers in the big data world, a different kind of measurement issue arises. The researcher is merely “harvesting” the “exhaust” of people as they live their lives. What’s in the exhaust is not controlled by the researcher. For example, what would motivate a tweet that says, “I lost my job today”? What type of Twitter subscriber who did indeed lose a paying job would choose to tweet this? What type of Twitter subscriber who lost a job would choose not to send such a tweet? If a subscriber is unemployed, what is the probability that he or she would tweet evidence of that status repeatedly during the unemployment spell? Would a subscriber ever send such a tweet despite being employed for pay? Do people holding multiple jobs behave differently than those who hold only one?
To construct a useful quality framework for such organic data, the researcher needs to tackle the question of why a person would choose to provide information on the platform. Understanding the motivation is key to knowing the signal-to-noise ratio in the data for a given phenomenon. The probability of creating such evidence must be known both for those who have the attribute and for those who do not.
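Why both probabilities matter can be shown with a toy calculation (all rates here are invented for illustration). Suppose a fraction `pi` of subscribers is truly unemployed, `p1` is the chance an unemployed subscriber tweets “I lost my job,” and `p0` is the chance an employed subscriber sends such a tweet anyway. The observed signal rate mixes the two, and it can be inverted only if `p1` and `p0` are known:

```python
def observed_signal_rate(pi, p1, p0):
    # Rate of "lost my job" tweets seen on the platform: a mixture of
    # true signals and false signals.
    return p1 * pi + p0 * (1 - pi)

def implied_prevalence(rate, p1, p0):
    # Inverting the identity above (standard misclassification correction).
    # Without p1 and p0, the raw signal rate alone is uninterpretable.
    return (rate - p0) / (p1 - p0)

rate = observed_signal_rate(0.06, 0.20, 0.01)          # approx. 0.0214
print(round(implied_prevalence(rate, 0.20, 0.01), 2))  # recovers 0.06
```

In this made-up example, a 6% unemployment prevalence yields only about a 2% tweet rate; reading the raw rate as a measure of unemployment would be off by a factor of roughly three.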
This kind of quality feature of big data cannot be measured within the big data set itself. Such biases aren’t corrected by having larger data sets; the errors stem from an inherent mismatch between the target of measurement and the processes producing the data.
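A small simulation (all numbers invented) makes the point concrete: when the signal mismatches the target, the error in the estimate is bias rather than sampling noise, so collecting more observations does not shrink it.

```python
import random

random.seed(1)
TRUE_PREVALENCE = 0.06          # hypothetical true unemployment rate
P_SIGNAL_IF_YES = 0.20          # hypothetical chance an unemployed person tweets it
P_SIGNAL_IF_NO = 0.01           # hypothetical chance an employed person tweets it anyway

def naive_estimate(n):
    # Treat "share of users who emit the signal" as if it measured prevalence.
    signals = 0
    for _ in range(n):
        unemployed = random.random() < TRUE_PREVALENCE
        p = P_SIGNAL_IF_YES if unemployed else P_SIGNAL_IF_NO
        signals += random.random() < p
    return signals / n

for n in (1_000, 10_000, 1_000_000):
    est = naive_estimate(n)
    print(n, round(est, 4), "error:", round(est - TRUE_PREVALENCE, 4))
# The estimate settles near 0.0214, not 0.06: the gap of roughly 0.04
# persists no matter how large n becomes.
```

The sampling noise shrinks as n grows, but the estimate converges to the wrong number; only knowledge of the signaling probabilities, gathered outside the data set, could correct it.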
Further, we don’t have language for this type of data quality feature. Candidates might be “the propensity to report an attribute,” “the likelihood of signaling,” or “match between the attribute and the signaling.” None of these are pithy.
Great care will be required in the move from data designed for a specific analytic purpose to data harvested from naturally occurring digital traces. We need serious discipline about big-data quality.