As we move from the world of designed data (through surveys, censuses, and administrative forms) to one of self-generated “big data,” or more organic data, we are excited about completely new ways to describe our world, the behaviors of humans, and the activities of organizations. This will be a more complicated data world than the one that forms the basis of most current social and economic statistical indicators. Most importantly, the analysts who produce statistical descriptions in this new world are unlikely to control many of the processes that generate the data.
In the old world of designed data through surveys, a common model of the production of data involves four steps: 1) comprehension and interpretation of the question being posed, 2) retrieval from memory of information relevant to the question’s perceived intent, 3) formation of a judgment regarding the retrieved memory and the question’s intent, and 4) delivery of a response. The framework begins with the stimulus of a question being posed to a respondent; this is the proximate cause of the provision of the datum (i.e., the answer to the question). This framework became valuable with designed data because it focused attention on weaknesses in the wording of questions (affecting the comprehension stage), biases in memory retrieval, and biases in judgment (e.g., reluctance to reveal socially undesirable attributes).
Our new world of data contains data created without a uniform stimulus such as a survey question. Let’s imagine we’re using Google search terms as a data set. Let’s say we’re interested in estimating the number of persons searching for work by examining such search terms. Not unlike the logic of Google Flu, we plan to code the text strings of searches as more or less relevant to employment search behavior and create an index of employment search based on the coding. There are many interesting problems here.
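To make the idea of coding search strings and building an index concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the phrases, their relevance weights, the function names, and the toy search log are all hypothetical, and a real index would rest on validated coding rather than a hand-written phrase list.

```python
from collections import defaultdict

# Hypothetical phrases and weights (1.0 = clearly job-search related).
# These values are illustrative assumptions, not validated codes.
RELEVANCE_WEIGHTS = {
    "job openings": 1.0,
    "jobs near me": 1.0,
    "resume template": 0.5,
    "unemployment benefits": 0.5,
}


def relevance(query):
    """Return the highest weight among the phrases found in the query, or 0.0."""
    text = query.lower()
    return max(
        (weight for phrase, weight in RELEVANCE_WEIGHTS.items() if phrase in text),
        default=0.0,
    )


def employment_search_index(log):
    """Average the relevance of all queries within each period.

    `log` is an iterable of (period, query) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for period, query in log:
        totals[period] += relevance(query)
        counts[period] += 1
    return {period: totals[period] / counts[period] for period in counts}


# Toy search log, invented for this example.
toy_log = [
    ("2024-01", "nursing job openings in Ohio"),
    ("2024-01", "weather tomorrow"),
    ("2024-02", "free resume template"),
    ("2024-02", "Jobs near me"),
]
index = employment_search_index(toy_log)
print(index)  # {'2024-01': 0.5, '2024-02': 0.75}
```

Note that an index of this kind scores only the text of each query. It carries no information about who issued the search or what prompted it.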
Let’s examine the possible behaviors of persons who are actually seeking employment. What behaviors on their part will generate a Google search? We might speculate that some would be seeking a well-known (to them) electronic job listing service and using Google instead of a direct URL to reach the job-listing site. Others would avoid Google and go directly to such a site.
Still others may be seeking information for alternative employment opportunities, using the search tool to locate such sites. For example, they might be looking at the job listings of specific companies on the companies’ websites.
Still others may be doing their search via personal networks they maintain, using word of mouth as a source of information on relevant job openings. No Google search would be required.
As these examples show, Google data would systematically miss some employment search behaviors.
Let’s say we switch the data source to Twitter, attempting to use it to track the same phenomenon: what’s the current volume of job search activity in the country?
Let’s again focus on the behavior of Twitter subscribers who are searching for work. As subscribers, they emit data in the form of tweets, retweets, and choices of whom to follow, among others. If they are engaged in job searches, what would prompt them to issue a tweet about their search? Would they be more likely to tweet information that reflects positively on their job search (e.g., “I have a job interview”) than other information (e.g., “I’m losing hope at ever finding a job”)? Such filtering of tweets, motivated by an attempt to manage self-presentation, might be viewed as akin to the social desirability bias in answering survey questions concerning embarrassing attributes.
Now, let’s change the focus to those who are not looking for work. Some may enter Google search terms about employment on behalf of an unemployed family member or friend; some have an ongoing interest in what jobs are open in their field. Their search terms might be precisely the same words as those used by people seeking work.
These are simple examples that force attention to the process that generates the data. In some circumstances the behaviors we are attempting to monitor may not be recorded in the big-data source, or might be recorded only under special circumstances. In survey data, much attention was paid to the formation of the data stimulus (the question), in an attempt to understand the meaning of the resulting data. In this new world we need similar attention to identifying the stimulus for the data production. In this case, the stimulus will tend to be unobservable from the data themselves, but building conceptual frameworks that force careful consideration of alternative stimuli is key. Only then can we interpret both the value and the foibles of the organic data.