The best data analysts don’t trust data. They approach them with suspicion and disbelief. They probe them, looking for weaknesses. Only after repeated failures to expose fatal weaknesses do analysts feel safe attempting to extract information from the data. Even then, they are careful not to try to extract more information than the data are capable of providing.
Increasingly, social scientists will be using data that they themselves did not create. These organic or “big data” will come from sources that did not foresee their use by social scientists seeking to understand the attitudes or behaviors of individuals or groups. In contrast, when social scientists themselves were the creators of data sets, they generally chose a target population of substantive interest, designed the selection technique that determined which units were exposed to the measurement, carefully controlled the measurement, and introduced features that reveal the validity of the measurements. The researchers themselves controlled the documentation of the data properties (so-called metadata).
In the new world in which we live, social scientists will increasingly harvest data created by processes far removed from their control (or even beyond their access). Confronted with a data stream of unknown properties, how can they assess the quality of the data?
It’s likely that careful distinctions need to be made between observable and nonobservable errors. On the nonobservable side, no data set will reveal what is excluded from itself (e.g., which members of a target population can never be measured through the process). Insight into such biases comes only from comparing one data set to another.
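As a minimal sketch of one such comparison, suppose we have the age-group shares observed in a harvested data set and a benchmark distribution from an external source (say, published census margins); the variable name `age_group` and the numbers below are purely hypothetical:

```python
import pandas as pd

# Hypothetical age-group shares observed in the harvested data set.
observed = pd.Series({"18-34": 0.52, "35-54": 0.33, "55+": 0.15})

# Hypothetical benchmark shares from an external source (e.g., census margins).
benchmark = pd.Series({"18-34": 0.30, "35-54": 0.35, "55+": 0.35})

# Large gaps point to groups that the data-generating process
# under- or over-covers relative to the target population.
coverage_gap = observed - benchmark
print(coverage_gap.sort_values())
```

No single data set can produce the benchmark column on its own; that is precisely why the comparison has to reach outside the data.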
Other weaknesses of data can be revealed through careful examination of the data set itself. Patterns of missing data on items that should be present for a unit can reveal that one or more items are subject to weaknesses. Simple distributions of the values of quantitative measures can detect implausible outliers.
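A minimal sketch of these two checks, assuming a pandas DataFrame `df` with a hypothetical quantitative column `income`:

```python
import numpy as np
import pandas as pd

# Hypothetical example records; in practice, df would be the harvested data set.
df = pd.DataFrame({
    "income": [42000, 51000, np.nan, 48000, 9_900_000, 45000],
    "age": [34, 51, 29, np.nan, 44, 38],
})

# Share of missing values per item -- items with unusually high rates
# deserve scrutiny before any analysis.
missing_rates = df.isna().mean().sort_values(ascending=False)
print(missing_rates)

# Simple distributional check: flag values far outside the bulk of the data.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 3 * iqr) | (df["income"] > q3 + 3 * iqr)]
print(outliers)
```

Whether a flagged value is an error or a genuine extreme case still requires substantive judgment; the code only points to where to look.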
Some multivariate checks are possible when a data set has multiple observations on the same unit. Are logical relationships between multiple measures (e.g., age and date of birth) displayed in the data?
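A sketch of such a consistency check, using hypothetical columns `age`, `date_of_birth`, and an observation date:

```python
import pandas as pd

# Hypothetical records with both a reported age and a date of birth.
df = pd.DataFrame({
    "age": [34, 51, 29],
    "date_of_birth": pd.to_datetime(["1990-05-01", "1972-11-23", "2001-02-14"]),
    "observed_on": pd.to_datetime(["2024-06-01"] * 3),
})

# Age implied by the date of birth at the time of observation.
implied_age = (df["observed_on"] - df["date_of_birth"]).dt.days // 365

# Records where the reported age disagrees with the implied age by more
# than a year are candidates for measurement error.
inconsistent = df[(df["age"] - implied_age).abs() > 1]
print(inconsistent)
```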
Sometimes unreported disruptions in the measurement can be detected when there are time stamps on observations: does the pattern of characteristics over time reveal large discontinuities? Do the correlations between attributes vary erratically over time? Sometimes graphical displays of multiple variables reveal anomalies that are likely to be errors of observation. Sometimes creating arithmetic combinations of multiple variables reveals unlikely patterns.
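One hedged way to look for such discontinuities and erratic correlations, assuming time-stamped observations on two hypothetical attributes `x` and `y` (the level shift below is simulated to stand in for an unreported disruption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical daily observations with a deliberate level shift halfway through.
dates = pd.date_range("2024-01-01", periods=120, freq="D")
x = np.concatenate([rng.normal(10, 1, 60), rng.normal(25, 1, 60)])
y = x * 0.5 + rng.normal(0, 1, 120)
df = pd.DataFrame({"x": x, "y": y}, index=dates)

# Large jumps in the period-to-period change of a series suggest a discontinuity.
jumps = df["x"].diff().abs()
print(jumps[jumps > 5 * jumps.std()])

# Rolling correlation between attributes; erratic swings over time are a warning sign.
rolling_corr = df["x"].rolling(window=30).corr(df["y"])
print(rolling_corr.describe())
```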
“Big data” are likely often to be undocumented data. A wise analyst’s first step with such data is not to extract information from them, but to challenge them to justify their worth for any purpose. Even in the best of cases, however, such scrutiny will merely offer insights into observable errors, not those arising from a failure to contain information on important units in a target population.
Great post, as are all of them.
Frankly, every student in every college or university should be required to take a statistics course so they can understand how data may be manipulated or just poorly collected and/or analyzed by those with agendas.
Any chance we can get a statistics and an economics course requirement to replace the “diversity” requirement?
We talk about the importance of statistics (and other matters) here:
http://www.georgetownacademy.com/arguments-ideas/#Statistics
Working with the public and with government officials who are not trained in data, I found this to be one of the most difficult truths to effectively communicate.
We not only need to teach data literacy, but through that literacy empower everyday people (not just the privileged and college-educated) to ask healthy, skeptical questions of the data-driven conclusions that affect their lives.