An old joke describes a severely inebriated man found on his knees under a streetlight, searching the ground. A passer-by asks him what he is doing. The man says that he lost his keys and is looking for them. The passer-by asks if he’s sure he lost them there. The man says no, he thinks he lost them in the park nearby, but the light was so much better where he was looking.
The story fits some fields of academic inquiry in the sciences and social sciences. For example, much of the empirical work in macroeconomics relies on aggregate data produced by private enterprises and government statistical agencies. The field is somewhat removed from the production of those data. Hence, some of the literature focuses on the gap between theoretical concepts and the data available to reflect them. There were, however, few alternatives. Those data were the streetlight for the field.
In contrast, other sciences collect original data. They conceptualize what observations must be taken to test alternative ideas, how to design the measurement, how to construct instruments to carry it out, and then how to process the data to address the research questions they pose. By collecting the observations directly, they learn the fallibilities of the measurements. By measuring features of the phenomena, they become more sophisticated about the mechanisms producing the phenomena.
Recent events have highlighted this distinction between science based on original measurement and science that starts with data produced by others. Implicit biases discovered in machine-learning-based algorithms are reaching the popular press. The algorithms under scrutiny were built on the data sources available at the time. That was their streetlight. What went unanticipated was poor performance on phenomena that were not part of the original data set. For example, the misidentification of persons of color in facial recognition seems to be related to the rarity of images of persons of color in the training data set for the algorithms.
Algorithms that guide loan risk or health risk assessment can fail if the data sets do not contain measures of all the attributes that affect risk. The best data come from deep understanding of the mechanisms that affect the likelihood of loan default or health conditions requiring medical interventions.
Some of the data sets used as the basis of the machine learning were very large in the number of distinct units observed. But the total size of a data set is of little relevance if its characteristics do not match the real-world phenomena the algorithm will face.
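The point about size can be illustrated with a small simulation. The sketch below is only an illustration under invented assumptions: the two groups, their outcome rates (20% and 80%), the sample sizes, and the function names are all hypothetical and do not come from any real data set. It compares a very large sample that over-represents one group with a small sample whose mix matches the real world.

```python
import random

# Hypothetical outcome rates for two groups (invented for illustration).
RATES = {"A": 0.2, "B": 0.8}


def draw(n, share_a, rng):
    """Draw n outcomes (0 or 1); share_a is the chance a unit comes from group A."""
    outcomes = []
    for _ in range(n):
        group = "A" if rng.random() < share_a else "B"
        outcomes.append(1 if rng.random() < RATES[group] else 0)
    return outcomes


def outcome_rate(outcomes):
    """Share of units in the sample with outcome 1."""
    return sum(outcomes) / len(outcomes)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed real world: the two groups are equally common.
true_rate = 0.5 * RATES["A"] + 0.5 * RATES["B"]

# The "streetlight" sample: very large, but 95% of units come from group A.
convenient = draw(100_000, share_a=0.95, rng=rng)
# A much smaller sample whose group mix matches the assumed real world.
designed = draw(500, share_a=0.50, rng=rng)

convenient_error = abs(outcome_rate(convenient) - true_rate)
designed_error = abs(outcome_rate(designed) - true_rate)

print(f"true rate: {true_rate:.3f}")
print(f"convenient sample (n=100,000) error: {convenient_error:.3f}")
print(f"designed sample (n=500) error: {designed_error:.3f}")
```

Under these assumptions the large sample settles on an estimate far from the true rate, and adding more units drawn the same way would not move it closer. The small sample is noisier but centered on the right value.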
Data scientists are increasingly realizing that building sophisticated algorithms on weak data is problematic. Faced with the choice between unsophisticated algorithms derived from rich data that describe the mechanisms affecting some outcome of interest and sophisticated algorithms based on weak data, they’re arguing that better data sets are a faster route to impact.
To build more useful data sets, we need data scientists’ attention to the measurement step producing the data, as well as the analytic step. Depending only on convenient, bright streetlights may fail us in locating the keys.