When I was about nineteen years old and taking a first course in empirical social science, I was given a computerized set of data and documentation describing the data. The class was told to pose any question we found relevant to the data, construct an analysis, and describe in words the results.
The liberty to construct one’s own questions was alluring. My naiveté led me to the typical failure of the first-time analyst: examining all possible combinations of attributes to predict the outcome of interest. I still remember the pages and pages of analyses I produced. I let the software do my thinking. I had skipped an important step in inquiry: studying the results of past work relevant to my question. I hadn’t gone deep enough.
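To make that failure concrete, here is a minimal sketch in Python (the data are pure simulated noise, and the sizes and names are invented only for illustration): search enough attributes with no hypothesis and something will always appear to predict the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_attributes = 500, 20_000   # invented sizes, for illustration only

# By construction, the outcome and the attributes are pure, unrelated noise.
outcome = rng.normal(size=n_people)
attributes = rng.normal(size=(n_people, n_attributes))

# "Mine" the data: correlate every attribute with the outcome, keep the winner.
outcome_std = (outcome - outcome.mean()) / outcome.std()
attrs_std = (attributes - attributes.mean(axis=0)) / attributes.std(axis=0)
correlations = attrs_std.T @ outcome_std / n_people

best = int(np.argmax(np.abs(correlations)))
print(f"best of {n_attributes} attributes: r = {correlations[best]:+.2f}")
# Prints a correlation around 0.2, which would look "highly significant"
# in isolation, even though no attribute has any relation to the outcome.
```

Any attribute “discovered” this way tells you nothing about the next batch of data, which is roughly the trap I had fallen into.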
We are all seeing the results of similar superficiality in data analysis now. Extensive computational power coupled with very large data sets permits one to build, quite easily, predictive models of anything described by the data. Some of the models predicting behaviors, like choosing to click on a web-page advertisement, can contain hundreds of thousands of attributes (prior clicks, web pages visited, etc.) measured on millions of people.
Sometimes the models are tested against some gold-standard set of indicators. Can the predictive models match the benchmark indicator? Do they seem to track, over time, the real phenomenon they seek to predict?
There have been several well-known failures (e.g., Google Flu Trends’ predictions of the annual course of influenza). Testing predictive models against benchmarks is informative for the period over which the match is observed; however, the models can fail outside those benchmark conditions.
Some predictive models repeat the mistake of my nineteen-year-old self. I was merely seeking to predict some measured outcome. I had no idea of the processes underlying the phenomenon I sought to explain. I had no theory. I had no understanding of the mechanisms that produce the outcome. Models built under such circumstances can work … until they don’t work.
Social scientists draw a sharp distinction between the ingredients necessary for prediction and those necessary for causal inference. What actually produces the phenomenon that interests me? Without such understanding, predictive models usually suffer from a sin of superficiality. They are built without much thought and tested only under a limited set of circumstances. When the circumstances change, the model breaks down.
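A toy simulation makes the point vivid. In the Python sketch below (the variables, the scenario, and the numbers are all invented for illustration), a model learns from a proxy that merely co-varies with an unmeasured mechanism; it predicts well while that co-variation holds and collapses the moment it stops.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical setup: an unmeasured mechanism ("intent") drives both the
# outcome we care about (a purchase) and a convenient proxy (clicks).
intent = rng.normal(size=n)
clicks = intent + rng.normal(scale=0.5, size=n)     # proxy tied to the mechanism
purchase = intent + rng.normal(scale=0.5, size=n)   # outcome driven by the mechanism

# A purely predictive model regresses the outcome on the proxy alone.
X = np.column_stack([np.ones(n), clicks])
beta, *_ = np.linalg.lstsq(X, purchase, rcond=None)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print("R^2 while the proxy tracks the mechanism:",
      round(r_squared(purchase, X @ beta), 2))

# Circumstances change: a site redesign inflates clicks for reasons unrelated
# to intent, so the proxy decouples from the mechanism it used to track.
clicks_shifted = rng.normal(size=n)                  # no longer tied to intent
purchase_later = intent + rng.normal(scale=0.5, size=n)
X_shifted = np.column_stack([np.ones(n), clicks_shifted])
print("R^2 after the circumstances change:",
      round(r_squared(purchase_later, X_shifted @ beta), 2))
```

Nothing about the model changed between the two printed numbers; only the circumstances did.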
In a way, prediction without understanding key features of the causal mechanisms is like candy. There is an immediate gratification, but it can hurt you if you consume too much.
The lasting value of data-based models requires depth: depth of thinking about the outcome at hand, and depth of interacting with and observing the process. Usually this is qualitative, immersive activity. Sometimes the model builder can be assisted by those deeply conversant with the process, but it helps to have modelers who themselves know the process well.
Depth takes time. Depth requires different expertise than that required for data manipulation and algorithmic or statistical analyses. But depth has the payoff of model specification that is more robust to changing circumstances.
The lack of depth in the world of “big data” is more threatening, I fear, than in the world of data designed by the researcher. We can easily fall into a habit of “harvesting” data without thinking deeply about the outcome we are trying to predict. Since there is so much data, we assert, surely we can easily build a strongly predictive model that will serve us well for a long time. The problem with “hitching our entire wagon” to existing data alone, however, arises when the key mechanisms are not measured in those data. No amount of data will save us from the fate of missing the right attributes to measure (e.g., missing measurements of attitudes and other internal states that have not yet produced observable behaviors).
Masses of data that lack key measurements, and the thoughtless application of statistical techniques, are fatal temptations in this new world of big data. There is no shortcut around deep thinking about causal mechanisms.
Sometimes I wish advancement office professionals understood that principle. Prediction without analysis, and without a theory to support it, is not helpful in forecasting anything.
I think that another way of describing “deep thinking” is to inform statistical analyses of big data with our current understanding of the mechanistic processes that generated the data. It’s one thing to search through the seemingly unlimited number of attributes in big data; that’s data mining, plain and simple. It’s another thing to use big data to address scientific hypotheses. Unfortunately, I think there is a fine line between “starting out with a hypothesis” and “performing exploratory data analysis (including data mining) to refine your hypothesis”, and while these issues are not new, big data may well exacerbate them.
In a similar vein, interested readers might like to look at a short opinion piece I’ve written on the challenges of using spatial big data in infectious disease modeling and inference. The preprint is here: http://arxiv.org/abs/1605.08740. We don’t discuss this issue explicitly, but we do discuss other implications of the rising use of geo-tagged and other spatial big data for infectious disease inference, policy, and ethics.
Thanks for your post!
Elizabeth