One of the great advances in the empirical social sciences last century was the use of longitudinal surveys to disentangle some causality puzzles in key social and economic behaviors. For example, the Panel Study of Income Dynamics, which follows families over decades, inquiring about their income, wealth, and occupational statuses, helped inform welfare programs. It found that for many families, instead of poverty being a persistent state, plaguing one throughout a lifetime, it was often the result of key shocks (e.g., the death of a breadwinner or the loss of employment). If there were support systems to help the family at the moment of the shock, recovery occurred more quickly.
Mounting such longitudinal survey vehicles, however, is a very expensive proposition. They require following movers wherever they might transition, pursuing members of a family who split off from the initial unit, etc. Because survey data come from designed measurement, the researcher must contact and seek the time of sample persons to collect the data. As a result of the cost, there are few longitudinal surveys, and most measure sample cases once a year or once every two years.
With organic data, those data arising from the Internet, sensors, and social media, the researcher is not the initiator of the data generation process. Instead, the units themselves generate the data. Further, the data are “harvested” from their display/storage vehicle (a web page; a sensor data farm). This fact may have an important impact on the future of longitudinal measurement and the type of causal modeling that can be conducted.
In an earlier blog, I pondered what conceptual framework would be necessary to understand the measurement properties of various sources of organic data (e.g., Twitter data). Discerning the motivation for sending or not sending a tweet about a given personal attribute clearly requires more data about the Twitter subscriber.
The wonderful feature of much organic data is that they are ongoing, dynamic, and near real-time streams. If the initiator for the data can be identified (e.g., through a Twitter handle), then following the tweets of an individual over time is very low cost. The burden of building a longitudinal record on an individual is much less than mounting repeated longitudinal survey measurements of them.
With longitudinal harvesting of organic data on the same person, profiles of the individual can be built. Some profiles that might be useful include whether the specific medium is used by the user in a narrow way (e.g., tweets only about political opinions or comments on friends and family members) or a broad way (e.g., tweets about daily observations on all possible topics). In essence, this might form an indicator of the background/common state of the data medium for that individual. When the individual behaves in an unusual way (e.g., tweeting on topics unusual for them), that can be interpreted within the context of their usual behavior. For example, comments on a scandal involving a political figure would be weighted differently for those who always comment on political events than for those who never comment on political events.
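To make the idea concrete, here is a minimal sketch in Python of how such a profile might be summarized. Everything in it is an assumption for illustration: it supposes each tweet has already been assigned a single topic label from a fixed list (the hard part, which is not shown), and the function names topic_profile, breadth, and surprise are hypothetical. Breadth is measured with normalized Shannon entropy and unusualness with the surprisal of a topic under the person's smoothed history; these are just one reasonable choice among many.

```python
import math
from collections import Counter


def topic_profile(past_topics):
    """Share of a person's past tweets that falls under each topic."""
    counts = Counter(past_topics)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


def breadth(past_topics, n_topics):
    """Normalized Shannon entropy of the topic profile.

    Near 0: the person uses the medium narrowly (one topic dominates).
    Near 1: the person spreads tweets evenly over all n_topics topics.
    """
    profile = topic_profile(past_topics)
    entropy = -sum(p * math.log2(p) for p in profile.values())
    return entropy / math.log2(n_topics)


def surprise(past_topics, new_topic, n_topics, alpha=1.0):
    """Surprisal (in bits) of a new tweet's topic given the person's history.

    Add-alpha smoothing (an assumption) keeps never-seen topics from
    having probability zero. Larger values mean more unusual behavior.
    """
    counts = Counter(past_topics)
    p = (counts[new_topic] + alpha) / (len(past_topics) + alpha * n_topics)
    return -math.log2(p)


# Hypothetical histories: topic labels of each person's earlier tweets.
N_TOPICS = 5
political_user = ["politics"] * 18 + ["family"] * 2
sports_user = ["sports"] * 20
broad_user = ["politics", "sports", "food", "weather", "family"] * 4

for name, history in [("political", political_user),
                      ("sports", sports_user),
                      ("broad", broad_user)]:
    print(name,
          "breadth:", round(breadth(history, N_TOPICS), 2),
          "surprise of a politics tweet:",
          round(surprise(history, "politics", N_TOPICS), 2))
```

Under these assumptions, a tweet about a political scandal carries little surprise from the person who tweets about politics every day and a great deal from the person who has only ever tweeted about sports, which is the sense in which the same comment would be read against each person's usual behavior.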
The profile of an individual would itself become a dynamic property of the data enterprise, tracking changes over time that are trends within the individual (e.g., the growth of concerns about the environment) and also providing information about shocks (e.g., concerns about Tom Brady’s NFL case).
One wonderful feature of the new data world is temporally granular data. It will require us to think differently about what data to collect and how to use them to study change.