One of the great advances in the empirical social sciences last century was the use of longitudinal surveys to disentangle some causality puzzles in key social and economic behaviors. For example, the Panel Study of Income Dynamics, which follows families over decades and asks about their income, wealth, and occupational status, helped inform welfare programs. It found that for many families poverty was not a persistent state plaguing them throughout a lifetime; it was often the result of key shocks (e.g., the death of a breadwinner or the loss of employment). If there were support systems to help the family at the moment of the shock, recovery occurred more quickly.
Mounting such longitudinal survey vehicles, however, is a very expensive proposition. They require following movers wherever they might transition, pursuing members of a family who split off from the initial unit, etc. Because survey data come from designed measurement, the researcher must contact and seek the time of sample persons to collect the data. As a result of the cost, there are few longitudinal surveys, and most measure sample cases once a year or once every two years.
With organic data, those data arising from the Internet, sensors, and social media, the researcher is not the initiator of the data generation process. Instead, the units themselves generate the data. Further, the data are “harvested” from their display/storage vehicle (a web page; a sensor data farm). This fact may have an important impact on the future of longitudinal measurement and the type of causal modeling that can be conducted.
In an earlier blog, I pondered what conceptual framework would be necessary to understand the measurement properties of various sources of organic data (e.g., Twitter data). Discerning the motivation for sending or not sending a tweet about a given personal attribute clearly requires more data about the Twitter subscriber.
The wonderful feature of much organic data is that they are ongoing, dynamic, and near real-time streams. If the initiator for the data can be identified (e.g., through a Twitter handle), then following the tweets of an individual over time is very low cost. The burden of building a longitudinal record on an individual is much less than mounting repeated longitudinal survey measurements of them.
With longitudinal harvesting of organic data on the same person, profiles of the individual can be built. Some profiles that might be useful include whether the user employs the specific medium in a narrow way (e.g., tweets only about political opinions or comments on friends and family members) or a broad way (e.g., tweets about daily observations of every kind). In essence, this might form an indicator of the background/common state of the data medium for that individual. When the individual behaves in an unusual way (e.g., tweeting on topics unusual for them), that can be interpreted within the context of their usual behavior. For example, comments on a scandal involving a political figure would be weighted differently for those who always comment on political events versus those who never comment on political events.
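To make the idea of a profile concrete, here is a minimal sketch in Python. It assumes each harvested post has already been assigned a single topic label (a hypothetical upstream step), and the function names, the example users, and the smoothing constant `alpha` are all illustrative choices, not an established method. It summarizes how narrowly or broadly a person uses the medium with a normalized entropy score, and measures how unusual a new post is relative to that person's own history.

```python
import math
from collections import Counter


def topic_profile(topics):
    """Return each topic's share of a user's past posts."""
    counts = Counter(topics)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def breadth(profile):
    """Normalized Shannon entropy over the topics the user has posted on:
    0 means a single topic, 1 means posts spread evenly across topics."""
    if len(profile) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in profile.values())
    return h / math.log(len(profile))


def surprisal(topic, topics, vocab_size, alpha=1.0):
    """Bits of surprise for a new post's topic given the user's history.
    Add-alpha smoothing (an assumption) keeps unseen topics finite."""
    counts = Counter(topics)
    p = (counts[topic] + alpha) / (len(topics) + alpha * vocab_size)
    return -math.log2(p)


# Two hypothetical users with 20 labeled posts each.
political_junkie = ["politics"] * 18 + ["family"] * 2
generalist = ["food", "sports", "weather", "family", "music"] * 4

VOCAB = 6  # assumed number of distinct topic labels in the coding scheme

for name, history in [("junkie", political_junkie), ("generalist", generalist)]:
    print(name,
          "breadth:", round(breadth(topic_profile(history)), 2),
          "surprise at a political post:",
          round(surprisal("politics", history, VOCAB), 2))
```

Under these assumptions, the same political comment carries a low surprise score for the user who always posts about politics and a high one for the user who never does, which is one way to express the "differential" weighting described above. Normalizing entropy by the number of topics a user has actually touched is only one option; normalizing by the full label set would penalize narrow users more heavily.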
The profile of an individual itself would become a dynamic property of the data enterprise, tracking both changes over time that were trends within the individual (e.g., the growth of concerns about the environment), as well as providing information about shocks (e.g., concerns about Tom Brady’s NFL case).
One wonderful feature of the new data world is temporally granular data. It will require us to think differently about what data to collect and how to use them to study change.
The role of McCourt’s “Massive Data Institute” in collecting and disseminating such data remains a work in progress….
Interesting. Lots of potential but also very scary.
The “organic” data Provost Groves describes would indeed seem to have great potential as a new resource for academic research. Unfortunately, the use and publication of data collected in such a manner and attached to particular individuals carries the risk of running afoul of privacy and libel laws, with the potential for liability to both the researcher and the University. I would urge anyone at Georgetown planning to use this type of data to consult with the University Counsel’s Office before proceeding. The circumstances involved dictate the need for caution.
Bill Kuncik, Georgetown MALS