I recently read a news article about a household tragedy related to sloppy use of big data. The offending event involved a digital service that returned a street address for a given Internet connection’s IP address. In many cases an IP address cannot be mapped to a specific street address, only to a smallish geographical area, and some IP addresses cannot be mapped at all. For those, the mapping service chose a spot in the middle of the country, in Kansas, as the “default” position. Because unmappable IP addresses are so common, millions of them ended up mapped to that same default address.
One use of the service was apparently to assign street addresses to IP addresses suspected of involvement in criminal activity. Law enforcement agencies acting on those assignments then investigated possible criminal activity at the house. Understandably, this came as quite a surprise to the owners as wave after wave of agents from different agencies descended on their home. They’re suing the data-mapping firm.
What went wrong here, from a data ethics point of view? (By “data ethics” here I mean honest communication of what is known and not known about the data.) The mapping firm has many records for which a street address is unknowable. They face a choice. They could mark such a case with a code denoting their own lack of knowledge, thereby admitting that their information is incomplete for any purpose. Or, if they knew that the case lay within a specific country but not exactly where, they could use a code denoting exactly that (“inside the US, not known where”). Such a code would communicate both what they do know and the limits of what they can know.
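As a minimal sketch of what such explicit codes might look like in practice, consider the record format below. The field names, code values, and the IpLocationRecord class are my own hypothetical illustration, not the firm’s actual schema; the point is only that a record can state its own level of knowledge instead of inventing a point.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical precision codes for an IP-to-location record. These names
# are illustrative only, not any mapping firm's actual schema.
PRECISION_STREET = "street"          # resolved to a street address
PRECISION_CITY = "city"              # resolved only to a city-sized area
PRECISION_COUNTRY = "country_only"   # "inside the US, not known where"
PRECISION_UNKNOWN = "unknown"        # no location knowledge at all

@dataclass
class IpLocationRecord:
    ip: str
    precision: str                     # one of the codes above
    country: Optional[str] = None      # filled in only if actually known
    latitude: Optional[float] = None   # filled in only at street/city precision
    longitude: Optional[float] = None

# The honest answer for a record that can only be placed within a country:
# a country-level code and no coordinates at all.
honest = IpLocationRecord(ip="203.0.113.7",
                          precision=PRECISION_COUNTRY,
                          country="US")

# The practice described in the article amounts to this instead: a
# precise-looking point fabricated near the geographical center of the
# country, with nothing in the record to signal that it was fabricated.
imputed = IpLocationRecord(ip="203.0.113.7",
                           precision=PRECISION_STREET,
                           country="US",
                           latitude=39.8,
                           longitude=-98.6)
```

With codes like these, a downstream user (a law-enforcement analyst, say) at least has a chance to see that no street-level knowledge exists before acting on the record.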
Instead they chose to impute a specific location. Their imputation, however, was of the grossest type, probably the geographical center of the country. In the best of circumstances, this will be wrong for all but a very few cases. Perhaps the biggest irony of the story is that, after the lawsuit, the firm is reported to have changed its default location for records missing location data. It has reportedly chosen to impute into all those records a single location that sits in the middle of a lake! (One can only imagine what law enforcement agents will do with that information.)
All data have errors. In a colloquial sense, all data are wrong, but sometimes they’re useful for a given purpose. This happens when the data are well described and curated in a way that anticipates multiple uses. Further, the nature of the data must be communicated to users so as to discourage uses the data cannot support. Finally, users have a responsibility for appropriate use: to know what the data describe well, and to know the uses for which the data are ill suited. This requires some attention to detail.
The news story does not say what documentation was provided to users about the nature of the street address information, nor how sophisticated those users were. But harm was done to the owners of the default address, harm that could so easily have been avoided with practices common to research data sets. The fact that the “correction” of the default address was to choose a point in a body of water shows how much basic understanding of data ethics the data owner still lacks.
Big Data problems have been around as long as Big Data itself. You might be interested in this piece I wrote about an almost comical overabundance of data in the nineteenth century: http://blog.oup.com/2016/10/data-analysis-literature/
Advancements in data analysis have allowed business organizations to enhance their performance through careful analysis of the data they collect. But companies should be mindful of how they use this data. If they are not reasonably certain about the output derived from a specific piece of data, then they shouldn’t incorporate that output into their findings. A standard should be set for business organizations regarding the accuracy of the data.
The issue here seems to be the conflation of precision and accuracy. Perhaps there are reasons that these spatial data needed to be presented as geographical coordinates (e.g., to maintain consistency with database formatting), but it’s problematic to present spatially precise information when it isn’t accurate, even if the caveats are stated. I wonder if there are existing data standards to address this disconnect — a built-in way to present error in precision, for instance.
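Purely as a sketch of what such a standard might look like: a record format in which coordinates never travel without an explicit accuracy radius. The LocatedPoint class and its fields below are hypothetical, not an existing standard or product API.

```python
from dataclasses import dataclass

# A hypothetical record format in which precise-looking coordinates always
# carry an explicit statement of their accuracy.
@dataclass
class LocatedPoint:
    latitude: float
    longitude: float
    accuracy_radius_km: float  # radius within which the true location likely lies

    def is_usable_for(self, required_radius_km: float) -> bool:
        """Answer honestly whether this point is accurate enough for a use
        that needs the location pinned down to required_radius_km."""
        return self.accuracy_radius_km <= required_radius_km

# Country-level knowledge only: the coordinates look precise, but the huge
# radius makes the actual level of accuracy explicit.
country_level = LocatedPoint(latitude=39.8, longitude=-98.6,
                             accuracy_radius_km=2000.0)

print(country_level.is_usable_for(required_radius_km=0.1))  # False: not a house
```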
Just scary. I hope the big data people can address these very complex ethical issues.