I am going to take a small break from my writing on the theory of everything to address a few thoughts I had this week about the uses and abuses of big data. Why have a blog if you can’t write about what is on your mind? I may return to this topic to further flesh out some of my ideas.
I was sent a paper this week addressing the uses and misuses of big data in medicine. I am not going to link to the paper for several reasons. First, I am not a medical doctor, so the depth of clinical argument in the paper is beyond my training as a theoretical physicist, and I try to refrain from commenting on detailed arguments outside my field. Second, I had many disagreements with the paper’s arguments about big data, and although it was published more than two years ago, I did not want to write a detailed critique or call out the author. Finally, I am mostly interested in using the ideas I had while reading this paper as a springboard into my own thoughts about big data in our world and, in particular, in the world of physics.
The only comment I will make is to remind people of a term in computing called GIGO, or “garbage in, garbage out.” I don’t think anyone will seriously argue that simply “throwing computers at a problem,” no matter which methods are employed, will always lead to sensible solutions. Good methods can be brought to bear on ill-defined problems, and well-defined problems can deny progress to the wrong methods. Big data is no exception. Cherry-picking situations where data scientists needed more input on the meaning and constraints of their data sets, and presenting them as evidence that big data methods are flawed or somehow ill-suited to whole classes of problems, such as medical ones, seems to me to make the opposite argument.
The question that came to mind is, “Are we really living in the Age of Big Data?” At first, this question seems ridiculous and obvious. I can hear you now: “Of course we are. Big data touches literally every aspect of modern life. Economics, retail shopping, government, science, sports, technology. You name it, somewhere someone is applying big data techniques to the problem.” Fair point. But I would also argue that the day after the invention of the steam engine would have been too soon to say we had entered the industrial revolution, and the information age was more than just the invention of the transistor.
We are at an inflection point with big data.
Famously, the author William Gibson said, “The future is already here — it’s just not evenly distributed.” Clearly, there have been some breakout uses of big data as of this writing (no news there), and I would imagine that in another generation we will wonder what life was like without it. That sentiment, by the way, cuts in many directions. There will also be those of us who yearn, for a variety of reasons, for a time before big data. For instance, much of the history of science fiction has technology being used to constrain as much as to propel. Once the production, collection, and distribution (or sale) of big data sets becomes a foundational part of the data ecosystem, these sorts of questions will be moot. It is not difficult to envision a future in which the creation of custom data sets is the primary function of our interactions with the world.
I started thinking about this while reading because the medical article referenced another article called “The Unreasonable Effectiveness of Data,” which made me chuckle. That title is a play on the title of a very famous paper in physics presented by the great Eugene Wigner at NYU in 1959, “The Unreasonable Effectiveness of Mathematics in the Natural Sciences.” The original physics paper is a thoughtful and careful discussion of the so-called “miracle” that mathematics describes the laws of physics so well. Because of this, Wigner’s paper is required reading in many courses on the philosophy of science and the foundations of mathematics. Read it; it’s worth your time.
The subsequent big data paper was written by three researchers at Google in 2009, including the great Peter Norvig. (My apologies for not knowing the other authors beforehand, but as I have said, I am a theoretical physicist and not a computer scientist; Norvig’s books on artificial intelligence made him known to me long before I read this particular paper.) In it, the researchers put forward ideas for using large, unstructured, unlabeled, and noisy data sets to build high-quality models. Although the term “big data” was in academic use before this paper, and I have no real interest in tracking the etymology of the term or how it is used today, it seems that research into what would be recognizable as modern big data really started to take off around this time.
Which brings me to my point: is there really an unreasonable effectiveness of data? I am going to say, with no hint of irony, that it depends on what we mean by data. For instance, everything we know about the universe, literally every measurement ever made, forms perhaps the largest of data sets, and it has so far been unreasonably effective for understanding the nature of reality. Particle colliders like the Large Hadron Collider at CERN produce immense data sets that are routinely mined in model-dependent and model-independent ways.
Now, I can hear you objecting again: “That’s clearly not what we mean by big data.” Again, I understand. So I ask you: for whatever problem you are interested in, is there a hypothetical data set that could be used to find what you are looking for? I would be surprised if the answer were no. (And if it is “no,” then I wonder whether your question is a scientific one, or even an empirical one.) I think the dividing line between the reasonable and unreasonable effectiveness of data lies in what can be harvested from the sources currently available. As our world begins to organize itself around big data, access to data sets unimaginable in the past will become commonplace. In terms of the industrial revolution and the information age, the age of big data has only just begun, so I ask again: where is the unreasonable line to be drawn?
No, I don’t think the problem is unreasonably effective data; rather, it is extracting data in a way that is as effective as it is ethical. It is about creating an age of ethical big data, in which we can safely and anonymously create, mine, and distribute data to understand our world and build a better one.