Sunday, July 26, 2020

Cherry Picking of Data: The Achilles' Heel of Bayes' Rule

We are awash in data.  In the age of Facebook/Amazon/Google, the amount of data generated, analyzed, and acted upon is beyond comprehension.  In a 2018 article, Forbes reported that 2.5 quintillion bytes of data were being generated every day.  A quintillion is a 1 with 18 zeros after it!  How do we even picture a quantity of that magnitude?  Here's a YouTube attempt at comparing a quintillion pennies to some familiar objects like football fields, the Empire State Building, the Sears Tower, and so forth.  If each penny were a single byte of data, THAT's the scale of data we would be generating each day with all of our Google searches (5 billion), our cell phones, our Snapchat / Instagram photos, our interconnected IoT devices, and so on.

But that was way back in the ancient world of 2018.  The pace is only accelerating!

Needless to say, humans are incapable of processing that much data, and so the very devices -- computers -- that generate this data on our behalf are also analyzing and acting on it on our behalf through machine learning.  Data Science is thus easily the most far-reaching and influential science of today.

Much of data science is based on the use of Bayes' Rule, a fundamental tenet of probability and statistics.  It says that the probability of a hypothesis being true given specific evidence is proportional to the product of two things:  the probability of the hypothesis being true before that evidence is taken into account (the prior) and the likelihood that the specific evidence would turn up if the hypothesis were indeed true.  (Strictly speaking, there is another factor involved, a denominator that serves to normalize the measure, but the two factors in the numerator are the crucial ones.  In formula form this is P(H|E) = P(E|H) P(H) / P(E).)
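To make the formula concrete, here is a minimal Python sketch of a single application of Bayes' Rule for a yes/no hypothesis.  The function name and the numbers are illustrative assumptions, not taken from any particular study; the denominator P(E) is expanded using the law of total probability over "H is true" and "H is false."

```python
def bayes_posterior(prior, p_e_given_h, p_e_given_not_h):
    """Return P(H|E) for a binary hypothesis H and observed evidence E.

    prior            -- P(H), belief in H before seeing E
    p_e_given_h      -- P(E|H), likelihood of E if H is true
    p_e_given_not_h  -- P(E|not H), likelihood of E if H is false
    """
    numerator = p_e_given_h * prior
    # P(E) = P(E|H) P(H) + P(E|not H) P(not H)
    evidence = numerator + p_e_given_not_h * (1.0 - prior)
    return numerator / evidence


# Hypothetical numbers: a 1% prior, evidence that shows up 90% of the
# time when H is true and 5% of the time when H is false.
posterior = bayes_posterior(prior=0.01, p_e_given_h=0.9, p_e_given_not_h=0.05)
print(f"P(H|E) = {posterior:.3f}")
```

With these made-up inputs the posterior comes out at roughly 15%: the evidence raises our belief considerably, but the low prior still matters.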

The main point of this post, however, is not to discuss the appropriateness of using Bayes' Rule.  It is entirely appropriate and is indeed agreed to by all rational scientists, either explicitly or implicitly.  In fact, many scientists would say that acceptance of Bayes' Rule is one of the most important criteria for being considered rational at all.

No, the point here is not to discuss Bayes' Rule per se but to discuss its Achilles' heel: the cherry picking of data and what that leads to.  Given that we are awash in data, we have many cherries to pick from.  Given that we are fallible human beings with agendas to pursue, we are all too easily tempted to select data that reinforces the hypotheses we wish to believe and to avoid like the plague any uncomfortable data that may undermine the hypothesis we are (consciously or unconsciously) rooting for.

There's an old saying in computer science, "Garbage In, Garbage Out (GIGO)," meaning, of course, that the results of a computer program are only as good as the data you put in.   If you put into a program data that the program was not designed to handle, you will likely get results back that your program was never intended to produce.

With Bayes' Rule the situation is more subtle.  GIGO still holds for Bayes' Rule, of course, but here the problem is not with garbage but with pristine, beautiful answers.  Answers that are too beautiful.  For with Bayes' Rule you have the additional problem of getting only pleasing results out because you avoided putting anything unpleasant into the formula in the first place.  (This is related to the problem of overfitting in Machine Learning, but we won't go down that rabbit hole.)

You don't even have to doctor the data to abuse it for personal gain (whether financial, psychological, emotional or political).  All you have to do is ignore data that doesn't fit your hypothesis or somehow weakens your hypothesis.  All you need to do is cherry pick the data.
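A toy example shows how little it takes.  The sketch below is an illustration under made-up assumptions, not a model of any real dataset: the hypothesis H is "this coin is biased toward heads (70%)," the alternative is "this coin is fair," and the record of flips is perfectly balanced.  Every individual flip is genuine and every update is a correct use of Bayes' Rule; the only difference between the two runs is which flips get fed in.

```python
def update(prior, p_e_given_h, p_e_given_not_h):
    """One Bayes' Rule update for a binary hypothesis."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1.0 - prior))


def posterior_after(flips, prior=0.5, p_heads_if_biased=0.7, p_heads_if_fair=0.5):
    """Update belief in 'the coin is biased' one flip at a time."""
    belief = prior
    for flip in flips:
        if flip == "H":
            belief = update(belief, p_heads_if_biased, p_heads_if_fair)
        else:
            belief = update(belief, 1.0 - p_heads_if_biased, 1.0 - p_heads_if_fair)
    return belief


# Hypothetical record: 10 heads and 10 tails, nothing doctored.
all_flips = ["H", "T"] * 10

# The cherry picker keeps only the flips that favor the hypothesis.
cherry_picked = [flip for flip in all_flips if flip == "H"]

print(f"All the data:       P(biased) = {posterior_after(all_flips):.2f}")
print(f"Cherry-picked data: P(biased) = {posterior_after(cherry_picked):.2f}")
```

Using all twenty flips, belief in the biased-coin hypothesis falls to about 15%.  Using only the ten heads, it climbs to about 97%.  The formula did nothing wrong in either case.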

When it comes to Fake News and political spin, the informal misuse of Bayes' Rule has reached the point of being a pandemic.  It is probably one of the main reasons we have become such a dangerously polarized society.  Each of the two polar opposites ignores the evidence cherished by the other side.  We tend to look only at facts that reinforce our biases.

Bayes' Rule doesn't tell us which data to count as relevant evidence in our deliberations.  But it is unethical to ignore relevant data.   In a court of law, moreover, withholding exculpatory evidence is considered a crime.  This is precisely the crux of the matter in the recent abuses of the FISA courts.
At the start of the COVID-19 pandemic,

What can be done about this?  How can one guarantee that relevant data is not being ignored or simply brushed aside?  Should there be some international organization dedicated to establishing practical norms for ethical data science?  There has been plenty of discussion, and there are plenty of articles, on the topic.  But the concerns seem to cluster mostly around topics such as data privacy, fairness, social justice, and the general desire to avoid unpleasant outcomes.  These concerns seem to stem from worries about taking inappropriate actions based on the data.

We need to separate the actions we take from the analysis of the state of the world.  Fear of taking the wrong actions should not prevent us from taking a sober and honest look at the data -- all the relevant data -- and not cherry picking to avoid unpleasant or inconvenient truths.




