Sunday, July 26, 2020

Cherry Picking of Data: The Achilles' Heel of Bayes' Rule

We are awash in data.  In the age of Facebook/Amazon/Google, the amount of data generated, analyzed, and acted upon is beyond comprehension.  In a 2018 article, Forbes reported that 2.5 quintillion bytes of data were being generated every day.  A quintillion is a 1 with 18 zeros after it!  How do we even picture a quantity of that magnitude?  Here's a YouTube attempt at comparing a quintillion pennies to some familiar objects like football fields, the Empire State Building, the Sears Tower, and so forth.  If each penny were a single byte of data, THAT's how many bytes of data we would be generating each day with all of our Google searches (5 billion), our cell phones, our Snapchat / Instagram photos, our interconnected IoT devices, and so on.

But that was way back in the ancient world of 2018.  The pace is only accelerating!

Needless to say, humans are incapable of processing that much data, and so the very devices -- computers -- that generate this data on our behalf are also analyzing and acting on it on our behalf through machine learning.  Data Science is thus by far the most far-reaching and influential science of today.

Much of data science is based on the use of Bayes' Rule, a fundamental tenet of probability and statistics.  It says that the probability of a hypothesis being true given specific evidence is proportional to the product of two things:  the probability of the hypothesis being true in the absence of evidence and the likelihood that the specific evidence would turn up if the hypothesis were indeed true.  (Strictly speaking, there is another factor involved, a denominator that serves to normalize the measure, but the two factors in the numerator are the crucial ones.  In formula form this is p(H|E) = p(E|H) p(H) / p(E).)
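To make the formula concrete, here is a minimal sketch in Python for the simplest case of a yes/no hypothesis.  The function name and the numbers are made up purely for illustration; the only thing taken from above is the formula itself, with the denominator p(E) expanded over the two cases "H is true" and "H is false."

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Return p(H|E) for a binary hypothesis H using Bayes' Rule.

    prior               -- p(H), belief in H before seeing the evidence
    likelihood_if_true  -- p(E|H), chance of the evidence if H is true
    likelihood_if_false -- p(E|not H), chance of the evidence if H is false
    """
    # Denominator p(E) = p(E|H) p(H) + p(E|not H) p(not H)
    evidence = likelihood_if_true * prior + likelihood_if_false * (1.0 - prior)
    # Numerator p(E|H) p(H), normalized by p(E)
    return likelihood_if_true * prior / evidence


# Illustrative numbers only: start undecided (50/50), then see evidence
# that is four times as likely under H as under not-H.
p = posterior(prior=0.5, likelihood_if_true=0.8, likelihood_if_false=0.2)
print(round(p, 3))
```

With these assumed inputs, the evidence moves the belief from 50% up to 80%.  Evidence that is equally likely under both hypotheses would leave the belief exactly where it started.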

The main point of this post, however, is not to discuss the appropriateness of using Bayes' Rule.  It is entirely appropriate and is indeed agreed to by all rational scientists, either explicitly or implicitly.  In fact, many scientists would say that acceptance of Bayes' Rule is one of the most important criteria for being considered rational at all.

No, the point here is not to discuss Bayes' Rule per se but to discuss its Achilles' heel, the cherry picking of data and what that leads to.  Given that we are awash in data, we have many cherries to pick from.  Given that we are fallible human beings with agendas to pursue, we are all too easily tempted to select data that reinforces the hypotheses we wish to believe and to avoid like the plague any uncomfortable data that may undermine the hypothesis we are (consciously or unconsciously) rooting for.

There's an old saying in computer science, "Garbage In, Garbage Out (GIGO)," meaning, of course, that the results of a computer program are only as good as the data you put in.   If you put into a program data that the program was not designed to handle, you will likely get results back that your program was never intended to produce.

With Bayes' Rule the situation is more subtle.  GIGO still holds for Bayes', of course, but with Bayes' the problem is not with garbage but with pristine, beautiful answers.   Answers that are too beautiful.   For with Bayes' you have the additional problem of getting only pleasing results out because you avoided putting anything unpleasant into the formula in the first place.  (This is related to the problem of overfitting in Machine Learning, but we won't go down that rabbit hole.) 

You don't even have to doctor the data to abuse it for personal gain (whether financial, psychological, emotional or political).  All you have to do is ignore data that doesn't fit your hypothesis or somehow weakens your hypothesis.  All you need to do is cherry pick the data.
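A toy example shows how little it takes.  The sketch below is only an illustration under assumptions I am making up for the purpose: the hypothesis H is "this coin is biased toward heads (70% heads)," the alternative is "the coin is fair," and the ten flips are invented.  The same Bayes' Rule update is applied twice, once to all the flips and once to only the flips that favor H.

```python
def update(prior, observations, p_heads_if_h=0.7, p_heads_if_not_h=0.5):
    """Apply Bayes' Rule once per coin flip and return the final p(H).

    H (assumed for illustration): the coin lands heads 70% of the time.
    not H: the coin is fair.
    """
    belief = prior
    for obs in observations:
        # Likelihood of this single flip under each hypothesis
        like_h = p_heads_if_h if obs == "H" else 1.0 - p_heads_if_h
        like_not_h = p_heads_if_not_h if obs == "H" else 1.0 - p_heads_if_not_h
        # p(H|E) = p(E|H) p(H) / p(E)
        belief = like_h * belief / (like_h * belief + like_not_h * (1.0 - belief))
    return belief


# Invented data: 4 heads and 6 tails.
all_flips = list("HTHTTHTHTT")

# Cherry picking: nothing is falsified, the tails are simply left out.
cherry_picked = [flip for flip in all_flips if flip == "H"]

honest = update(0.5, all_flips)
biased = update(0.5, cherry_picked)
print(f"all the data:       p(H) = {honest:.2f}")
print(f"cherry-picked data: p(H) = {biased:.2f}")
```

Starting from the same 50/50 prior, the full record pushes belief in the biased-coin hypothesis well below one half, while the cherry-picked record pushes it well above.  Every number fed into the formula in the second case is a genuine observation; the distortion comes entirely from what was left out.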

When it comes to Fake News and political spin, the informal misuse of Bayes' Rule has reached the point of being a pandemic.  It is probably one of the main reasons we have become such a dangerously polarized society.  Each of two polar opposites ignores the evidence cherished by the other side.   We tend only to look at facts that reinforce our biases.

Bayes' Rule doesn't tell us which data to count as relevant evidence in our deliberations.  But it is unethical to ignore relevant data.   In a court of law, moreover, withholding exculpatory evidence is considered a crime.  This is precisely the crux of the matter in the recent abuses of the FISA courts.
At the start of the COVID-19 pandemic,

What can be done about this?  How can one guarantee that relevant data is not being ignored or simply brushed aside?  Should there be some international organization dedicated to establishing practical norms for ethical data science?  There has been plenty of discussion, and there have been plenty of articles, on the topic.  But the concerns seem to cluster mostly around topics such as data privacy, fairness, social justice, and the general desire to avoid unpleasant outcomes.  And these seem to stem from concerns about taking inappropriate actions based on the data.

We need to separate the actions we take from the analysis of the state of the world.  Fear of taking the wrong actions should not prevent us from taking a sober and honest look at the data -- all the relevant data -- and not cherry picking to avoid unpleasant or inconvenient truths.





Thursday, July 23, 2020

Democrats and the God of the Narrative

People involved together in a crime must necessarily lie to cover their tracks, and to lie effectively enough to convince a jury, they must lie in a coordinated fashion.  They must collude.  But to collude efficiently, they must do so in top-down fashion;  they cannot allow a bottom-up, improvised, uncoordinated sort of lying, for that would be too easily discerned.  The individual liars must not embellish the lies with additional lies of their own.   All the liars must get on the same page and do so quickly.   They must get in lockstep with one another, and this will not happen unless it somehow comes from the top.

When it comes to the criminal courts, we generally don't try groups of people as groups.   We try individuals.  An individual criminal must not only be consistent in his lying -- which is certainly difficult enough -- he must also have corroborating witnesses who are willing to lie on his behalf.  And lying in court is a risky thing.  There are stiff penalties for perjury.

 Civil courts are different.  Organizations can certainly be sued in civil court.  It happens all the time.  Individuals are sued as well.  We have the RICO laws.

 From HG.org, we have the following quotation:
RICO law refers to the prosecution and defense of individuals who engage in organized crime. In 1970, Congress passed the Racketeer Influenced and Corrupt Organizations (RICO) Act in an effort to combat Mafia groups. Since that time, the law has been expanded and used to go after a variety of organizations, from corrupt police departments to motorcycle gangs. RICO law should not be thought of as a way to punish the commission of an isolated criminal act. Rather, the law establishes severe consequences for those who engage in a pattern of wrongdoing as a member of a criminal enterprise.
Consideration of witness tampering is one of the provisions of the RICO statutes.  So, if it can be proved that an organization engaged in bribery in exchange for collusion and fake corroboration on the witness stand, that organization can face civil penalties, and the individuals involved can face criminal prosecution.

But, as I said in the beginning, to lie or collude efficiently, there must be top-down directives.  

Which brings us to the Democratic Party.  It has often been noticed that on many major stories -- especially those dealing with President Trump -- several media outlets uncannily use the same phraseology if not the exact same phrases to communicate the same negative opinion toward Trump's person, beliefs and/or actions.  This cannot be a coincidence.  Given that today's news cycles are so rapid (high frequency, nearly instantaneous response), this coinciding of phraseology can only mean one thing:  It happens due to a top-down communique from a single organization, and the only viable candidate for that one organization is the Democratic National Committee or DNC.

Fortunately, today's news environment is not limited to the handful of willing colluders, not limited to CNN, MSNBC, etc., which outlets are undoubtedly receiving talking points on a daily basis from the DNC.  The Internet is brimming with myriad alternative sources.  The secondary sources may not be on distro for the DNC's talking points, indeed, may not even be sympathetic with those points.  But there are plenty who are, and some who have a more respected and permanent place in the info world, such as Wikipedia.

But the upshot of all this is that the ultimate source of this collusion, the DNC, is not interested in the truth but only in one thing:  power.  When it comes to Republican talking points there tends to be more individualism.  That's why there is a plethora of conspiracy theories on the right but the left seems to be more in lockstep.  The right wing conspiracy theorists are not waiting for talking points from the RNC.  Fox News may pay attention to RNC talking points, but the individual talking heads at Fox are not shy about disagreeing with the RNC.  It would seem also that the RNC talking points are shaped more by the bottom-up mood of the rank and file than the opinions of a few elites at the top.

To be sure, the RNC in the past has operated in some ways similar to the top-down approach of the DNC.  But that was before the political earthquake of 2016 called Donald Trump.