Friday, 27 November 2015

The Four Horsemen of Statistics



When, in the Book of Revelation, the Lamb of God opens the first four of the seven seals on the scroll in God’s right hand, four figures emerge on white, red, black and pale horses. The riders represent conquest, war, famine, and death. They are best known by their group name: the Four Horsemen of the Apocalypse.

In our technological and data-driven world virtually everyone needs statistics in one form or another. However, most people are not initiated into the subtleties and dangers of using statistics on real data. The little bit of learning in your undergrad class is in no way adequate. This is why it is extremely important to have a statistician in your enterprise or to be able to fall back on a statistical consultant. But I digress…

In this series of four posts I want to introduce a new concept: the Four Horsemen of Statistics. These are four concepts or situations in which great danger lies ahead for the uninitiated. To keep the audience captivated I will not disclose the full list right now, but start with the first Horseman: multiple testing.

The complications of multiple testing probably ruin the credibility of more publications than any other statistical concept. This is most beautifully illustrated by John P.A. Ioannidis's 2005 essay with the intriguing title «Why most published research findings are false». In examining the causes of errors in research findings, Ioannidis considers multiple testing a major factor.

Though not fully intuitive, multiple testing can be explained in a few words. Whenever you perform a statistical test, you allow for a certain amount of error. Each additional test adds to this «allowed» error, and it accumulates quickly. Unfortunately, this «allowed» error is necessary for the logic of testing; without it there would never be a decision. It goes by several names, such as significance level or type I error rate, and is often denoted by the Greek letter α.

To give an indication of the magnitude of the problem: assume that you choose a significance level of α = 5%. A single test then has a 5% chance of a false positive, i.e., of finding something when there is nothing. Performing a single additional, independent test at the same significance level increases the probability of at least one false positive to 9.75% (1 − 0.95² = 0.0975). When we perform 13 tests, the chance of having at least one wrong test result is already about 49%, and with 14 tests it exceeds 50%.
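The numbers above can be checked with a few lines of Python. This is only a minimal sketch: the function name `familywise_error_rate` is my own, and the formula 1 − (1 − α)^n assumes that the tests are independent of each other, which real analyses often are not.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n independent tests.

    Assumption: every test is run at the same level alpha and the
    tests are independent, so the chance of *no* false positive
    is (1 - alpha) ** n_tests.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


# Print the error rate for a growing number of tests at alpha = 5%.
for n in (1, 2, 13, 14):
    rate = familywise_error_rate(0.05, n)
    print(f"{n:>2} tests: {rate:.2%} chance of at least one false positive")
```

With two tests this gives 9.75%, and the 50% mark is crossed between 13 and 14 tests.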

Fortunately, there is a solution to this issue.

It is quite unspectacularly called multiple testing correction. The idea is to adjust the significance level α so that the effect of the multiple tests on the probability of making false positive decisions is compensated. In our example above, this means that for the two tests performed we do not use 5% as the significance level, but instead divide it by the number of tests, i.e., two, yielding a new α of 2.5% per test. When we now compute the chance of observing at least one false positive, we get about 4.9%, just below the intended 5%.
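Dividing α by the number of tests is commonly known as the Bonferroni correction. Here is a small sketch of it in Python; the function names and the three p-values are made up for illustration, and the error-rate formula again assumes independent tests.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level: alpha divided by the number of tests."""
    return alpha / n_tests


def familywise_error_rate(per_test_alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


def significant_after_correction(p_values, alpha=0.05):
    """Flag each p-value that is still significant at the corrected level."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# The corrected level shrinks quickly as the number of tests grows.
for n in (2, 13, 2000):
    per_test = bonferroni_alpha(0.05, n)
    overall = familywise_error_rate(per_test, n)
    print(f"{n:>4} tests: per-test alpha = {per_test:.6f}, overall = {overall:.4f}")

# Made-up p-values: with three tests the threshold is 0.05 / 3, about 0.0167.
print(significant_after_correction([0.001, 0.02, 0.04]))  # [True, False, False]
```

Note that two of the three made-up p-values would have counted as significant at the uncorrected 5% level.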

The uncomfortable fact remains, however, that after performing a couple of thousand tests the corrected significance level becomes minuscule (5% divided by 2,000 tests is 0.0025%), leaving researchers desperately trying to find something to publish, with nothing significant at all.

Soon I will present the second Horseman of Statistics, right here…

Wednesday, 4 November 2015

Bar plots are nonsense

I was once again utterly stupefied by the number of bar plots I had to endure during the CC-PM retreat in the beautiful Kartause Ittingen last weekend. Like this one...
from Beaton et al., 2015, Mol Met, dx.doi.org/10.1016/j.molmet.2015.08.003
OK, to be honest, that one is not from last weekend, but there were many like it. If you don't believe me, take a stopwatch and check how long it takes you to find one of these useless bar plots on Google Scholar: usually less than a minute within the life sciences.

What is bad about these plots, you ask? Well, put simply, they couldn't be more misleading. There are several issues with this nonsensical way of representing different samples of measurements, such as the amount of 14C-glucose per well.

1. Spread/variation of data versus precision of estimation

The general goal of your average PhD student at a scientific conference, retreat or whatever these events might be called nowadays, is to show that a group of measurements she has made on a control is lower (or higher) than a group of measurements she has made on a sample that was in some way perturbed relative to the control, usually called the treatment.
Ultimately, a difference between two groups of measurements is assessed using a statistical test, for instance a t-test if your data is really nice, or a Wilcoxon rank-sum test if your data is kind of naughty. However, it is useful, at least from a marketing perspective, to find a way to visualize your results.
Now, there are two things you might want to show when illustrating a group of observations:
  1. The spread/variability of the group.
  2. How good you were in estimating some kind of summary of a group, e.g., the mean value.
While the first case might seem intuitive, the latter might not. We often replace a group of observations with a summary measure, and when we do, we usually use the mean value, or in colloquial terms the average. However, computing the mean value of a sample is generally understood to be an estimation of the population mean. As most of my readers will know, the more data we have (that is, the larger the sample is), the better we can estimate the population mean. The precision of this estimate is most appropriately indicated by a confidence interval around the sample mean. Note that this is not the standard error but the confidence interval, which in its approximate 95% form spans two standard errors in each direction! Nevertheless, most bar plots show standard error bars...
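To make the distinction concrete, here is a small Python sketch that computes the standard deviation (spread), the standard error (precision of the mean) and the approximate 95% confidence interval described above. The data are simulated from a normal distribution with mean 10 and standard deviation 2. These numbers, the seed and the function name `describe` are my own illustrative choices, not real measurements.

```python
import math
import random
import statistics


def describe(sample):
    """Return spread (sd) and precision (se, approximate 95% CI) of a sample."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)        # spread of the observations
    se = sd / math.sqrt(n)               # precision of the estimated mean
    ci = (mean - 2 * se, mean + 2 * se)  # approximate 95% confidence interval
    return {"n": n, "mean": mean, "sd": sd, "se": se, "ci": ci}


rng = random.Random(1)  # fixed seed so the sketch is reproducible
small = [rng.gauss(10.0, 2.0) for _ in range(20)]
large = [rng.gauss(10.0, 2.0) for _ in range(2000)]

for sample in (small, large):
    d = describe(sample)
    print(
        f"n={d['n']:>4}  mean={d['mean']:.2f}  sd={d['sd']:.2f}  "
        f"se={d['se']:.3f}  CI=({d['ci'][0]:.2f}, {d['ci'][1]:.2f})"
    )
```

The standard deviation stays in the same ballpark for both samples, because it describes the data. The standard error and the confidence interval shrink with the larger sample, because they describe how well the mean was estimated.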

Anyhow, adding either a confidence interval or a standard error to a mean value has no descriptive power for the distribution of the data - or variability, or spread!

2. Bar plots cannot show you differences 

Let's look at the following example:


On the left we have a selection of six groups each with 20 observations. Clearly, these groups are not the same when we look at the scatter plot. However, when using a bar plot it seems that everything is the same in these groups. Even the standard error bars indicate no difference. Probably, we messed up the experiment or something.
If, instead, we use the much more useful box plot, we immediately identify different groups. Even more forensic is the use of violin plots, which show the mirrored probability density of the data and as such allow for the identification of bi- or multimodal distributions.
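The same effect can be reproduced without any plotting. The following Python sketch builds three artificial groups of 20 observations with very different shapes and rescales them to the same mean and standard deviation, so that a bar with standard error bars would look identical for all of them. The shapes, the target values and the helper `rescale` are invented for this illustration and are not the data behind the figure.

```python
import math
import statistics


def rescale(values, target_mean=10.0, target_sd=2.0):
    """Shift and stretch values so they get exactly this mean and sd."""
    m = statistics.fmean(values)
    s = statistics.stdev(values)
    return [target_mean + target_sd * (v - m) / s for v in values]


# Three made-up shapes, 20 observations each.
shapes = {
    "bimodal": [-1.0] * 10 + [1.0] * 10,
    "evenly spread": [float(i) for i in range(20)],
    "one outlier": [0.0] * 19 + [10.0],
}
groups = {name: rescale(raw) for name, raw in shapes.items()}

for name, data in groups.items():
    mean = statistics.fmean(data)
    se = statistics.stdev(data) / math.sqrt(len(data))
    # Quartiles are what a box plot draws; they expose the shape.
    q1, median, q3 = statistics.quantiles(data, n=4)
    print(
        f"{name:<14} mean={mean:.2f} se={se:.2f} | "
        f"q1={q1:.2f} median={median:.2f} q3={q3:.2f}"
    )
```

Mean and standard error agree across all three groups, which is all a bar plot would show, while the quartiles differ clearly.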

Try it yourself on https://stekhoven.shinyapps.io/barplotNonsense


3. There might be a bright future

I actually have to be honest with you once more: the first chart I found when looking for one that shows the distribution of multiple groups of measurements was this one:
from Sonay et al., 2015, Genome Res, doi/10.1101/gr.190868.115
Not only is this a great way to indicate the difference between multiple groups, using a combination of violin and box plots (well, you have to pay attention!), but the author also uses Hadley's ggplot2 ... so maybe there is still a bright future ahead! I am convinced of it!