Fixing our statistics

December 25, 2017

Last issue, Nature released a very interesting comment entitled "Five ways to fix statistics" [1]. Their premise is very simple: as the debate goes on about how statistics is the one to to blame (more than you would imagine) for poor reproducibility, they asked five influential statisticians to recommend on change to improve science. Obviously, what they found as a common theme in all of them is that we are the one to blame, not the math.


Here I will give my humble opinions on each of the statisticians' statements. I highly recommend reading the text first though. At the end of each comment I'll put what was, in my opinion, "the best quote".


1. Dr. J Leek: Adjust for human cognition


Dr. Leek made a great point that today's researchers are using methods that were designed for small datasets on huge datasets. And the number of data differs too much between different areas of research and it is unbelievable to see the same methods being used over and over. For example, while a study on the "human cognition while under incarceration conditions" will have a very low number of points in a closed monitored study; biologists "normalizing data from chromatin conformation techniques" will have TBs of data.


He argued that interaction is critical. One of the ways to solve methodological problems (besides having a good statistician, of course) is indeed more talk between who created data and who will analyze it (see my previous post "Computational biologists: moving to the driver's seat!"), as science is becoming increasingly more interdisciplinary.


Of course he also points out to the p-value trap. As I am going to talk more about that later, I'll only leave here two papers that contains such discussion [2,3].


Dr. Leek also say that data analysis is also human behavior. For instance, humans tend to go through much more easily on bar charts than calculating angles in pie charts.


Finally, he is "doing this and taking the next step: running controlled experiments on how people handle specific analytical challenges in our massive online open courses (L. Myint et al. Preprint at bioRxiv; 2017)." (really worth reading)


Best quote: "It’s also impractical to say that statistical metrics such as p-values should not be used to make decisions. Sometimes a decision (editorial or funding, say) must be made, and clear guidelines are useful."


2. Dr. B.B. Mcshane & Dr. A. Gelman: Abandon statistical significance


Drs. Mcshane and Gelman, in my opinion, summarize the root of all evil: NHST (null hypothesis significance testing). The researchers often do everything they can to get a certain metric (such as p-value) above a certain threshold. What happens is that the data that actually what appears (read: is shared) on publications is a selected handful of the original amount.


And it gets even worse, since NHST is used to binnarize decisions, such as "had the effect" or "didn't have the effect". That is the worst possible mistake a researcher can do. In very few cases this is done knowingly but most of the cases is simply "well, these high profile papers did this as well...". You can easily demonstrate how bad NHST is when used activelly than passively. I wrote a 1-minute R code that generates random normal distributions with 10, 100, 1000, 10000 and 100000 samples. Each sample number in the plot below is represented by a color. Each sample number was obtained twice (one has a straight line, the other is dashed) and compared using the beloved NHST: Kolmogorov–Smirnov. What happened?:


Kolmogorov–Smirnov Test Results (P<.05 red; P>.05 green backgrounds):
> "p-value for dataset with 10 samples: 0.000457554829072"
> "p-value for dataset with 100 samples: 0.366726443884823"
> "p-value for dataset with 1000 samples: 0.00721312593365186"
> "p-value for dataset with 10000 samples: 0.945279489669248"
> "p-value for dataset with 100000 samples: 0.031850321283197"


You'll say: you picked and chose! I'll go: I ran it 1000 times and picked the run #732 without looking at the others. You'll say: But of course. Why didn't you try method X. I'll go: Recently, when analyzing data from Perturb-seq [4], simply performing Kolmogorov–Smirnov outperformed all published methods for "handling" droplet-based single-cell RNA-seq data. And I have also other things to do! I work!


Best quote: "NHST was supposed to protect researchers from over-interpreting noisy data. Now it has the opposite effect."


3. Dr. D. Colquhoun: State false positive risk, too


Dr. Colquhoun put into words a fact that is very overlooked: That the FPR (false positive risk) is always bigger than the p-value. How bad is it? It depends on "the prior probability of there being a real effect. If this prior probability were low, say 10%, then a p-value close to 0.05 would carry an FPR of 76%. To lower that risk to 5% (which is what many people still believe p-value < 0.05 means), the p-value would need to be 0.00045."


But it is indeed hard to state the prior probability, especially in social sciences. Nevertheless, he points to a very interesting FPR calculator:


Best quote: "Imagine the healthy skepticism readers would feel if, when reporting a just-significant p-value, a value close to 0.05, they also reported that the results imply a false-positive risk of at least 26%. And that to reduce this risk to 5%, you’d have to be almost (at least 87%) sure that there was a real effect before you did the experiment." In other words (actually, a Figure):


4. Dr. M.B. Nuijten: Share analysis plans and results


Dr. Nuijten claims that the scientists should be held accountable for their "bad conventions" (after all, study!). But there are so many issues to create a set of gold standards. She correctly claim that even by chance someone would find the correct path to the correct finding and also by chance would someone use only "good methodology" but end up in a false positive discovery.


In the beautiful words of Dr. Nuijten:


"Planning and openness can help researchers to avoid false positives. One technique is to preregister analysis plans: scientists write down (and preferably publish) how they intend to analyze their data before they even see them. This eliminates the temptation to hack out the one path that leads to significance and afterwards rationalize why that path made the most sense. With the plan in place, researchers can still actively try out several analyses and learn whether results hinge on a particular variable or a narrow set of choices, as long as they clearly state that these explorations were not planned beforehand.


The next step is to share all data and results of all analyses as well as any relevant syntax or code. That way, people can judge for themselves if they agree with the analytical choices, identify innocent mistakes and try other routes."


5. Dr. S.N. Goodman: Change norms from within


As Dr. Goodman says: There are so many fields and issues to consider, as discussed above, that standards would be hard to establish. And if established, they would die easily. Dr. Goodman remembered us that the Statistical Association broke from the tradition against the misuse of p-values, but they cannot fix the culture from other fields. Importantly, just above this sentence he mentions that "Statisticians can be allies". Which again, brings back my point that science is intrinsically interdisciplinary. Although... statisticians often fight as well:


I'll state the best quote before the ending for a reason:

Best quote: "Many scientists want only enough knowledge to run the statistical software that allows them to get their papers out quickly, and looking like all the others in their field."


And the end of the phrase is where things get dirty. Often you will see publications where someone used the same "analysis pipeline" of an already accepted paper. And the reviewer wouldn't even bother there, since it was already reviewed. And turns the malpractice a vicious circle.


And, as Dr. Goodman argues, the only way to break this wheel is whether the funders, journals and leaders of various fields are willing to take up the challenge.


Finally, I'll say goodbye with a link to a very nice series by Nature entitled: "Points of Significance - Statistics for Biologists" [5]. And an interesting paper by Georgescu & Wren on an "algorithmic identification of discrepancies between published ratios and their reported confidence intervals and p-values" [6].


1. Leek J. et al. (2017) Nature, 551:557-559.
2. Nuzzo R. (2014) Nature, 506:150–152.
3. Woolston C. (2015) Nature, 519: 9.
4. Datlinger P. et al. (2017) Nature Methods, 14:297–301.

6. Georgescu C. & Wren J.D. (2017) Bioinformatics, btx811.


Figure Sources:
Fig.2. Script bellow


Source Code:
1-minute R script to create the data for the Fig.3. And yes, there is a reason I didn't set a "seed". Run it several times and see what happens.

# Distributions
norm10.1 <- rnorm(10)
norm10.2 <- rnorm(10)
norm100.1 <- rnorm(100)
norm100.2 <- rnorm(100)
norm1000.1 <- rnorm(1000)
norm1000.2 <- rnorm(1000)
norm10000.1 <- rnorm(10000)
norm10000.2 <- rnorm(10000)
norm100000.1 <- rnorm(100000)
norm100000.2 <- rnorm(100000)


# Plotting <- "./plot.pdf"
pdf(, width = 7.0, height = 5.0, paper = 'special')
par(mar = c(2, 2, 0.5, 0.5))
plot(density(norm10.1), lwd = 1, lty = 1, col = "black", xlim = c(-4, 4), ylim = c(-0.01, 0.5), main = "", xlab = "", ylab = "")
lines(density(norm10.2), lwd = 1, lty = 2, col = "black")
lines(density(norm100.1), lwd = 1, lty = 1, col = "blue")
lines(density(norm100.2), lwd = 1, lty = 2, col = "blue")
lines(density(norm1000.1), lwd = 1, lty = 1, col = "red")
lines(density(norm1000.2), lwd = 1, lty = 2, col = "red")
lines(density(norm10000.1), lwd = 1, lty = 1, col = "green")
lines(density(norm10000.2), lwd = 1, lty = 2, col = "green")
lines(density(norm100000.1), lwd = 1, lty = 1, col = "orange")
lines(density(norm100000.2), lwd = 1, lty = 2, col = "orange")
plot.legend = c("10 samples", "100 samples", "1000 samples", "10000 samples", "100000 samples")
plot.col = c("black", "blue", "red", "green", "orange")
legend(2, 0.5, plot.legend, col = plot.col, border = "black", lty = 1, lwd = 1, cex = 0.8, pt.lwd = 1)


# Testing
ks10 = ks.test(norm10.1, norm10.2, alternative = "two.sided")
ks100 = ks.test(norm100.1, norm100.2, alternative = "two.sided")
ks1000 = ks.test(norm1000.1, norm1000.2, alternative = "two.sided")
ks10000 = ks.test(norm10000.1, norm10000.1, alternative = "two.sided")
ks100000 = ks.test(norm100000.1, norm100000.2, alternative = "two.sided")
print(paste("p-value for dataset with 10 samples:", ks10$p.value, sep=" "))
print(paste("p-value for dataset with 100 samples:", ks100$p.value, sep=" "))
print(paste("p-value for dataset with 1000 samples:", ks1000$p.value, sep=" "))
print(paste("p-value for dataset with 10000 samples:", ks10000$p.value, sep=" "))
print(paste("p-value for dataset with 100000 samples:", ks100000$p.value, sep=" "))



Share on Facebook
Share on Twitter
Share on LinkedIn
Please reload

Please reload

Please reload


© 2017 by Eduardo Gade Gusmao