It’s safe to say, when you boil it down, almost all of my clients come to me because of p-values. Sometimes it’s because a bunch of studies falsely state that a chemical/product is toxic at human-relevant concentrations (when it’s really safe). Sometimes it’s because they had a third-party laboratory run some safety tests, and the lab misanalyzed the data, producing a significant p-value when there isn’t really a difference. Sometimes it’s because a study was published in the literature and it was completely bonkers. Regardless, in all of these situations a single number led someone to conclude there was a serious issue when there really wasn’t. And it’s all because of a p-value.
It’s horrific to think that a single number, a single statistic, determines so much. The p-value determines if a study gets published (if the findings aren’t statistically significant, it’s harder to publish the study). The p-value determines if a chemical or product is deemed safe, effective, or both. The p-value determines if a reporter writes a story about your study. Get enough statistically significant p-values and your star burns brighter — you get more grants (as an academic), you get more accolades, you get more papers published, you get awards, you get promoted — it’s a feed-forward cycle.
But really — what is this thing, this p-value?
In this post I’m going to discuss what most people think it is — the myth of the p-value. Said another way — I’m going to talk about what a p-value isn’t.
Then I’m going to discuss what a p-value actually is.
And from there, I’ll discuss some of the problems that our over-reliance on p-values has created. If you’re already convinced that you need to move beyond p-values, then you may want to see my article in Towards Data Science on breaking the p-value habit. I’ll also have more blog posts in the future on going Beyond P-Values, but if you can’t wait, reach out and we can chat about it sometime.
Where Are These P-Value Thingies Found?
I’m going to assume you’re completely new to this p-value thing. If so, you’re probably wondering, “where in the world would I find these things you’re ranting about?”
So, here’s what ya do: go pick up a scientific paper. Don’t worry, I’ll wait. Got one? Good.
P-values aren’t hard to find in the wild once you know where to look. It’s like finding rabbit scat near your garden — once you know what you’re looking for you can’t stop seeing them.
So, take a look at the scientific paper you found. Sometimes, not always, p-values are hiding in the Abstract of the paper. Look for something that says (p < 0.05) or (p = some number less than one). You are almost guaranteed to find them in the Results section. You may even find them in the figure legends, or in tables. I circled a column of p-values in the table in the following image (the paper is a random paper I just happened to be reading for a case: Michel et al (2019) Pediatrics 143(2): e20173963).
Hopefully you found some. If not, try another paper. The Michel et al. paper has a lot of p-values in it.
The Myth of the P-Value
The problem with the p-value is that it has a mythical status, and unfortunately, statistics textbooks continue to perpetuate these myths. Here are just a few of the p-value myths I’ve encountered in my journey as a toxicologist cross-trained in statistics:
Myth 1: In toxicology, the p-value is equal to the probability that the chemical/product is safe
P-values are an integral part of the null hypothesis statistical testing (NHST) regime. The idea is that when you’re doing a statistical test you’re trying to see if the treatment group is different from an untreated/control group. In toxicology, we are trying to see if the chemical/product is toxic. The null hypothesis here is that the chemical/product is acting the same as the untreated/control group. That means the alternative hypothesis is that the chemical/product is toxic.
A lot of people, including toxicologists, believe that the p-value is equal to the probability that the chemical/product is safe. For statistical significance, most toxicologists want to maintain a 5% false positive rate (we call this alpha in statistics). Thus, toxicologists will say that if the p-value is less than 5%, the result is significant. And because they believe the p-value is the probability that the chemical/product is safe, a p-value below 5% means, to them, that there is a less than 5% chance the chemical is safe, and therefore, that the chemical is toxic.
Let that sink in for a second.
Under this myth, if the probability that the chemical is safe is less than 5%, then the toxicologist will deem it toxic.
The problem is this is simply a myth. P-values aren’t the probability that a chemical/product is safe. P-values have absolutely nothing to do with the probability that the null hypothesis is true.
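You can watch this myth fall apart in a quick simulation. Here’s a minimal sketch (pure Python, with made-up group sizes): we run many experiments where the “chemical” is perfectly safe — both groups come from the exact same distribution — and compute a p-value each time. If the p-value really were the probability the chemical is safe, safe chemicals should rarely produce small p-values. Instead, by construction, about 5% of them come out “significant” anyway.

```python
import math
import random

def z_test_p_value(a, b, sigma=1.0):
    """Two-sided p-value from a two-sample z-test, assuming a known standard deviation."""
    se = sigma * math.sqrt(1 / len(a) + 1 / len(b))
    z = (sum(a) / len(a) - sum(b) / len(b)) / se
    # Standard normal CDF via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(0)
n_experiments = 10_000
false_positives = 0
for _ in range(n_experiments):
    control = [random.gauss(0, 1) for _ in range(10)]
    treated = [random.gauss(0, 1) for _ in range(10)]  # same distribution: truly "safe"
    if z_test_p_value(control, treated) < 0.05:
        false_positives += 1

# Roughly 5% of perfectly safe "chemicals" still get flagged as significant
print(false_positives / n_experiments)
```

The ~5% you see is the false positive rate (alpha) doing exactly what it was designed to do; it says nothing about the probability that any particular chemical is safe.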
Myth 2: The p-value reflects the effect size or the importance of a result
Okay, this is a big nope. I cannot even begin to tell you the number of times I’ll see scientists do things like 1 star means p < 0.05, 2 stars means p < 0.01, 3 stars means p < 0.001. There’s nothing inherently wrong with denoting these different levels; however, there’s no reason to believe a 3-star statistical result is better than a 1-star statistical result — these aren’t Michelin Stars, and statistical results aren’t restaurants.
But that’s how it’s treated! It seriously needs to stop.
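A back-of-the-envelope calculation shows why star counts say nothing about importance. This sketch (pure Python, hypothetical numbers, using a known-sd z-test for simplicity) computes the p-value for a biologically trivial difference measured in an enormous sample, and for a large difference measured in a tiny sample. The trivial effect earns “three stars”; the big one doesn’t even reach one.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def p_for_mean_difference(diff, sd, n_per_group):
    """Approximate p-value for a mean difference between two groups, assuming known sd."""
    se = sd * math.sqrt(2 / n_per_group)
    return two_sided_p_from_z(diff / se)

# Tiny, biologically trivial difference (0.05 sd), but an enormous sample:
p_tiny_effect = p_for_mean_difference(diff=0.05, sd=1.0, n_per_group=100_000)

# Large, potentially meaningful difference (1 full sd), but a small sample:
p_big_effect = p_for_mean_difference(diff=1.0, sd=1.0, n_per_group=4)

print(p_tiny_effect)  # far below 0.001: "three stars" for a negligible effect
print(p_big_effect)   # above 0.05: "not significant" despite a big effect
```

Sample size, not importance, is doing the heavy lifting here — which is exactly why stars make terrible Michelin ratings.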
Myth 3: P-values reflect the level of evidence against the null hypothesis
Having a p-value of 0.70 does not mean that there is strong evidence that the null hypothesis (e.g., that a chemical is safe) is true. Likewise, a p-value of 0.00000000001 does not mean that there is strong evidence that the null hypothesis is false (e.g., that the chemical is toxic). It just doesn’t work this way.
What Does a P-Value Actually Mean?
I’m going to quote the American Statistical Association, because, really, they said it best.
Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
(American Statistical Association Statement on P-Values)
The “informally” part is important and cannot be overstated. This is the simple, informal definition. The full definition, which is what we all need to understand, is actually far more nuanced. And nuance is important.
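That informal definition can be turned into code almost word for word. One concrete way (not the only one) to make the “specified statistical model” explicit is a permutation test: under the null, group labels are arbitrary, so we shuffle them many times and count how often the shuffled mean difference is at least as extreme as the one we observed. The numbers below are made up purely for illustration.

```python
import random

def permutation_p_value(control, treated, n_shuffles=10_000, seed=1):
    """P-value = fraction of label shuffles whose mean difference is
    at least as extreme as the observed one (the ASA definition, literally)."""
    rng = random.Random(seed)
    observed = abs(sum(treated) / len(treated) - sum(control) / len(control))
    pooled = control + treated
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        a, b = pooled[: len(control)], pooled[len(control):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            extreme += 1
    return extreme / n_shuffles

# Hypothetical measurements (arbitrary units), just for illustration:
control = [30, 28, 35, 31, 29, 33]
treated = [30, 36, 34, 38, 35, 37]
p = permutation_p_value(control, treated)
print(p)
```

Notice what the number is: the probability of a summary this extreme *given the model* (here, exchangeable labels). Nowhere does it compute the probability that the null hypothesis is true.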
P-Values Indicate The Degree to Which the Data are Incompatible with the Model
This is really important. Frequentist statistical tests are mathematical models with a lot of assumptions attached that are focused on that pesky null hypothesis (because we are talking p-values we are always talking about Frequentist models; Bayesian models are a completely different beast). You cannot remove the null hypothesis from the statistical test — it’s part and parcel of the whole package.
Think back to our earlier chat about the null hypothesis — in toxicology this is typically the hypothesis that the control/untreated group is no different from the treated group — in other words, the chemical/product is safe.
So what I’m saying here is simply this: the statistical test has the null hypothesis comparison built-in, and it has a bunch of assumptions also attached.
Here’s the kicker: the p-value reflects the incompatibility of the data with the model.
If you violate the assumptions — you could have a smaller p-value.
If the data do not support the null hypothesis — you could have a smaller p-value.
Bottom line: if the model is not a good fit for the data, then the p-value can end up small.
So, you might be thinking, “Dude, does that mean if I pick a really crappy model that’s not right at all, I could game the system and get a small p-value, and make everyone think that a chemical/product is actually toxic when it’s not?”
My response: “Yeaaaahhhhh…I don’t recommend doing that, because it’s unethical, but yes, that would likely be the result.”
And someone, not you, overhearing this says, “Oh, I’m so totally doing this so I can get a paper published.”
Okay, that’s a bit of an exaggeration, and I would hope no one is unethical enough to intentionally choose the wrong statistical test.
But you know what does happen…all…the…time? Scientists inadvertently choose the wrong test all the time. I don’t believe they have any ill will, and I don’t believe they’re trying to game the system. I believe they honestly believe they are doing the right thing. But, ya know, for whatever reason, they didn’t feel like they should talk with a statistical expert about it…
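To see how an ill-fitting model shrinks p-values all by itself, here’s a hedged simulation (pure Python, hypothetical study sizes, using 1.96 as an approximate two-sided 5% cutoff). A pooled two-group test assumes both groups have equal variance. If the small treated group is far noisier than the large control group, that assumption is badly violated — and a chemical that does absolutely nothing gets flagged far more often than the nominal 5%.

```python
import math
import random

def pooled_t_statistic(a, b):
    """Pooled two-sample t statistic -- assumes equal variances in both groups."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    sp2 = (ssa + ssb) / (len(a) + len(b) - 2)  # pooled variance estimate
    return (ma - mb) / math.sqrt(sp2 * (1 / len(a) + 1 / len(b)))

random.seed(2)
n_sims = 5_000
flagged = 0
for _ in range(n_sims):
    # The null is TRUE: both groups have mean 0, the "chemical" does nothing.
    # But the small treated group is 10x noisier than the big control group,
    # violating the equal-variance assumption of the pooled test.
    treated = [random.gauss(0, 10) for _ in range(5)]
    control = [random.gauss(0, 1) for _ in range(50)]
    # |t| > 1.96 as an approximate 5% two-sided cutoff (normal approximation)
    if abs(pooled_t_statistic(treated, control)) > 1.96:
        flagged += 1

print(flagged / n_sims)  # far above the nominal 0.05
```

Nothing about the chemical changed between this simulation and an honest one — only the mismatch between the model’s assumptions and the data. That mismatch alone manufactures “significant” results.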
So Has Our Inappropriate Reverence for P-Values Caused Any Harm?
Yes. Yes it has. Lots of harm.
Right now, we’re stuck in a reproducibility crisis. What does that mean?
It means that we have spent billions — yes, billions — of taxpayer dollars (around the world) chasing what we affectionately call “noise” (in the US, toxicology alone accounts for at least millions of taxpayer dollars, probably billions depending on how far back you go). We have tons of studies published today, in just toxicology, that can’t be reproduced. We have billions of dollars spent just on toxicology/safety studies where inappropriate statistical tests were used, or where assumptions of the test were violated (and I’m not even going to discuss the sampling bias issue in this post).
Our chemical regulations revolve around p-values: they generally require standard toxicity tests run under Organisation for Economic Co-operation and Development (OECD) guidelines, and those guidelines force companies to use p-values. In my experience, I see chemicals flagged as toxic under OECD guideline tests, because of strict adherence to the p-value without thinking about the biology, when they really aren’t (there are other issues, but I won’t discuss those today).
The Problem is that Biologists/Toxicologists Have Stopped Being Biologists/Toxicologists…
And that’s the bigger issue in toxicology — most toxicologists stop thinking once they get a p-value. They don’t go the extra step to ask, “is the effect size associated with this p-value actually biologically meaningful?”
For instance, I had a client send me some data once to say their chemical was actually health protective! They were looking at a measure called ALT (it’s a biomarker for liver injury). And they said their chemical had a lower ALT than the control group, so therefore their chemical was health protective. They had a p=0.01 and that proved it! As a Bayesian, I know that I need to look beyond just that control group, and look at the historical control range of ALTs in all untreated animals of that same species. Having done graduate work on the liver, I already knew the normal range of ALTs is pretty wide. When I re-analyzed the data using the historical controls, I was able to show that there was actually no effect of their chemical — the ALTs for the treated animals were solidly within the distribution for the over 1,000 control animals we had data for.
See, the biggest problem with p-values is that they’re based on a sample of animals. When doing Frequentist statistics (the type that generate p-values) we *hope* that the sample represents the population. If the sample represents the population well, then our inferences are valid. If it doesn’t, then our inferences are invalid. Unfortunately, when you have a small number of animals in the control group, the sample often fails to represent the population accurately.
What the client should have done is a simple exploratory analysis. They should have looked at more data and examined the histograms before even attempting a statistical test. Had they compared the histogram of historical controls with that of the current controls (or better yet, done some Bayesian updating with the historical data), and then compared that to the treated distribution, they would have seen that the distributions overlap completely. They would have known right away: nope, no need to run a statistical test, because the groups are not at all different.
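That exploratory step can be sketched in a few lines. The numbers below are entirely made up — a simulated stand-in for a real historical-control database — but the idea carries over: before any test, just ask where each treated animal lands within the historical control distribution.

```python
import random

def empirical_percentile(value, reference):
    """Percentile of `value` within the reference distribution (0-100)."""
    return 100 * sum(1 for r in reference if r <= value) / len(reference)

# Hypothetical historical controls: ALT values for 1,000 untreated animals
# (lognormal is a common rough shape for biomarkers; parameters are invented).
random.seed(3)
historical_alt = [random.lognormvariate(3.4, 0.3) for _ in range(1000)]

# Hypothetical treated group -- lower than the concurrent control group,
# but is it actually outside the normal historical range?
treated_alt = [24.0, 27.5, 31.0, 22.8, 29.3]

percentiles = [empirical_percentile(alt, historical_alt) for alt in treated_alt]
for alt, pct in zip(treated_alt, percentiles):
    print(f"ALT {alt:5.1f} -> {pct:5.1f}th percentile of historical controls")
```

If every treated animal sits comfortably inside the middle of the historical distribution, there is nothing for a statistical test to find — the “protective effect” was an artifact of a small, unrepresentative concurrent control group.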
Statisticians will ask you something consistently: what’s the biologically meaningful difference? If you don’t know that now, don’t bother running the statistical test, because you won’t be able to interpret the results properly.
Remember: a significant p-value just means the data are not compatible with the model. It’s your job to figure out why. If you’re a toxicologist, that means seriously thinking about WHY the p-value is so small.
Better yet — call a Bayesian statistician, like me, BEFORE you design the study, so we can ensure you get the most bang for your buck. But whatever you do, stop, just stop, using p-values.
Friends don’t let friends stop at the p-value (good friends don’t let you calculate the p-values).