What does the p-value really tell us?
Welcome back! If you missed the previous installment, you can find it here.
Continuing the series, we’ll be talking about the p-word. That’s right, “p-values”. A concept so central to statistics, yet one of the most often misunderstood.
Not too long ago, the journal Basic and Applied Social Psychology outright banned p-values from appearing in its articles. This and other controversies over the use and interpretation of p-values led the American Statistical Association (ASA) to issue a formal statement on the topic; making recommendations on such a fundamental point of statistical practice was unprecedented for the organization.
Part of the confusion stems from the complacency with which we teach p-values, which leads to their blind application as the litmus test for significant findings.
Q: Why do so many colleges and grad schools teach p = 0.05?
A: Because that’s still what the scientific community and journal editors use.
Q: Why do so many people still use p = 0.05?
A: Because that’s what they were taught in college or grad school.
– George Cobb
Snide comments aside, let us unpack what a p-value does and does not tell us. First, take a look at the following twenty sets of randomly generated data:
Each one of the boxes contains 50 points whose x-y coordinates were randomly generated from a normal distribution with mean 0 and variance 1. Yet, we see that there is occasionally a set of points that appears to have a trend, such as the one highlighted in red, which turns out to exhibit a correlation of 0.45. If even random noise can display patterns, how do we discern when we have a real mechanism influencing some response versus simply random data? P-values provide this support by giving us a measure of how “weird” an observed pattern is, given a proposal of how the world works.
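The figure described above is easy to reproduce. The sketch below (the seed and exact values are arbitrary; a different seed will give different correlations) draws twenty panels of 50 independent standard-normal (x, y) points and records the sample correlation in each. Even though the true correlation is zero everywhere, some panels will show a noticeable apparent trend.

```python
import numpy as np

rng = np.random.default_rng(1)

# Twenty panels of 50 (x, y) points, all drawn independently from N(0, 1),
# so the true correlation in every panel is exactly zero.
correlations = []
for _ in range(20):
    x = rng.standard_normal(50)
    y = rng.standard_normal(50)
    correlations.append(np.corrcoef(x, y)[0, 1])

# Some panels will still show a sizable apparent correlation by chance alone.
print("largest |correlation| among 20 random panels:",
      round(max(abs(r) for r in correlations), 2))
```

Running this repeatedly with different seeds, it is not unusual to see a panel with a correlation of 0.3 or more, purely by chance.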
More formally, the ASA defines a p-value as "the probability under a specified statistical model that a statistical summary of the data would be equal to or more extreme than its observed value". Note that this says nothing directly about the real world. Rather, it measures how much doubt the data cast on one particular statistical view of the world. If our null hypothesis were true and our model of the world reasonably accurate, a "statistically significant" p-value means that something unlikely has happened (where unlikely might be defined as a 1 in 20 chance). So unlikely, in fact, that it casts serious doubt on whether that null hypothesis is a good model of the world after all. It is important to note, however, that this does not mean that your alternative hypothesis is true.
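This definition can be made concrete with a permutation test, a minimal sketch of the "equal to or more extreme" idea (the data here are simulated, so the null hypothesis of zero correlation really is true): we compare an observed correlation to the correlations produced by many random re-pairings of the same data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: 50 points from independent standard normals,
# so the null hypothesis of zero correlation is actually true.
x = rng.standard_normal(50)
y = rng.standard_normal(50)
observed_r = np.corrcoef(x, y)[0, 1]

# Permutation test: shuffle y many times and ask how often random
# pairings yield a correlation at least as extreme as the one observed.
n_perm = 5_000
perm_r = np.empty(n_perm)
for i in range(n_perm):
    perm_r[i] = np.corrcoef(x, rng.permutation(y))[0, 1]

# The p-value is exactly the ASA definition: the fraction of results,
# under the null model, as extreme or more extreme than what we saw.
p_value = np.mean(np.abs(perm_r) >= abs(observed_r))
print(f"observed r = {observed_r:.3f}, permutation p-value = {p_value:.3f}")
```

Because the null hypothesis is true here by construction, the p-value is simply a draw from its null distribution; about 1 time in 20, a run like this will produce p < 0.05 anyway.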
Conversely, a non-significant p-value is not an indication that your null hypothesis is true. Rather, it indicates a lack of evidence that your null hypothesis is an inaccurate model of the world. The null hypothesis may well be accurate, or you may simply not have collected enough evidence to cast serious doubt on an inaccurate null hypothesis. A common trap is to argue for a practical effect because of some perceived pattern even though the p-value is non-significant. Resist this temptation: the non-significant p-value indicates that the pattern is not particularly unusual even under the null hypothesis. Also resist the temptation to state, or even imply, that a non-significant p-value indicates (a) there is no effect; (b) there is no difference; or (c) the two populations are the same. Absence of evidence is not evidence of absence.
Ultimately, the p-value is only one aspect of statistical analysis, which is, in turn, only one step in the life cycle of science. P-values only describe how likely it would be to get data like yours if the null hypothesis were really how the world worked.
There are, however, some practices that can supplement p-values:
- Graph the data. For example, how different do two groups look when you make box plots of their responses? How much data do you really have? Large sample sizes can help detect even small differences (a topic we will dive into more in a later installment on statistical power). Are there unusual observations?
- More formally, estimate the size of the effect that you are seeing (e.g. via a confidence interval). Is it a potentially large effect that is not significant, or a very small effect that is statistically significant? Is the effect size you see relevant to potential real-world decisions? A 95% confidence interval of [0.01, 0.05] may be significantly different from zero, but if that interval represents, say, the increase in °C of river temperature after a wildfire, is it a relevant difference to whatever decision is at hand?
- Conduct multiple studies testing the same hypothesis. Real world data is noisy. Each additional study allows you to update prior information and possibly provide more conclusive support for or against a hypothesis. This is, in fact, the basic idea behind Bayesian statistics, which we do not have the space to cover here, but go here for an introduction on the topic.
- Use alternative metrics to corroborate your p-values, such as likelihood ratios or Bayes factors.
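The second point above, estimating effect size, can be sketched as follows. In this hypothetical example (the sample sizes and the tiny true difference of 0.03 are invented for illustration), two very large groups differ by a practically negligible amount, yet the confidence interval can easily exclude zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: two large samples whose true means differ by
# only 0.03 units, a statistically detectable but tiny effect.
a = rng.normal(0.00, 1.0, 20_000)
b = rng.normal(0.03, 1.0, 20_000)

diff = b.mean() - a.mean()
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)

# Approximate 95% confidence interval for the difference in means.
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se
print(f"difference = {diff:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

Whether an interval like this excludes zero is a separate question from whether a difference of a few hundredths of a unit matters for any real decision, which is precisely the point of reporting the effect size alongside the p-value.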
Hopefully, we have provided significant enlightenment on p-values. Next time, we will continue thinking about p-values, specifically the risks involved with testing multiple hypotheses in the same analysis.
Thanks for reading and hope you will join us for the next installment in a few weeks!
Etz, A. (2015) “Understanding Bayes: A Look at the Likelihood.” URL: https://alexanderetz.com/2015/04/15/understanding-bayes-a-look-at-the-likelihood/
Kurt, W. (2016) “A Guide to Bayesian Statistics.” URL: https://www.countbayesie.com/blog/2016/5/1/a-guide-to-bayesian-statistics
Trafimow, D. and Marks, M. (2015) “Editorial.” URL: http://www.tandfonline.com/doi/abs/10.1080/01973533.2015.1012991
Wasserstein, R.L., and Lazar, N.A. (2016) “The ASA’s statement on p-values: context, process, and purpose.” URL: http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108
I am working with E. Ashley Steel and Rhonda Mazza at the PNW Research Station to write short articles on how we can improve the way we think about statistics. Consequently, I am posting a series of five blogs that explores statistical thinking, provides methods to train intuition, and instills a healthy dose of skepticism. Subscribe to this blog or follow me @ChenWillMath to know when the next one comes out!
Ideas in this series are based on material from the course, “So You Think You Can Do Statistics?” taught by Dr. Peter Guttorp, Statistics, University of Washington with support from Dr. Ashley Steel, PNW Station Statistician and Quantitative Ecologist, and Dr. Martin Liermann, statistician and quantitative ecologist at NOAA’s Northwest Science Center.