
Simulations toolbox: You’ve got the power

Putting it all together

Welcome to the final installment in my series on understanding statistics! Thanks for following along! I would also like to start this post by thanking E. Ashley Steel and Rhonda Mazza for guiding me throughout this project; I could not have produced these without them.

Before we dive in, I encourage you to read all the previous posts in this series, as they will all be helpful for appreciating this last post:

  1. Your Brain on Statistics
  2. Patterns from Noise
  3. P-hacking and the garden of forking paths
  4. Simulations toolbox: The beauty of permutation tests

We finish by exploring power analyses, which help us quantify and understand our ability to detect statistically significant effects.

Formally, “power” is the probability of detecting a statistically significant effect when a true effect exists. In practice, power analyses are commonly used to:

  1. Determine the number of samples required to detect an effect with a given power
  2. Calculate power given an effect size and sample size

Let us start with a hypothetical comparison of two populations of seedlings in which one received fertilizer and one did not.  We know that heights are usually normally distributed.  Past data suggest that the standard deviation of heights for these seedlings is 40 cm. We can simulate any difference of interest and samples of any size to estimate power.

Imagine that one population has a mean of 180 cm and the fertilizer really does lead to an increase of 20 cm. We have the budget to measure 20 seedlings from a treated and an untreated population.  How likely are we to detect this difference of 20 cm in height between treated and untreated seedlings?  A power analysis based on simulated data will tell us, in advance, the chances of rejecting the null hypothesis in this situation.
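To make this concrete, here is a minimal R sketch of a single simulated experiment under these assumptions (means of 180 and 200 cm, standard deviation of 40 cm, 20 seedlings per group); a Welch t-test stands in for whatever test you would actually use:

    # One simulated experiment: 20 untreated and 20 treated seedlings
    set.seed(1)
    untreated <- rnorm(20, mean = 180, sd = 40)
    treated   <- rnorm(20, mean = 200, sd = 40)  # fertilizer adds 20 cm on average
    t.test(treated, untreated)$p.value           # p-value for this one simulated data set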

Basic steps of determining power:

  1. Select the effect size of interest (e.g. difference of means of 20)
  2. Select the sample size of interest (e.g. 20 samples)
  3. Select the alpha level of interest (i.e. α, the probability of rejecting the null hypothesis when it is true; often 0.05)
  4. Make an educated estimate of parameters relevant to simulating samples and performing statistical tests (e.g. population means, standard deviations, etc.)
  5. Simulate samples of the populations
  6. Perform relevant statistical test (permutation tests could be useful!)
  7. Record the p-value of the test
  8. Repeat steps 5-7 a thousand times. The proportion of those thousand p-values that are significant at the chosen alpha level is the estimated power (see the R sketch below).
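Here is a minimal R sketch that applies steps 5-8 to the seedling example (difference of 20 cm, standard deviation of 40 cm, 20 seedlings per group, alpha = 0.05); a t-test is used for simplicity, and a permutation test could be swapped in:

    # Estimate power by simulation (steps 5-8)
    set.seed(42)
    n_sims <- 1000
    p_values <- replicate(n_sims, {
      untreated <- rnorm(20, mean = 180, sd = 40)   # step 5: simulate samples
      treated   <- rnorm(20, mean = 200, sd = 40)
      t.test(treated, untreated)$p.value            # steps 6-7: run the test, record the p-value
    })
    mean(p_values < 0.05)                           # step 8: proportion significant = estimated power

With these settings the estimated power comes out around one third, in line with the figure below; the exact value will vary a little from run to run.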

Applying these steps to our example above, we find:

[Figure: p-values from 1000 simulated tests of the two normally distributed populations]

Here, we display the p-values from 1000 statistical tests, each based on a simulation of 20 observations from each of the two populations above. We see that we would have about a 1 in 3 chance of detecting a difference in means of 20 between these two populations if we were to take 20 samples per group for our study. Not very good odds if you plan to gamble significant resources to collect the data. It is also a great reminder that when you do not see significance, it does not mean that there is no difference. In this example, 2/3 of the time we do not get a significant result even though there is a population-level difference of 20 cm!

We might want to know how many samples it would take to achieve, say, an 80% success rate of detecting this difference (i.e. a power of 0.8). By redoing the above analysis for different sample sizes, we can see:

[Figure: estimated power versus sample size for a difference in means of 20 cm]

It would take around 60 samples to detect that difference of 20 cm in the population means 80% of the time. So, before we spend any resources on doing the experiment, power analysis gives us an idea of how feasible it is to expect to detect significant trends if they really are there. We can start asking logistical questions like “Is it possible to obtain 60 samples to support a sufficiently powerful statistical test?”
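If you are curious how a power-versus-sample-size curve like this can be built, here is a sketch that wraps the simulation above in a function and loops it over a range of per-group sample sizes (the estimates will wobble a bit from run to run):

    # Estimated power across a range of per-group sample sizes
    power_at_n <- function(n, n_sims = 1000, delta = 20, sd = 40, alpha = 0.05) {
      p_values <- replicate(n_sims,
        t.test(rnorm(n, 180 + delta, sd), rnorm(n, 180, sd))$p.value)
      mean(p_values < alpha)
    }
    sample_sizes <- seq(10, 100, by = 10)
    power_curve <- sapply(sample_sizes, power_at_n)
    plot(sample_sizes, power_curve, type = "b",
         xlab = "Samples per group", ylab = "Estimated power")
    abline(h = 0.8, lty = 2)   # target power of 0.8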

This is all well and good in theory, but you might be wondering “How do we know what effect size we want to try and detect?” or “What do we set as the standard deviation in these analyses?” Those are great questions. After all, if we knew these values, we would not need to conduct a study! Here is where educated estimates come in, whether from theory and common sense (e.g. an effect size of X would be of ecological significance), previous studies, pilot studies, or regulations (e.g. a treatment must have at least Y units of impact on Z).

In fact, let us say you have some data from a small pilot study, which is plotted below:

[Figure: sample data from a small pilot study]

This could be the height of a population of plants, which you assume to be normally distributed, and you are interested in the effect of a new fertilizer on the plants’ average height. A change in average height of 10 cm would mark the fertilizer as a success, so you want to know how many samples you’d need to detect such a change if it really happened. From the pilot data, you can estimate the standard deviation of this population. Then we can follow similar steps as in the example above to obtain power for different sample sizes:

[Figure: estimated power versus sample size based on the pilot-study standard deviation]

So, we see that it would take just under 100 samples to detect a difference in means of 10 cm, if the standard deviation of these pilot data is representative of the populations we plan to sample.
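Because the pilot measurements themselves are not reproduced here, the sketch below uses a hypothetical vector called pilot_heights; it estimates the standard deviation from the pilot sample and then uses R’s built-in power.t.test() as a quick analytic counterpart to the simulation approach:

    # Hypothetical pilot data (replace with your own measurements)
    pilot_heights <- c(152, 168, 175, 181, 190, 203, 146, 177, 185, 160)
    sd_est <- sd(pilot_heights)   # standard deviation estimated from the pilot study

    # Samples per group needed to detect a 10 cm difference with 80% power
    power.t.test(delta = 10, sd = sd_est, sig.level = 0.05, power = 0.8)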

Past discussion belabored the importance of not declaring a significant difference when there really is not one. Given such caution, you might think that a low-power statistical test should not be a big concern, but see our first experiment above for just one example of the link between low statistical power and the misinterpretation of test results. In fact, statistical power and interpretation of results are always closely tied. Imagine a low-powered statistical test where you reject the null hypothesis 5 times out of 100 when it is true (alpha = 0.05) and only 10 times out of 100 when it is false (power = 0.10, really bad but, sadly, not uncommon). Now you are looking at a rejection of that null hypothesis for a particular study and you think, “Yay, I proved a difference!” Wait! In this scenario, if the null hypothesis has an equal probability of being true or false, 5 out of 15, or a third, of the rejections occur when the null hypothesis is actually true! If the test instead rejected the null hypothesis 80 times out of 100 when it was false (power = 0.80), then only about 6% of the rejections would occur when the null hypothesis is true, a much better error rate.
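The arithmetic behind that comparison can be checked in a few lines of R, assuming the null hypothesis is true in half of the studies:

    # Out of 200 hypothetical studies: the null is true in 100 and false in 100
    alpha <- 0.05
    false_rejections     <- 100 * alpha   # 5 rejections when the null is true
    true_rejections_low  <- 100 * 0.10    # 10 rejections when power = 0.10
    true_rejections_high <- 100 * 0.80    # 80 rejections when power = 0.80
    false_rejections / (false_rejections + true_rejections_low)    # about 1/3 of rejections are false
    false_rejections / (false_rejections + true_rejections_high)   # about 6% of rejections are false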

Although it is very difficult to explain in an intuitively satisfying way, there are many situations in which significant results from low-powered tests are very likely to be wrong. It is a great example of how interpreting p-values depends on many factors associated with the context of the study, such as statistical power and the probability that the null hypothesis is true or false (see Nuzzo 2014).

It is worth noting that power analyses will vary depending on the study design, statistical analyses, and other factors unique to each experiment. For further reading, UCLA’s “Introduction to power analysis” (see Sources) is a good start. A great resource for doing power analyses and determining your sample size requirements is http://powerandsamplesize.com. Because power analyses touch on all aspects of study design, they are useful to conduct during exploratory data analyses and can buffer against our natural tendencies to rush through statistical analyses.

Like the previous installment, we have included the R code that generated all the graphics in this article, which you can modify to suit the statistical needs of your study. As you have hopefully gathered from all of these installments, there is no one-size-fits-all tool, but thinking through your assumptions and carefully customizing your statistical analyses will engender better science.

This concludes the series on statistical thinking. We hope you have enjoyed these five installments and have learned something that you can apply to your own work! Thanks once again for reading.

 

Sources

HyLown Consulting LLC. 2017. Power and Sample Size. URL: http://powerandsamplesize.com/

Nuzzo, R. 2014. Statistical errors: P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume. Nature 506:150-152.

UCLA. 2017. Introduction to power analysis. URL: https://stats.idre.ucla.edu/other/mult-pkg/seminars/intro-power/

 

This summer statistical thinking series is written by Will Chen, recent MS graduate from the Quantitative Ecology and Resource Management program at the University of Washington. Will is pursuing a career in science communication, following his participation in and organization of science communication training such as ComSciCon and University of Washington’s Engage program. You can contact him via wchen1642@gmail.com or follow him on Twitter @ChenWillMath. Ideas in this series are based on material from the course, “So You Think You Can Do Statistics?” taught by Dr. Peter Guttorp, Statistics, University of Washington with support from Dr. Ashley Steel, PNW Station Statistician and Quantitative Ecologist, and Dr. Martin Liermann, statistician and quantitative ecologist at NOAA’s Northwest Science Center.
