
Simulations toolbox: You’ve got the power

Putting it all together

Welcome to the final installment in my series on understanding statistics! Thanks for following along! I would also like to start this post by thanking E. Ashley Steel and Rhonda Mazza for guiding me throughout this project; I could not have produced these without them.

Before we dive in, I encourage you to read all the previous posts in this series, as they will all be helpful for appreciating this last post:

  1. Your Brain on Statistics
  2. Patterns from Noise
  3. P-hacking and the garden of forking paths
  4. Simulations toolbox: The beauty of permutation tests

We finish by exploring power analyses, which help us quantify and understand our ability to detect statistically significant effects.

Formally, the definition of “power” is the probability that you detect a significant effect when there is one. In practice, power analyses are commonly used to:

  1. Determine the number of samples required to detect an effect with a given power
  2. Calculate power given an effect size and sample size

Let us start with a hypothetical comparison of two populations of seedlings in which one received fertilizer and one did not. We know that heights are usually normally distributed. Past data suggest that the standard deviation of heights for these seedlings is 40 cm. We can simulate any difference of interest and samples of any size to estimate power.

Imagine that the untreated population has a mean height of 180 cm and the fertilizer really does lead to an increase of 20 cm. We have the budget to measure 20 seedlings from each of a treated and an untreated population. How likely are we to detect this difference of 20 cm in height between treated and untreated seedlings? A power analysis based on simulated data will tell us, in advance, the chances of rejecting the null hypothesis in this situation.

Basic steps of determining power:

  1. Select the effect size of interest (e.g. difference of means of 20)
  2. Select the sample size of interest (e.g. 20 samples)
  3. Select the alpha level of interest (i.e. α, or the probability of rejecting the null hypothesis when it is true; often .05)
  4. Make an educated estimate of parameters relevant to simulating samples and performing statistical tests (e.g. population means, standard deviations, etc.)
  5. Simulate samples of the populations
  6. Perform relevant statistical test (permutation tests could be useful!)
  7. Record the p-value of the test
  8. Repeat steps 5-7 a thousand times. What proportion of the thousand p-values were significant at the chosen alpha level? That proportion is the estimated power.
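
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the R code that accompanies this post: it assumes normally distributed heights, and for simplicity it uses a two-sided z-test that treats the standard deviation of 40 as known, in place of a t-test or permutation test. The function names, the seed, and the default arguments are all made up for this example.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(control, treated, sd):
    """Two-sided p-value for a difference in means, treating sd as known."""
    se = sd * (1 / len(control) + 1 / len(treated)) ** 0.5
    z = (mean(treated) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_power(n, effect, sd=40.0, base_mean=180.0, alpha=0.05,
                   n_sims=1000, seed=1):
    """Estimate power as the share of simulated studies with p < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Step 5: simulate one sample from each population
        control = [rng.gauss(base_mean, sd) for _ in range(n)]
        treated = [rng.gauss(base_mean + effect, sd) for _ in range(n)]
        # Steps 6-7: run the test and check its p-value against alpha
        if z_test_p_value(control, treated, sd) < alpha:
            rejections += 1
    # Step 8: the proportion of significant results is the estimated power
    return rejections / n_sims


power_20 = simulate_power(n=20, effect=20)
print(f"Estimated power with 20 seedlings per group: {power_20:.2f}")
```

Swapping in a different test only requires replacing `z_test_p_value`; the rest of the loop stays the same.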

Applying these steps to our example above, we find:


Here, we display all the p-values from 1,000 statistical tests, one for each of 1,000 simulations of 20 observations from each of the two populations above. We see that we would have about a 1 in 3 chance of detecting a difference in means of 20 cm between these two populations if we were to take 20 samples from each for our study. Not very good odds if you plan to gamble significant resources to collect the data. It is also a great reminder that when you do not see significance, it does not mean that there is no difference. In this example, 2/3 of the time we do not get a significant result even though there is a population-level difference of 20 cm!

We might want to know how many samples it would take to achieve, say, an 80% success rate of detecting this difference (i.e. a power of 0.8). By redoing the above analysis for different sample sizes, we can see:


It would take around 60 samples per group to detect that difference of 20 in the population means 80% of the time. So, before we spend any resources on doing the experiment, power analysis gives us an idea of how feasible it is to expect to detect significant trends if they really are there. We can start asking logistical questions like “Is it possible to obtain 60 samples per group to support a sufficiently powerful statistical test?”
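
As a rough cross-check on the simulation, the textbook normal-approximation formula for comparing two means gives a similar answer. The sketch below is an approximation under stated assumptions (equal group sizes, a known standard deviation, a two-sided test), and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate n per group: 2 * ((z_alpha + z_power) * sd / effect)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


n_needed = samples_per_group(effect=20, sd=40)
print(n_needed)  # 63, in line with "around 60"
```

Because the required sample size scales with the square of `sd / effect`, halving the effect size you want to detect roughly quadruples the number of samples you need.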

This is all well and good in theory, but you might be wondering “How do we know what effect size we want to try and detect?” or “What do we set as the standard deviation in these analyses?” Those would be great questions. After all, if we knew these values, we would not need to conduct a study! Here is where educated estimates come in, whether that be theory and common sense (e.g. an effect size of X would be of ecological significance), previous studies, pilot studies, or regulations (e.g. a treatment must have at least Y units of impact on Z).

In fact, let us say you have some data from a small pilot study, which is plotted below:


This could be the height of a population of plants, which you assume to be normally distributed, and you are interested in the effect of a different new fertilizer on the plants’ average height. A change in average height of 10 cm would be considered a successful fertilizer, so you want to know how many samples you’d need to detect such a change if it really happened. From the pilot data, you can estimate the standard deviation of this population. Then we can follow similar steps as the above example to obtain power for different sample sizes:


So, we see that it would take just under 100 samples to detect a difference in means of 10 cm, if the standard deviation of these pilot data is representative of the populations we plan to sample.

Past discussion belabored the importance of not declaring the existence of a significant difference when there really is not one. Given such caution, you would think that a low-power statistical test should not be a big concern, but see our first experiment above for just one example of the link between low statistical power and the misinterpretation of test results. In fact, statistical power and interpretation of results are always closely tied. Imagine a low-powered statistical test where you reject the null hypothesis 5 out of 100 times when it is true (alpha = 0.05) and reject the null hypothesis only 10 times out of 100 when it is false (power = 0.10, which is really bad but, sadly, not uncommon). Now you are looking at rejection of that null hypothesis for a particular study and you think, “Yay, I proved a difference!” Wait! In this scenario, if the null hypothesis has equal probability of being true or false, 5 out of 15, or a third, of the rejections occur when the null hypothesis is actually true! If the test instead rejected the null hypothesis 80 times out of 100 when it was false (power = 0.80), then only about 6% of the rejections (5 out of 85) would occur when the null hypothesis is true, a much better error rate.
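
The arithmetic in this scenario is easy to reproduce. The small function below is a sketch with a made-up name; `prior_null` is the assumed probability that the null hypothesis is true, set to 0.5 to match the equal-odds scenario above.

```python
def false_rejection_share(alpha, power, prior_null=0.5):
    """Share of all rejections that happen when the null is actually true."""
    false_rejections = prior_null * alpha        # null true, rejected anyway
    true_rejections = (1 - prior_null) * power   # null false, correctly rejected
    return false_rejections / (false_rejections + true_rejections)


low_power = false_rejection_share(alpha=0.05, power=0.10)
high_power = false_rejection_share(alpha=0.05, power=0.80)
print(f"power 0.10: {low_power:.1%} of rejections are false")   # 33.3%
print(f"power 0.80: {high_power:.1%} of rejections are false")  # 5.9%
```

Changing `prior_null` shows how much the answer also depends on how plausible the null hypothesis was to begin with.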

Although it is very difficult to explain in an intuitively satisfying way, there are many situations in which significant results from low-powered tests are very likely to be wrong.  It’s a great example of how interpreting p-values depends on many factors associated with the context of the study such as statistical power and the probability that the null hypothesis is true or false (see Nuzzo 2014).

It is worth noting that power analyses will vary depending on the study design, statistical analyses, and other factors unique to each experiment. For further reading, UCLA's “Introduction to power analysis” (listed below) is a good start. A great resource for doing power analyses and determining your sample size requirements is http://powerandsamplesize.com. Because power analyses touch on all aspects of study design, they are useful to conduct during exploratory data analyses and can buffer against our natural tendencies to rush through statistical analyses.

As in the previous installment, we have included the R code that generated all the graphics in this article, which you can modify to suit the statistical needs of your study. As you have hopefully gathered from all of these installments, there is no one-size-fits-all tool, but thinking through your assumptions and carefully customizing your statistical analyses will engender better science.

This concludes the series on statistical thinking. We hope you have enjoyed these five installments and have learned something that you can apply to your own work! Thanks once again for reading.



HyLown Consulting LLC (2017) “Power and Sample Size.” URL: http://powerandsamplesize.com/

Nuzzo, R. 2014.  Statistical Errors: P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume. Nature 506:150-152.

UCLA (2017) “Introduction to power analysis.” URL: https://stats.idre.ucla.edu/other/mult-pkg/seminars/intro-power/


This summer statistical thinking series is written by Will Chen, recent MS graduate from the Quantitative Ecology and Resource Management program at the University of Washington. Will is pursuing a career in science communication, following his participation in and organization of science communication training such as ComSciCon and University of Washington’s Engage program. You can contact him via wchen1642@gmail.com or follow him on Twitter @ChenWillMath. Ideas in this series are based on material from the course, “So You Think You Can Do Statistics?” taught by Dr. Peter Guttorp, Statistics, University of Washington with support from Dr. Ashley Steel, PNW Station Statistician and Quantitative Ecologist, and Dr. Martin Liermann, statistician and quantitative ecologist at NOAA’s Northwest Science Center.


Simulations toolbox: The beauty of permutation tests

A “Choose-your-own-adventure” hypothesis test

The last three posts focused on revealing our human mistakes in interpreting statistics and providing solutions to overcome those pitfalls. Now, we will focus on tools that you might consider using in every statistical project.

First up is the permutation test, an alternative to the t-test for comparing two populations. “An alternative,” you say, “why would we ever need that?” Unfortunately, real-world experiments do not always yield perfect data. Sometimes you only have 15 samples. Sometimes the populations were unevenly sampled. Often there is no guarantee that the underlying distributions are normal with equal variances in the two populations. While the t-test is robust, you may be uncomfortable with violating assumptions of the t-test. For example, you might have samples from two populations that look like the samples below:


Permutation tests shine here because they make fewer assumptions about your data. Rather than assume any underlying distribution, the first step in a permutation test is to construct a null distribution from the data by shuffling (or “permuting”) the data so that the population labels are scrambled. After all, if the two populations are the same, then the values should be exchangeable and shuffling the labels should be meaningless. Repeating this shuffle 1,000 times or so and calculating the difference in means each time provides a null distribution against which to assess the unusualness of the observed (unpermuted) difference. The proportion of values in the null distribution that are at least as extreme as the actual difference is the p-value of the permutation test. Below, the red lines indicate the observed difference in means for the comparison shown above. A comparison of the two observed means yields a p-value of 0.004.


Basic steps of a permutation test (a.k.a. Randomization Test):

  1. Calculate the observed test statistic, for example the difference between the means of Sample 1 versus Sample 2.
  2. Permute (i.e. shuffle) the sample labels of the observations to simulate a new Sample 1 and Sample 2 from the same data.
  3. Calculate the same test statistic for the permuted data. This gives one example of what the test statistic would look like if the null hypothesis of no difference between populations were true.
  4. Repeat steps 2 and 3 a thousand times. With each repetition, you get an additional example of what the test statistic looks like, by chance, when the null hypothesis is true.
  5. Compare the observed test statistic (step 1) to the distribution of values that describe how things would be if the null hypothesis were true. The proportion of the permuted statistics that are at least as extreme as the observed value is the p-value for the permutation test.
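
Here is a minimal Python sketch of those five steps. It is not the R code linked from this post; the function name, the seed, and the two tiny samples are invented purely for illustration, and the test is two-sided.

```python
import random
from statistics import mean


def permutation_test(sample1, sample2, statistic=mean, n_perm=1000, seed=1):
    """Two-sided permutation test on statistic(sample1) - statistic(sample2)."""
    rng = random.Random(seed)
    # Step 1: the observed test statistic
    observed = statistic(sample1) - statistic(sample2)
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    extreme = 0
    for _ in range(n_perm):          # Step 4: repeat many times
        rng.shuffle(pooled)          # Step 2: scramble the labels
        permuted = statistic(pooled[:n1]) - statistic(pooled[n1:])  # Step 3
        if abs(permuted) >= abs(observed):
            extreme += 1
    # Step 5: share of permuted statistics at least as extreme as observed
    return observed, extreme / n_perm


# Made-up samples with clearly different means
group_a = list(range(1, 11))
group_b = list(range(11, 21))
observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference = {observed}, p-value = {p_value}")
```

The `statistic` argument is where the flexibility comes from: passing `statistics.variance` instead of `mean`, for example, turns this into a test of a difference in variances without changing anything else.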

Comparing populations in other ways can also be useful. Because you define your test statistic as part of the permutation test, you can design practically any statistic you want to compare. Instead of looking at the difference in means, for example, you can calculate the difference in variances for each permutation. As seen below, there is a significant difference in the variances of the two samples we used earlier. You can even look at the difference in e^x, where x is the measured data, if you wanted. The choices are endless and the same procedure applies. Simply substitute in your custom metric wherever you would calculate a test statistic.


Of course, permutation tests are not without limitations. The nature of shuffling labels in the data set to form a null distribution assumes that there is no structure in your data – for example, no correlations or grouping among samples that would be lost when permuting. Additionally, permutation tests only provide p-values. As we have discussed in a previous installment, confidence intervals and effect sizes are also important metrics for judging the statistical and practical significance of observed patterns. These metrics can also be captured using simulation approaches.

Nevertheless, when conditions are right, permutation tests are a powerful tool for hypothesis testing that circumvents some of the assumptions of parametric tests and allows increased flexibility in making insightful comparisons between populations. Here is the R code that generated all the graphics and tests in this article. You can see how a permutation test compares to a t-test and try creating test statistics other than the mean. Feel free to experiment!

Thanks again for reading, and hope you will join us in a few weeks for the final post of this series! It will cover power analyses, a simulation-based tool that can aid in study design and provide insight into the strength of your statistical tests. Stay tuned!



Ong, D. C. (2014) “A primer to bootstrapping; and an overview of doBootstrap.” URL: https://web.stanford.edu/class/psych252/tutorials/doBootstrapPrimer.pdf

Past Articles in the Series

  1. Your Brain on Statistics
  2. Patterns from Noise
  3. P-hacking and the garden of forking paths


I am working with E. Ashley Steel and Rhonda Mazza at the PNW Research Station to write short articles on how we can improve the way we think about statistics. Consequently, I am posting a series of five blogs that explores statistical thinking, provides methods to train intuition, and instills a healthy dose of skepticism. Subscribe to this blog or follow me @ChenWillMath to know when the next one comes out!


P-hacking and the garden of forking paths

Multiple comparisons can lead to spurious conclusions

This time, we continue our discussion around p-values by discussing a common trap that people fall into when analyzing data.

Imagine the following: you are interested in the influence of hydrology on the abundance of a particular fish species. Because there are many aspects of the hydrological cycle that might play a role in regulating this species, you perform multiple tests to see if there is a statistically significant relationship between fish abundance and hydrological flow metrics such as summer flow magnitude, timing of flooding, duration of low flows, and more. And lo and behold, you find that summer flow magnitude is positively correlated with fish abundance, with a p-value less than 0.05! Great, time to write up the paper describing how summer flows are important in regulating the abundance of this species, right?

Well… hold on a minute. Unfortunately, this would be a textbook example of p-hacking – sorting through numerous statistical tests to determine which aspects of a particular hypothesis (e.g. that hydrology influences fish abundance) are significant, and reporting on only the significant findings. The issue when looking at multiple comparisons like this is that the more tests you perform, the more likely you are to see a “weird” response.

Let us consider a hypothetical scenario where we know the null hypothesis, e.g. there is no effect of a particular flow metric on fish abundance, to be true. Setting our significance level to α = 0.05, how often do we falsely reject the null hypothesis and incorrectly conclude that there really is an effect when making different numbers of comparisons?

Probability of falsely rejecting the null hypothesis (red) when it is true


Unfortunately, as we do more and more comparisons, we are more and more likely to get a significant p-value simply by chance alone. For a single statistical test using a significance threshold of 0.05, there is a 5% chance to reject the null hypothesis even when it is true (known as a Type I error). With five independent comparisons using the same significance level, there is only a 0.95^5 = 77.4% chance of correctly concluding that there is no relationship, or about a 23% chance of finding a significant relationship even though none exists. The chance of finding at least one statistically significant result simply by chance climbs to 40.1% with ten tests. This is a hazard because, as you might recall from the last installment, it is impossible to differentiate between a truly significant relationship and simply a weird occurrence. By the time you’ve conducted over 20 tests, including exploratory graphs used as unofficial tests, on data where there is no true relationship, you’ve got over a 60% chance of at least one “weird” result that leads you to wrongly conclude that there is a relationship.
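
These percentages all come from one formula: the chance of at least one false rejection across k tests is 1 - (1 - α)^k. The short sketch below assumes the tests are independent and every null hypothesis is true; the function name is made up for this example.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false rejection across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: {familywise_error_rate(k):.1%}")
# Prints 5.0%, 22.6%, 40.1%, and 64.2%
```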

For a more hands-on experience, try p-hacking in this interactive from FiveThirtyEight, which puts you in the statistician’s seat of a study on the health of the United States economy under the two major political parties. After playing around with their interactive visualization, you will find that you can “obtain evidence” for literally any hypothesis you want about how the United States economy is affected by whether Democrats or Republicans are in office. It just takes some tweaking of predictors and responses.

The garden of forking paths

Now, you might be thinking, “Jeez, with all these challenges around multiple tests, why not just focus on the hypotheses and statistical tests that are most likely to be interesting?” The problem comes when you realize that there are a multitude of ways to approach a hypothesis, and there are numerous informal decisions we make when looking at our data about which tests or data we use: which covariates to focus on, how to transform the data, how many categorical bins to divide the data into, and so on. These endless choices lead to what statistician Andrew Gelman calls the “garden of forking paths”, where researchers may explore their data extensively, but only report a subset of the statistical methods they ended up utilizing (e.g. the models that fit the data better).

Avoiding this pitfall

There are several possibilities for mitigating the hazard of falsely rejecting null hypotheses when making multiple comparisons. First, each time you conduct multiple statistical tests related to a single overarching hypothesis, you can adjust your significance level from α = 0.05 to α* = 0.05/n, where n is the number of tests you are performing. This is known as a Bonferroni correction, and it brings the probability of a false rejection over the whole suite of tests in line with the rate you would see if you were only conducting one test.
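
A Bonferroni correction takes only a couple of lines to apply. In the sketch below the p-values are hypothetical numbers chosen for illustration, not results from any real study, and the function name is invented.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Compare each p-value against alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]


# Hypothetical p-values from five tests of one overarching hypothesis
p_values = [0.004, 0.020, 0.030, 0.250, 0.600]
threshold, decisions = bonferroni_reject(p_values)
print(f"adjusted threshold: {threshold:.3f}")
print(decisions)  # only the first test stays significant
```

Note that three of these p-values fall below the usual 0.05, but only one survives the adjusted threshold of 0.01.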

Distinguish between hypothesis testing and exploratory data analysis

We often learn that exploratory data analysis should be a precursor to statistical analysis. Indeed, checking for erroneous observations and distributional assumptions needs to come before statistical testing. Other forms of exploratory analysis, however, can be conducted after formal hypothesis testing. In the above example, what if the researcher had identified, from previous research, the two strongest hypotheses about how hydrology might impact fish abundance and only formally conducted those two tests? The scientist would then be free to continue graphing and exploring the data to identify other facets of the hydrological cycle that appear to have a relationship with fish abundance as ideas for future research or to spark new mental models of how the ecological system might be structured. These ideas and models could then be tested with new, independent data.

Establish hypotheses and statistical methodology early

This leads to the crucial importance of establishing hypotheses and methods, including statistical models, from the outset, ideally before data collection even begins. You can formally “preregister” your hypotheses or you can at least be mindful of not altering your methodology based on data exploration. This practice will dissuade you from unintentional p-hacking and from digging too hard for expected results. Consequently, you will fortify your statistical assertions and lend more credence to your science when you follow up with the full results of your analyses.

Hopefully, this has been helpful for being cognizant of p-hacking and will prevent you from being an unintentional p-hacker. The article that accompanies the FiveThirtyEight interactive above is also a great read for how p-hacking is not the end of credible science.

Thanks again for reading, and hope you’ll be back for the next post in this series! We will be diving into simulations to aid in your statistical analyses.



Aschwanden, C. (2015) “Science Isn’t Broken.” URL: https://fivethirtyeight.com/features/science-isnt-broken/

Center for Open Science (2017) “Preregistration Challenge” URL: https://cos.io/prereg/

Gelman, A. and Loken, E. (2013) “The garden of forking paths: Why multiple comparisons can be a problem, even when there is no ‘fishing expedition’ or ‘p-hacking’ and the research hypothesis was posited ahead of time.” URL: http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf

Past Articles in the Series

  1. Your Brain on Statistics
  2. Patterns from Noise




Patterns from Noise

What does the p-value really tell us?

Welcome back! If you missed the previous installment, you can find it here.

Continuing the series, we’ll be talking about the p-word. That’s right, “p-values”. A concept so central to statistics, yet one of the most often misunderstood.

Not too long ago, the journal Basic and Applied Social Psychology straight up banned p-values from appearing in its articles. This and other controversies about the use and interpretation of p-values led the American Statistical Association (ASA) to voice their thoughts on p-values; writing such recommendations for the fundamental use of statistics was unprecedented for the organization.

Part of the confusion stems from the complacency with which we teach p-values, leading to blind applications of p-values as the litmus test for significant findings.

Q: Why do so many colleges and grad schools teach p = 0.05?
A: Because that’s still what the scientific community and journal editors use.

Q: Why do so many people still use p = 0.05?
A: Because that’s what they were taught in college or grad school.

– George Cobb

Snide comments aside, let us unpack what a p-value does and does not tell us. First, take a look at the following twenty sets of randomly generated data:

Each one of the boxes contains 50 points whose x-y coordinates were randomly generated from a normal distribution with mean 0 and variance 1. Yet, we see that there is occasionally a set of points that appears to have a trend, such as the one highlighted in red, which turns out to exhibit a correlation of 0.45. If even random noise can display patterns, how do we discern when we have a real mechanism influencing some response versus simply random data? P-values provide this support by giving us a measure of how “weird” an observed pattern is, given a proposal of how the world works.
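
You can generate your own version of this experiment. The sketch below draws twenty sets of 50 independent standard-normal points and reports the strongest correlation among them. The seed is arbitrary, so the numbers will not match the figure, and `pearson_r` is a small helper written for this example.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(2017)  # arbitrary seed
correlations = []
for _ in range(20):
    # x and y are generated independently, so any trend is pure noise
    xs = [rng.gauss(0, 1) for _ in range(50)]
    ys = [rng.gauss(0, 1) for _ in range(50)]
    correlations.append(pearson_r(xs, ys))

strongest = max(correlations, key=abs)
print(f"strongest correlation among 20 noise-only panels: {strongest:.2f}")
```

Rerunning with different seeds gives a feel for how large a correlation can appear when nothing is going on.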

More formally, the definition of a p-value is “the probability under a specified statistical model that a statistical summary of the data would be equal to or more extreme than its observed value” (taken from the ASA). Note that this says nothing about the real world. Rather, it measures how much doubt we have about one particular statistical view of the world. If our null hypothesis were true and our model of the world pretty accurate, a “statistically significant p-value” means that something unlikely has happened (where unlikely could be defined as a 1 in 20 chance). So unlikely that it throws significant doubt into whether that null hypothesis is a very good model of the world after all. It is important to note, however, that this does not mean that your alternative hypothesis is true.

Conversely, a non-significant p-value is not an indication that your null hypothesis is true. Rather, it suggests a lack of evidence that your null hypothesis is an inaccurate model of the world. The null hypothesis may well be accurate or you may simply not have collected enough evidence to throw significant doubt on an inaccurate null hypothesis. A common trap is to argue for a practical effect because of some perceived pattern even though the p-value is non-significant. Resist this temptation, as the non-significant p-value indicates that the pattern is not particularly unusual even under the null hypothesis. Also resist the temptation to state or even imply that the non-significant p-value indicates (a) there is no effect; (b) there is no difference; or (c) the two populations are the same. Absence of evidence is not evidence of absence.

Ultimately, the p-value is only one aspect of statistical analysis, which is, in turn, only one step in the life-cycle of science. P-values only describe how likely it might be to get data like yours if the null hypothesis were really how the world worked.

There are, however, some practices that can supplement p-values:

  1. Graph the data. For example, how different do two groups look when you make box plots of their responses? How much data do you really have? Large sample sizes can help elucidate significant differences (a topic we will dive into more in a later installment about statistical power). Are there unusual observations?
  2. More formally, estimate the size of the effect that you are seeing (e.g. via a confidence interval). Is it a potentially large effect that is not significant or a very small effect that is statistically significant? Is the effect size you see relevant to potential real-world decisions? A 95% confidence interval of [0.01, 0.05] may be significantly different from zero, but if that interval represents say the increase in °C of river temperature after a wildfire, is it a relevant difference to whatever decision is at hand?
  3. Conduct multiple studies testing the same hypothesis. Real-world data are noisy. Each additional study allows you to update prior information and possibly provide more conclusive support for or against a hypothesis. This is, in fact, the basic idea behind Bayesian statistics, which we do not have the space to cover here, but see Kurt (2016) in the references below for an introduction to the topic.
  4. Use alternative metrics to corroborate your p-values, such as likelihood ratios or Bayes factors.

Hopefully, we have provided significant enlightenment on p-values. Next time, we will continue thinking about p-values, specifically the risks involved with testing multiple hypotheses in the same analysis.

Thanks for reading and hope you will join us for the next installment in a few weeks!


Etz, A. (2015) “Understanding Bayes: A Look at the Likelihood.” URL: https://alexanderetz.com/2015/04/15/understanding-bayes-a-look-at-the-likelihood/

Kurt, W. (2016) “A Guide to Bayesian Statistics.” URL: https://www.countbayesie.com/blog/2016/5/1/a-guide-to-bayesian-statistics

Trafimow, D. and Marks, M. (2015) “Editorial.” URL: http://www.tandfonline.com/doi/abs/10.1080/01973533.2015.1012991

Wasserstein, R.L., and Lazar, N.A. (2016) “The ASA’s statement on p-values: context, process, and purpose.” URL: http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108


Past Articles in the Series

  1. Your Brain on Statistics


Bonus Article: A different type of p-value…





Your Brain on Statistics

Are apparent patterns indicative of population differences or simply caused by different sample sizes?

I am working with E. Ashley Steel and Rhonda Mazza at the PNW Research Station to write short articles on how we can improve the way we think about statistics. Consequently, I am posting a series of five blogs that explores statistical thinking, provides methods to train intuition, and instills a healthy dose of skepticism. Subscribe to this blog or follow me @ChenWillMath to know when the next one comes out!

We begin by looking at how the wiring of the brain interferes with our ability to process statistics. The way we internalize information and make decisions can be broken down into two categories:

  • System 1 thinking that is automatic and intuition-based
  • System 2 thinking that is more deliberate and analytic

Unfortunately, the impulsive nature of System 1 thinking tends to get us into trouble when we interpret statistics. For example, look at the following map of the lower 48 United States.


It illustrates the counties that exhibit the highest 10% of kidney cancer rates (i.e. number of per capita kidney cancer cases), colored by whether they are predominantly rural or urban. Note that there are more rural counties represented on the map than urban counties and that many of the cancer-prevalent counties are in the South or Midwest.

Why might that be? Perhaps rural areas tend to have less access to clean water, which could adversely affect kidney function? Perhaps there are more factories in these areas leading to more health issues?

Before you get too far, let me show you another map, this time of the counties in the bottom 10% of kidney cancer incidence.


Interestingly, rural areas appear over-represented among the counties with the lowest kidney cancer rates as well! What is going on?

This was the conundrum that Howard Wainer delved into in an article titled “The most dangerous equation”, published in the American Scientist in 2007. Wainer explained how trends can appear even when the underlying probability of an event occurring is constant. Using data from the United States Census Bureau, we have simulated that scenario in the maps above.

The effect you are seeing has nothing to do with rural versus urban, though it would make a believable headline. The real culprit is population size. It turns out that smaller samples, such as less populous counties, are more prone to exhibiting extreme results. Let us explore this further.

Imagine you flipped 3 (fair) coins. The chance of getting either all heads or all tails is 25%. Now what is the chance of getting all heads or all tails when flipping 30 coins? Less than 1 in 500 million. Despite the identical chance for any one coin to turn up heads (or tails), larger collections of coin flips are far less likely to come up all heads or all tails.
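The arithmetic behind the coin-flip example can be checked directly. The short Python sketch below is only an illustration (the function name `prob_all_same` is made up for this post): with fair coins there are 2 favorable outcomes, all heads or all tails, out of 2^n equally likely ones.

```python
from fractions import Fraction

def prob_all_same(n_coins):
    """Exact probability that n fair coins land all heads or all tails."""
    # Two favorable outcomes (all heads, all tails) out of 2**n_coins
    # equally likely sequences.
    return Fraction(2, 2 ** n_coins)

for n in (3, 30):
    p = prob_all_same(n)
    print(f"{n} coins: {float(p):.3g} (1 in {p.denominator // p.numerator:,})")

# 3 coins: 0.25 (1 in 4)
# 30 coins: 1.86e-09 (1 in 536,870,912)
```

The chance for each individual coin never changes; only the size of the collection does, and that alone makes an extreme result far rarer.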

The take-home point: our brains are predisposed to look for and interpret patterns. However, strong patterns, regardless of tempting explanations, can be caused by random chance. Here, differences in sample size across counties are responsible for the observed differences in kidney cancer rates, even though the simulation gave every individual the same risk of kidney cancer (real-world risk is likely not constant, but that is a different discussion).

So, what should scientists and science readers do? The first step is to remain vigilant. When confronted with apparent patterns, consider whether they might be due to chance alone.  For data like these, ask if the more extreme responses are exhibited by the samples that contain fewer individuals or cover smaller areas. You might also consider using simulations to assess how much random chance contributes to apparent patterns. Simulations will be discussed in future installments of this summer statistical thinking series.
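Here is a minimal sketch of what such a simulation could look like in Python. It is not the code behind the maps above, which used Census Bureau data; the county sizes, the number of counties, and the per-person risk are all hypothetical values chosen so the example runs quickly. Every simulated person has the same risk, so any difference between counties is due to chance alone.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# All values below are hypothetical, chosen only for illustration.
RISK = 0.05                            # same per-person risk everywhere
N_COUNTIES = 200                       # counties of each size
SIZES = {"small": 100, "large": 2000}  # people per county

def simulate_rate(population, risk):
    """Observed rate in one county when every person has the same risk."""
    cases = sum(random.random() < risk for _ in range(population))
    return cases / population

# One (observed rate, size label) pair per county, sorted by rate.
counties = [(simulate_rate(pop, RISK), label)
            for label, pop in SIZES.items()
            for _ in range(N_COUNTIES)]
counties.sort()

n_tail = len(counties) // 10  # number of counties in each 10% tail
lowest = [label for _, label in counties[:n_tail]]
highest = [label for _, label in counties[-n_tail:]]

print("small counties among lowest 10%: ", lowest.count("small"), "of", n_tail)
print("small counties among highest 10%:", highest.count("small"), "of", n_tail)
```

Although small counties make up only half of the simulated counties, they should dominate both the highest and the lowest 10% of observed rates, which is the same pattern the two maps show.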

If you would like to know more about how the brain tricks you into false statistical conclusions, Amos Tversky and Daniel Kahneman discuss this and many other pitfalls in “Judgment under Uncertainty: Heuristics and Biases” (1974).

Thanks for reading and stay tuned for the next installment! We’ll be talking about the p-word!



Bhalla, J. “Kahneman’s Mind-Clarifying Strangers: System 1 & System 2”. URL: http://bigthink.com/errors-we-live-by/kahnemans-mind-clarifying-biases. Accessed 27 May 2017.

Tversky, A. & Kahneman, D. (1974) Judgment under Uncertainty: Heuristics and Biases. Science 185 (4157). URL: http://science.sciencemag.org/content/185/4157/1124. Accessed 27 May 2017.

United States Census Bureau. “Geography: Urban and Rural”. URL: https://www.census.gov/geo/reference/urban-rural.html. Accessed 27 May 2017.

Wainer, H. (2007). The Most Dangerous Equation. American Scientist. 95 (3). URL: http://www.americanscientist.org/issues/pub/the-most-dangerous-equation. Accessed 27 May 2017.



Can we transfer flow-ecology knowledge?

It’s been 7 (!) months since my last post. Better than the year+ of my last hiatus. I wanted to share a blog post I wrote for the Olden lab blog about a month ago on my research. As societal water needs and changing water availability outpace our ability to make recommendations for sustainable water use in individual rivers, we’ll need to rely on knowledge that can be applied to multiple rivers. But how feasible is this? Read about my work to explore this question with freshwater fish in the American southwest here!

I’m going to ease back into posting over the next few months. Expect posts about science communication, statistics, and more!

Geek Heresy and EarthGames

I’ve recently started reading a fantastic book on a friend’s recommendation called Geek Heresy: Rescuing Social Change from the Cult of Technology. The book takes a look at the culture of technology in human society, examining how technology came to be so highly regarded as a tool for social change and why this view can be problematic. I’m only one chapter in, but Geek Heresy has already got me thinking about what is likely a central theme: technology does little for social change without the right people to support the change.

Over the weekend, I helped represent EarthGames UW at the second annual Seattle Youth Climate Action Network (Seattle Youth CAN) Summit. During the lunch hour, we let the eager high-schoolers explore some of the games that EarthGames designed over the past year. We followed this up with an activity-packed hour where we guided a dozen students in developing a concept for their very own environmental game!

The event ended up being the highlight of my weekend. I met a young woman who had already designed her own game about pollution using HTML/JavaScript, and within the hour-long game jam, we already had a game concept down (a tower-defense-style game about overfishing)! I got to meet a bunch of really smart kids who were excited to bring about environmental change.

Now, you might be wondering why these two pieces are in the same blog post. Throughout the event, I kept thinking back to Geek Heresy and how these games are like the teaching tools presented at the beginning of the book. While EarthGames UW was founded on the motivation to teach people about climate change and the environment, the games that we make are just as likely to see the same downfalls as the laptops-in-the-wall presented in Geek Heresy’s first chapter: without mentoring or guidance, social change is less effective or does not happen at all.

I’m glad that EarthGames is taking on more opportunities to engage youth through games and game design. There’s a lot of potential in using games to engage with the public, and even more in using game design to let the public engage with us and each other. I hope EarthGames will continue to foster collaborations with engagement groups to enable change in our society. If I get the chance (and time!), I hope to be able to foster these collaborations myself.

What do you think is essential for social change? How do you go about engaging your community? Let me know in the comments section!