**Are apparent patterns indicative of population differences or simply caused by different sample sizes?**

*I am working with E. Ashley Steel and Rhonda Mazza at the PNW Research Station to write short articles on how we can improve the way we think about statistics. Consequently, I am posting a series of five blogs that **explores statistical thinking, provides methods to train intuition, and instills a healthy dose of skepticism. Subscribe to this blog or follow me @ChenWillMath to know when the next one comes out!*

We begin by looking at how the wiring of the brain interferes with our ability to process statistics. The way we internalize information and make decisions can be broken down into two categories:

- System 1 thinking that is automatic and intuition-based
- System 2 thinking that is more deliberate and analytic

Unfortunately, the impulsive nature of System 1 thinking tends to get us into trouble when we interpret statistics. For example, look at the following map of the lower 48 United States.

It illustrates the counties that exhibit the highest 10% of kidney cancer rates (i.e. number of per capita kidney cancer cases), colored by whether they are predominantly rural or urban. Note that there are more rural counties represented on the map than urban counties and that many of the cancer-prevalent counties are in the South or Midwest.

Why might that be? Perhaps rural areas tend to have less access to clean water, which could adversely affect kidney function? Perhaps there are more factories in these areas leading to more health issues?

Before you get too far, let me show you another map, this time of the counties in the bottom 10% of kidney rate incidence.

Interestingly, rural areas appear over-represented among the counties with the lowest kidney cancer rates as well! What is going on?

This was the conundrum that Howard Wainer delved into in an article titled “The most dangerous equation”, published in the American Scientist in 2007. Wainer explained how trends can appear even when the underlying probability of an event occurring is constant. Using data from the United States Census Bureau, we have simulated that scenario in the maps above.

The effect you are seeing has nothing to do with rural versus urban, though it would make a believable headline. The real culprit is population size. It turns out that smaller samples, such as less populous counties, are more prone to exhibiting extreme results. Let us explore this further.

Imagine you flipped 3 (fair) coins. The chance of getting either all heads or all tails is 25%. Now what is the chance of getting all heads or all tails when flipping 30 coins? Less than 1 in 10,000. Despite the identical chance for any one coin to turn up heads (or tails), larger collections of coin flips are less likely to all be heads.

The take home point: our brains are predisposed to look for and interpret patterns. However, strong patterns, regardless of tempting explanations, can be caused by random chance. Here, sample-size differences across counties are responsible for observed kidney cancer rate differences, despite the constant individual risk of kidney cancer (which is likely not the case, but that is a different discussion).

So, what should scientists and science readers do? The first step is to remain vigilant. When confronted with apparent patterns, consider whether they might be due to chance alone. For data like these, ask if the more extreme responses are exhibited by the samples that contain fewer individuals or cover smaller areas. You might also consider using simulations to assess how much random chance contributes to apparent patterns. Simulations will be discussed in future installments of this summer statistical thinking series.

If you would like to know more about how the brain tricks you into false statistical conclusions, Amos Tversky and Daniel Kahneman discusses this and many other pitfalls.

Thanks for reading and stay tuned for the next installment! We’ll be talking about the p-word!

**Sources**

Bhalla, J. “Kahneman’s Mind-Clarifying Strangers: System 1 & System 2”. URL: http://bigthink.com/errors-we-live-by/kahnemans-mind-clarifying-biases. Accessed 27 May 2017.

Tversky, A. & Kahneman, D. (1974) Judgment under Uncertainty: Heuristics and Biases. *Science ***185** (4157). URL: http://science.sciencemag.org/content/185/4157/1124. Accessed 27 May 2017.

United States Census Bureau. “Geography: Urban and Rural”. URL: https://www.census.gov/geo/reference/urban-rural.html. Accessed 27 May 2017.

Wainer, H. (2007). The Most Dangerous Equation. *American Scientist*. **95** (3). URL: http://www.americanscientist.org/issues/pub/the-most-dangerous-equation. Accessed 27 May 2017.

*Ideas in this series are based on material from the course, “So You Think You Can Do Statistics?” taught by Dr. Peter Guttorp, Statistics, University of Washington with support from Dr. Ashley Steel, PNW Station Statistician and Quantitative Ecologist, and Dr. Martin Liermann, statistician and quantitative ecologist at NOAA’s Northwest Science Center. *

Pingback: Patterns from Noise | Will Chen

Pingback: P-hacking and the garden of forking paths | Will Chen

Pingback: Simulations toolbox: The beauty of permutation tests | Will Chen