I’m sure we’ve all seen it — papers reporting p-values < 0.0001 with sample sizes in the hundreds, when they’re just comparing two groups. If you haven’t seen it, here’s an example for you. Go ahead and click; I’ll be waiting here.

Alrighty, so that paper, Neeki, et al. (2018), was looking at (among other things) whether or not methamphetamine users had higher pulse rates on the scene of an accident compared to those who were not methamphetamine users. And lo and behold — they claim that they see a difference! In fact, here’s part of the table where they report their results:

| | Methamphetamine user (n=449) | Non-methamphetamine user (n=449) | p-value |
| --- | --- | --- | --- |
| Pulse on scene | 102.6 ± 22.06 | 94.31 ± 20.97 | < 0.0001 |
| Pulse on arrival at trauma center | 99.96 ± 22.76 | 94.0 ± 20.61 | < 0.0001 |
| Systolic blood pressure on scene | 126.89 ± 22.69 | 131.52 ± 24.34 | 0.0149 |

The first thing that struck me is that there is a huge amount of variability. The next issue I have is that I can’t see the underlying data, so I don’t know the actual shape of the distribution. The sample means, at least, should behave normally (that’s the Central Limit Theorem at work), so I’ll proceed as if normality roughly holds.

But look at that variability. I find it hard to fathom that you can have such small mean differences (the difference in the means is about 8.3 and 6.0 beats per minute for the pulse on scene and pulse on arrival at the trauma center, respectively), yet have such huge standard deviations! This just shouldn’t be, right?

So what’s going on here?

## The t-Test Is Inappropriate for Large Sample Sizes — Here’s Why

The formula for the t-score is:

\( t=\frac{\bar{x}_1-\bar{x}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \)

where \(t\) is the t-score, \(\bar{x}_1\) is the mean of group 1, \(\bar{x}_2\) is the mean of group 2, \(S_p\) is the pooled standard deviation, \(n_1\) is the number of samples in group 1, and \(n_2\) is the number of samples in group 2.

Think about what happens when you have a small number of samples in both groups, so \(n_1\) and \(n_2\) are both small, say 5. Then you have \(\frac{1}{5} + \frac{1}{5} = 0.2 + 0.2 = 0.4\) under the square root, and \(\sqrt{0.4} = 0.63\). So the pooled standard deviation gets multiplied by 0.63, which makes the denominator only marginally smaller than \(S_p\) itself.

By the way, the pooled standard deviation is calculated as:

\( S_p = \sqrt{\frac{S_1^2(n_1 - 1)}{n_1 + n_2 - 2}+\frac{S_2^2(n_2 - 1)}{n_1 + n_2 - 2}}\)

That pooled standard deviation is simply a weighted average of the two groups’ standard deviations (strictly speaking, of their variances). In other words, all that math with respect to the sample sizes doesn’t matter a whole lot in the grand scheme of things.
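To make that concrete, here is a quick check (a sketch in Python rather than the R used elsewhere in this post), using the pulse-on-arrival standard deviations from the paper’s table. With equal group sizes, the pooled SD lands almost exactly on the plain average of the two SDs:

```python
import math

def pooled_sd(s1: float, s2: float, n1: int, n2: int) -> float:
    """Pooled standard deviation: a weighted average of the two variances."""
    num = s1**2 * (n1 - 1) + s2**2 * (n2 - 1)
    return math.sqrt(num / (n1 + n2 - 2))

# SDs from the paper's pulse-on-arrival row, n = 449 per group
s_p = pooled_sd(22.76, 20.61, 449, 449)
print(round(s_p, 2))                   # pooled SD
print(round((22.76 + 20.61) / 2, 2))   # plain average of the two SDs, for comparison
```

The two numbers differ only in the second decimal place, which is the point: the sample-size weighting barely moves the pooled SD.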

Those sample sizes DO MATTER when calculating the t-score, though. And that’s what we’re going to see.

So suppose you have sample sizes of 449 in both groups, instead of 5 like our example above. Now what happens to that pooled-standard-deviation term in the denominator?

Well, we have \(\frac{1}{449} \approx 0.002\). So, \(\frac{1}{449} + \frac{1}{449} \approx 0.004\), and the square root of that is about 0.06. So what’s happening is that this really large sample size ends up shrinking the denominator dramatically. If we assume a pooled standard deviation of 20.2, multiplying it by 0.06 gives us a denominator of 1.212, far smaller than the denominator of 12.726 we would get with a sample size of 5 in each group.

What this means is that the denominator in the large sample size analysis is going to be 10.5 times smaller than in the smaller sample size analysis. If our difference in means was 5.96, then our t-scores would be:

| | Denominator | Numerator | t-score |
| --- | --- | --- | --- |
| n = 5 each group | 12.726 | 5.96 | 0.47 |
| n = 449 each group | 1.212 | 5.96 | 4.92 |

That’s a huge difference in t-score. And do those t-scores actually make a difference in terms of p-values? You betcha!

We can see that when the sample size is 5 in each group, we have a p-value of 0.33. But when you have 449 samples per group, the p-value is much, much lower at \(5.14 \times 10^{-7}\).
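The arithmetic above can be reproduced directly. (A Python sketch; the mean difference of 5.96 and pooled SD of 20.2 are the assumed values from the text, and the quoted p-values appear to be one-tailed. Side note: if you carry the square root at full precision instead of rounding it to 0.06, the large-sample t-score lands around 4.4 rather than 4.92; the conclusion is the same either way.)

```python
import math
from scipy import stats

mean_diff = 5.96   # assumed mean difference from the text
s_pooled = 20.2    # assumed pooled SD from the text

def t_and_p(n: int) -> tuple[float, float]:
    """Two-sample t-score and one-tailed p-value for equal group sizes of n."""
    t = mean_diff / (s_pooled * math.sqrt(1 / n + 1 / n))
    p = stats.t.sf(t, df=2 * n - 2)  # one-tailed, matching the quoted values
    return t, p

for n in (5, 449):
    t, p = t_and_p(n)
    print(f"n = {n:3d} per group: t = {t:.2f}, one-tailed p = {p:.3g}")
```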

So you’ll say, “Okay, Lyle, great – you’ve shown that large samples can drive the p-values down. So what?!”

Okay, let’s address the so-what.

## The Law of Large Numbers and the z-Test

As I said earlier, the problem here is that the sample sizes are so large that our sample data are acting like the population. What that means is that we really shouldn’t be using a t-test. The t-test is based on the idea that we’re dealing with samples from a distribution, not the actual distribution. Once your sample size gets large enough (and what large enough means is really dependent upon the data), then you need to treat the data differently – you need to treat it like the distribution.

Why is that?

The t-test accounts for the fact that you are using samples by adjusting the sample standard deviation by the number of samples. But, as we saw above, you may end up with statistically significant results you shouldn’t trust, simply because your samples are so large.

So let’s do the same comparison, but this time we’re going to use the normal distribution and the z-test instead of the t-test.

We will assume our mean difference is 5.96. We will also assume that we have equal variance to make the math easier, and we will assume a standard deviation of 20.2.

The z-test has the following formula:

\(Z=\frac{\bar{x}_1-\bar{x}_2}{s}\)

where \(\bar{x}_1\) is the mean of group 1, \(\bar{x}_2\) is the mean of group 2, and \(s\) is the standard deviation. You will recognize that the numerator in the z-test is the effect size.

So, given that information, our z-score is:

\(Z = \frac{5.96}{20.2} = 0.295\)

That’s a pretty small z-score, but let’s see what that comes out to in terms of a p-value:

That’s a p-value of 0.38.
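As a quick check of that number (again a Python sketch; the 5.96 mean difference and 20.2 standard deviation are the assumed values from above):

```python
from scipy import stats

mean_diff = 5.96  # assumed mean difference from the text
sd = 20.2         # assumed standard deviation from the text

z = mean_diff / sd
p = stats.norm.sf(z)  # one-tailed p, treating the data as the population
print(f"z = {z:.3f}, p = {p:.2f}")
```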

But wait a second – using the t-test, we got a p-value of \(5.14 \times 10^{-7}\) when we had a sample size of 449. And now you’re saying that if we treat the data as a distribution, we no longer have that hugely significant p-value? How can that be?

Well, let’s try this a different way. We’ll use the data from the paper, specifically the pulse-on-scene data, and draw 449 samples from each distribution: methamphetamine users and non-users. Then we’ll see which p-value is closer.

We can see that the distribution of differences clearly includes the value 0. Since a difference of 0 sits comfortably inside that distribution, we can already tell that when we look at the data as a population, there is no real effect here.

And the mean difference is around 7 beats per minute (I’m not showing all of my R code for this). We can also see that the probability that the difference is greater than 0 is 0.58, which gives us a p-value of around 1 − 0.58 = 0.42. This p-value from the simulated population is awfully close to the 0.38 we computed under the assumption that we are dealing with the population. The reason for the small discrepancy is that we simulated only 449 people from the hypothetical population used in the earlier analysis.
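A version of that simulation can be sketched like this (Python in place of my R; the means and SDs are the pulse-on-scene values from the paper’s table, normality is assumed, and the exact numbers will wobble a bit with the random draw):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is reproducible
n = 449

# Pulse-on-scene parameters from the paper's table
meth = rng.normal(102.6, 22.06, size=n)
non_meth = rng.normal(94.31, 20.97, size=n)

diff = meth - non_meth
p_greater = np.mean(diff > 0)   # probability a random difference exceeds 0
p_value = 1 - p_greater         # the "population" p-value, as in the text
print(f"mean difference = {diff.mean():.1f}")
print(f"P(diff > 0) = {p_greater:.2f}, p = {p_value:.2f}")
```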

The bottom line is this: at 449 samples per group, the sample behaves far more like the actual population than like a small sample.

Again, so what? Well, in terms of the study, this means that we have several false positive findings in a peer-reviewed paper. If we believed the paper, we would assume that we could identify people as methamphetamine users at the scene of a traumatic event based solely on their pulse. This becomes a huge issue in the legal world, as defendants could use the pulse of an accident victim on the scene of an accident to claim that the plaintiff was a drug user and was intoxicated at the time of the accident.

## So What Is The Impact of Increasing Sample Sizes?

Good question. I ran some calculations.

Here we can see the t-scores on the y-axis and the degrees of freedom on the x-axis (degrees of freedom = sample size of group 1 + sample size of group 2 − 2); this is a surrogate for sample size. You can see that the t-scores grow quite quickly as a function of the degrees of freedom.
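That growth can be reproduced directly: with the mean difference and standard deviations held fixed, the t-score scales as \(\sqrt{n}\), so quadrupling the per-group sample size doubles the t-score. (A Python sketch, assuming the 5.96 mean difference and the pulse-on-arrival SDs from the table.)

```python
import math

mean_diff = 5.96        # assumed effect from the text
s1, s2 = 22.76, 20.61   # pulse-on-arrival SDs from the paper's table

def t_score(n: int) -> float:
    """Two-sample t-score with equal group sizes n and pooled SD."""
    sp = math.sqrt((s1**2 * (n - 1) + s2**2 * (n - 1)) / (2 * n - 2))
    return mean_diff / (sp * math.sqrt(2 / n))

for n in (25, 100, 400):
    print(f"n = {n:3d} per group (df = {2*n - 2}): t = {t_score(n):.2f}")
```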

You can see that the cut-off for calling something significant doesn’t change much once you get to around 10-15 or so degrees of freedom (x-axis). 10 degrees of freedom is equivalent to \(n_1 + n_2 - 2 = 10\); so if \(n_1 + n_2 = 12\), you’ve made it to 10 degrees of freedom.

What this means is that the sample size does not change the cut-off value for the t-score to be significant that much once you hit around 10-15 degrees of freedom (so an n = 6 or so per group).

The bottom line: the sample size drastically impacts the size of the t-score. And it is the magnitude of the t-score that determines statistical significance.

So in this example, where both groups have very large standard deviations (22.76 and 20.61 beats per minute), 91 degrees of freedom and higher is all that is needed to drive the t-score into the statistically significant range (alpha = 0.05, or two-tailed p < 0.05).

Note, this means that at smaller sample sizes, a mean difference of 5.96, with standard deviations in the methamphetamine user and non-user groups of 22.76 and 20.61 beats per minute, respectively, is not statistically significant. Therefore, it was the large sample size that made this study’s small mean difference (effect size) of 5.96, with huge standard deviations of 22.76 and 20.61, come out as significant.

## So When Does Sample Size (for this data) Begin to Look Like the Population?

The answer is a bit subjective, but I’m going to use some plots to try to find the point where the samples do begin to look like the population. I will use the root mean square deviation across a range of sample sizes to get at this.
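The RMSD computation can be sketched like this (Python in place of my R; the population is assumed normal with the pulse-on-scene mean and SD from the paper’s table, and the RMSD at each sample size is estimated from repeated draws):

```python
import numpy as np

rng = np.random.default_rng(1)
pop_mean, pop_sd = 102.6, 22.06   # assumed population, per the paper's table
n_reps = 2000                     # repeated draws per sample size

def rmsd_of_mean(n: int) -> float:
    """RMS deviation of the sample mean from the population mean."""
    means = rng.normal(pop_mean, pop_sd, size=(n_reps, n)).mean(axis=1)
    return float(np.sqrt(np.mean((means - pop_mean) ** 2)))

for n in (10, 50, 350):
    print(f"n = {n:3d}: RMSD of the mean = {rmsd_of_mean(n):.2f}")
```

The RMSD shrinks like \(\sigma/\sqrt{n}\), which is exactly the "samples start to look like the population" behavior the plots show.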

Based on the loess model of the root mean square deviation of the means (the line in the plot above), it is clear that the samples begin to recapitulate the population starting at around 50 samples, and overwhelmingly recapitulate it by around 350 for this particular population with its very wide variance.

Here is the information for the standard deviation:

Again, based on the loess model of the root mean square deviation of the standard deviation (the line in the plot above), it is clear that the samples begin to recapitulate the population starting at around 50 samples, and overwhelmingly recapitulate it by around 350 for this particular population with its very wide variance.

How does this all compare if we had a population with a much smaller variance? Let’s see.

With the smaller variance, we can see there’s less spread in the sampled means, as expected.

Again, we begin to see some recapitulation, on average, at a much smaller sample size — probably closer to 20-25. We pretty well have overall recapitulation by a sample size of about 100 (note that the y-axes on this figure and the one above are different).

## Brass Tacks — What Does This All Mean?

- As the sample size increases, samples will begin to operate and appear more and more like the population they are drawn from. This is the Law of Large Numbers.
- Rules of thumb for when a sample is large enough to “invoke” the Law of Large Numbers that I’ve seen in the published literature usually say n >= 30 (here is one example). I’m not sure I’d go that low, but I’d say a general rule of thumb of n = 50 is probably valid based on these simulations.
- Using a t-test with large sample sizes will lead to false positive results (with increasing probability as the sample size increases). Once you’re able to invoke the Law of Large Numbers, you should no longer use the t-test, and should instead rely upon analyses using the z-score for populations (if the data are normally distributed). If your data are not normally distributed (you probably shouldn’t be using the t-test anyway, but I digress), you should use an appropriate distribution-based analysis.
- Be very skeptical of any A/B testing or anything that uses a t-test (or a similar test that requires adjustment based on sample sizes) as the sample sizes get larger. A general rule of thumb: if you can invoke the Law of Large Numbers because your sample looks and acts like the population, then handle your analysis as if you are analyzing population data.