The P-value gives you a way to assess your null and alternative hypotheses.  “The P-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true” (Diez 2015).  In other words, we are asking how often we would get data like those we observed, or data even more extreme in favor of the alternative hypothesis, under the assumption that the null hypothesis is true.

Let’s consider what this would look like in the coin flipping example. If you recall, we are trying to figure out whether a coin is fair (null hypothesis) or biased in favor of tails (alternative hypothesis), and we have observed that 8 out of 11 coin flips resulted in tails. It turns out that a useful test statistic here is simply the count of tails, so our test statistic is 8. Next, we assume that the null hypothesis is true, that is, that the coin is fair. If the coin is fair, half of the 11 flips, or 5.5 flips, should have been tails. We saw 8 tails, more than expected if the null is true. But is this a really unusual result? If the alternative hypothesis is true, then the coin is biased towards tails. What potential data values would be more unusual than what we saw, and consistent with the alternative hypothesis being true?

Well, observing 9, 10, or 11 tails out of 11 flips would have been more unusual than 8, and would be consistent with the belief that the coin is biased towards tails. So, we compute the probability of observing 8, 9, 10, or 11 tails out of 11 coin flips, under the assumption that the null hypothesis is true. That probability is our P-value.
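Under the null hypothesis, the number of tails in 11 fair flips follows a binomial distribution, so the P-value is P(8 or more tails) = [C(11,8) + C(11,9) + C(11,10) + C(11,11)] / 2^11 = 232/2048 ≈ 0.1133. If you would like to verify this yourself, here is a minimal sketch in Python (the lab does not specify any software, so the choice of language here is our own):

    from math import comb

    # P-value: probability of 8 or more tails in 11 flips of a fair coin.
    # Each specific sequence of 11 flips has probability 1 / 2**11.
    n = 11
    p_value = sum(comb(n, k) for k in range(8, n + 1)) / 2**n
    print(round(p_value, 4))  # prints 0.1133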

P-values are probabilities, so they always lie between 0 and 1. Small P-values indicate evidence in favor of your alternative hypothesis. The question is… how small is small enough to reject the null hypothesis?

Returning to the coin example, with a test statistic of 8, the P-value is 0.1133. This means that if we assume the coin is fair, the probability of observing 8 or more tails out of 11 coin flips in repeated studies like this is 0.1133. You’d expect roughly 11 out of 100 studies to give results like this if the coin is fair. That’s not very common (it doesn’t happen 20% of the time), but it’s not really rare either (it happens far more often than 0.1% of the time).

When a P-value is “large”, it suggests that our results (and more extreme results) happen fairly regularly when the null hypothesis is true, meaning what we observed is not particularly unusual, so we don’t have sufficient evidence to reject the null hypothesis.

When a P-value is “small”, it suggests that our results (and more extreme results) are fairly unlikely to occur when the null hypothesis is true, meaning what we observed is unusual, and since those are the results we observed, we have sufficient evidence to reject the null hypothesis.

 

Significance Levels

Each individual considering a P-value may have a different definition of “small”. Indeed, “small” may change depending on the context of the study in question. To help keep everyone on the same page about what “small” means, you can set a cutoff, in advance of performing your test, to serve as your threshold of reference. This threshold is called your significance level, or alpha, and is denoted α.

The significance level is more than just a threshold, though. It conveys information about how often you will reject the null hypothesis even when the null hypothesis is true. Suppose you set α=0.10, and you consider 10 studies where the null hypothesis is true. Out of those ten studies, because you set α=0.10, you would expect to reject the null hypothesis one time, even though the null was true in each case! Rejecting a true null hypothesis is called making a Type I error, and α gives the probability that this occurs. Thus, you might also see the significance level, alpha, referred to as your chance of rejecting your null hypothesis when your null hypothesis is true. It is also possible to make a Type II error, where you fail to reject a false null hypothesis.
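You can see this behavior directly by simulation. The sketch below (written in Python with the numpy and scipy libraries; the software choice and the simulated normal-data scenario are our own assumptions, not part of the lab) runs many studies in which the null hypothesis is true and counts how often a one-sample t-test rejects at α = 0.10:

    import numpy as np
    from scipy import stats

    # Simulate 10,000 studies in which the null hypothesis is TRUE:
    # each sample is drawn from a normal population whose mean really is 0.
    rng = np.random.default_rng(1)
    alpha = 0.10
    n_studies = 10_000
    rejections = 0
    for _ in range(n_studies):
        sample = rng.normal(loc=0.0, scale=1.0, size=30)  # null is true
        t_stat, p = stats.ttest_1samp(sample, popmean=0.0)
        if p <= alpha:  # the decision rule
            rejections += 1

    # The rejection rate comes out close to alpha (about 0.10): roughly
    # 1 in 10 true nulls is rejected, i.e., a Type I error is made.
    print(rejections / n_studies)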

Significance levels vary based on the study at hand and the consequences of rejecting a true null hypothesis, or failing to reject a false null hypothesis. Some examples of common α-levels are 0.001, 0.01, 0.05, and 0.10. When our P-value is less than or equal to our significance level, we say the result is statistically significant, and reject the null hypothesis.

Before conducting a statistical test, you must decide in advance on your acceptable chance of making a Type I error (your α-level).  For the purposes of this lab, we will set our cutoff, or α-level, at 0.05.

For example, say you are interested in the peppered moth in England and suspect that there will be more white moths than gray moths at your study site, which is located far from any city.  Your null hypothesis is that there is no difference in the mean number of white versus gray moths and your alternative is that there is a difference in the mean number of white versus gray moths.  You collect counts of each moth color weekly for three months.

  • Scenario 1: You obtain a P-value of 0.02.  If you selected 0.05 as your α-level or cutoff, then you would reject your null hypothesis and state that there is a difference in the mean number of white moths versus gray moths at your study site.
  • Scenario 2: You obtain a P-value of 0.71.  In this case, you fail to reject your null hypothesis. There is not sufficient evidence to conclude that there is a difference in the mean number of white versus gray moths. (A short sketch of this decision rule follows below.)
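In code form, the decision in both scenarios is simply a comparison of the P-value to α. Here is a minimal sketch (the decide helper is our own hypothetical illustration, not part of the lab):

    # Hypothetical helper illustrating the decision rule: reject when p <= alpha.
    def decide(p_value, alpha=0.05):
        if p_value <= alpha:
            return "reject the null hypothesis"
        return "fail to reject the null hypothesis"

    print(decide(0.02))  # Scenario 1: reject the null hypothesis
    print(decide(0.71))  # Scenario 2: fail to reject the null hypothesis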

 

Note that we do not accept the null hypothesis in Scenario 2! We just fail to reject the null hypothesis. Why? We had to assume the null hypothesis was true in order to do our computations and assessment. We didn’t prove that it was true; instead, we found that the observed data were not unusual when we assumed that the null hypothesis was true.

 

Caveats about P-values

Scientists recognize that setting a specific α-level for a statistical test could result in missing vital information.  For example, if you set α=0.05 and your P-value was 0.052, you would fail to reject the null hypothesis, but the P-value still provides some evidence in favor of your alternative hypothesis. Therefore, we often consider other ways to examine evidence in support of our hypotheses.  For example, we may want to construct a confidence interval for the difference in means based on our data, rather than just focusing on the observed difference in means and the associated P-value from the hypothesis test. A confidence interval provides a range of reasonable values for the parameter of interest (the population characteristic you are interested in, such as a population mean), and allows you to examine the size of an effect (see more below). Both hypothesis tests and confidence intervals account for the variability in our data, but a hypothesis test condenses that information into a single P-value, which tends to receive all the attention, while a confidence interval conveys the variation directly by providing a range of plausible values for the parameter of interest. As we will see, variability in the data is often of central importance to understanding patterns in biology.
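As an illustration, here is a minimal sketch of a 95% confidence interval for a difference in means using Welch’s approximation. The moth counts below are made up purely for demonstration (they are not real study data), and the software choice (Python with numpy and scipy) is likewise our own assumption:

    import numpy as np
    from scipy import stats

    # Hypothetical weekly counts of white and gray moths (illustration only).
    white = np.array([12, 15, 11, 14, 13, 16, 12, 15, 14, 13, 15, 12])
    gray = np.array([10, 12, 9, 11, 13, 10, 12, 11, 10, 12, 9, 11])

    # Point estimate and standard error of the difference in means.
    diff = white.mean() - gray.mean()
    var_w = white.var(ddof=1) / len(white)
    var_g = gray.var(ddof=1) / len(gray)
    se = np.sqrt(var_w + var_g)

    # Welch-Satterthwaite degrees of freedom, then the 95% interval.
    df = (var_w + var_g) ** 2 / (
        var_w**2 / (len(white) - 1) + var_g**2 / (len(gray) - 1)
    )
    t_crit = stats.t.ppf(0.975, df)
    print(f"95% CI: ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f})")

Unlike a lone P-value, the resulting interval shows both the estimated size of the effect and the uncertainty around it.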

 

Summary on P-values

  1. P-values allow us to determine how often we would get data like those we observed or data even more extreme in favor of the alternative hypothesis under the assumption that the null hypothesis is true.
  2. P-values do not measure the probability that the null hypothesis is true, or the probability that the data were produced by random chance alone.
  3. A statistically significant result (P-value less than or equal to your specified significance level) does not measure the size of an effect or the importance of a result. It is up to you to interpret your results.

Above list modified from Wasserstein and Lazar (2016)