The p-value is a popular tool for statistical inference. Unfortunately, the p-value and its role in hypothesis testing are often misused in drawing scientific conclusions. Concern over the use, and misuse, of what is perhaps the most widely taught statistical practice has led the American Statistical Association to craft a statement on behalf of its members. For statistical practitioners, a deeper insight into the workings of the p-value is essential for an understanding of statistical hypothesis testing.
The purpose of this paper is to highlight the flexibility of the p-value as an assessment of statistical evidence. An alleged disadvantage of the p-value is its isolation from more rigorously defined likelihood based measures of evidence. This disconnection leads some to prefer competing measures of evidence, such as likelihood ratio statistics or Bayes factors. However, this disconnection can be bridged. In this paper, we draw attention to results establishing the p-value within the likelihood inferential framework.
In Section 2, we discuss the general idea of statistical evidence. In Section 3, we consider the likelihood principle and establish the aforementioned connection with the p-value. In Section 4, we discuss how the link between a p-value and the likelihood ratio establishes a link between a p-value and a Bayes factor. We close the paper in Section 5 with some concluding remarks on how the p-value plays a role in a broader class of hypothesis testing problems than may be currently appreciated.
2. The P-Value and Evidence
Before going any further, let’s take a moment to think about what is meant by statistical evidence. Let’s think of a researcher collecting data on some natural phenomenon in order to determine which of two (or more) scientific hypotheses is most valid. Data favors a hypothesis when that hypothesis provides a reasonable explanation for what has been observed. Conversely, data provides evidence against a hypothesis when what has been observed deviates from what would be expected. Scientific evidence is not equivalent to scientific belief. It is not until multiple sources of data evidence favor a hypothesis that a foundation of strong belief is built. Because belief arises from multiple researchers and multiple studies, the language for communicating an advancement of scientific knowledge is the language of evidence. Thus, quantification of evidence is a core principle in statistical science.
R.A. Fisher is credited with popularizing the p-value as an objective way for investigators to assess the compatibility between the null hypothesis and the observed data. The p-value is defined as the probability, computed under the null hypothesis, that the test statistic would be equal to or more extreme than its observed value. While the p-value definition is familiar to statistical practitioners, a simple example may help focus on the idea of quantifying evidence. Consider a scientist investigating a binomial probability $\theta$. The goal is to test $H_0: \theta = \theta_0$ against a lower tail alternative $H_1: \theta < \theta_0$. So, $X \sim \mathrm{Binomial}(n, \theta_0)$ under the null hypothesis. In $n$ trials, $x$ successes are observed. Since small values of $X$ support the alternative, the p-value is computed to be
$$p = P(X \le x \mid \theta = \theta_0) = \sum_{k=0}^{x} \binom{n}{k} \theta_0^{k} (1 - \theta_0)^{n-k}.$$
The null hypothesis is most compatible with data near the center of the null distribution. Data incompatible with the null distribution is then characterized by a small p-value. In this way, the p-value serves as an assessment of evidence against the null hypothesis.
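The tail-probability definition can be made concrete with a short Python sketch. The counts used here (3 successes in 12 trials, with a null value of 0.5) and the function name `binomial_lower_tail_p` are illustrative assumptions, not values quoted from the text.

```python
from math import comb


def binomial_lower_tail_p(x, n, theta0):
    """Lower-tail p-value P(X <= x) for X ~ Binomial(n, theta0)."""
    return sum(
        comb(n, k) * theta0**k * (1 - theta0) ** (n - k)
        for k in range(x + 1)
    )


# Hypothetical data (an assumption for illustration only):
# 3 successes in 12 trials, testing theta0 = 0.5 against theta < 0.5.
p_binom = binomial_lower_tail_p(x=3, n=12, theta0=0.5)
print(f"binomial lower-tail p-value: {p_binom:.4f}")
```

The sum runs over the observed count and every smaller count, which is exactly the "equal to or more extreme" part of the definition.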
The p-value is a probabilistic measure of evidence, but not a probabilistic measure of belief. The desire to interpret $p$ as a probability on the null hypothesis must be resisted. But this leaves open the question of how to represent a p-value scale of evidence. Fisher recommended the scale in Table 1.
Table 1. P-value scale of evidence.
The Fisher scale seems to be consistent with common p-value interpretations. For our simple example, we can say there is moderate to borderline evidence against the null hypothesis. In the end, the choice of an appropriate evidence scale should depend on the underlying science, as well as an assessment of the costs and benefits for the application at hand. Particularly troublesome to the goal of improving scientific discourse is a blind adherence to any threshold separating significant and non-significant results.
A perceived shortcoming of the p-value as an assessment of evidence can be illustrated from our simple example. Note that the p-value is a function not only of the data observed ($X = x$) but also of more extreme data that has not been observed ($X < x$). The definition of the p-value as a tail probability implies that the computation of $p$ depends on the sampling distribution of the test statistic. So, the p-value depends on the, perhaps irrelevant, intentions of the investigator, and not merely on the data observed. In this way, the p-value is in violation of the likelihood principle. We will see in the next section, however, that a p-value measure of evidence can be defined to satisfy the likelihood principle. With this result, a major criticism of the p-value is answered.
3. Likelihood Inference
We will take a relatively informal approach in our introduction to likelihood inference. Readers interested in a more rigorous treatment are encouraged to consult   . Simply put, the likelihood principle requires that an evidence measure satisfy two conditions: sufficiency and conditionality. The sufficiency condition states that evidence depends on the data only through a sufficient statistic. The p-value has no real issue in that regard. The conditionality condition states that evidence depends only on the experiment performed and the data observed, not on the intention of the investigator. To see that the p-value fails in this regard, we return to the simple binomial example. Suppose instead of a predetermined sample size $n$, the scientist’s intention was to sample until $x$ successes were observed. Under this scenario, the number of trials $N$ is a random variable. Under the null hypothesis, $N \sim \mathrm{NegBin}(x, \theta_0)$. Since large values of $N$ support the lower tail alternative, the p-value is computed to be
$$p = P(N \ge n \mid \theta = \theta_0) = \sum_{k=0}^{x-1} \binom{n-1}{k} \theta_0^{k} (1 - \theta_0)^{n-1-k}.$$
Now, we have moderate to substantial evidence against the null. Equivalent hypotheses, tested from equivalent data, reach different levels of evidence. Computation of the p-value is not invariant to the sampling scheme, even though the plan to collect the data is unrelated to the evidence provided from what is actually observed. That an unambiguous p-value assessment does not seem to be available is a problem we will address.
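The dependence on the sampling scheme can be checked numerically. The sketch below computes both tail probabilities for the same hypothetical data; the counts (3 successes, 12 trials, null value 0.5) and both function names are assumptions made for illustration, not values quoted from the text.

```python
from math import comb


def binomial_lower_tail_p(x, n, theta0):
    """P(X <= x) when the number of trials n is fixed in advance."""
    return sum(
        comb(n, k) * theta0**k * (1 - theta0) ** (n - k)
        for k in range(x + 1)
    )


def negbin_upper_tail_p(n, x, theta0):
    """P(N >= n) when sampling continues until x successes are seen.

    N >= n happens exactly when the first n - 1 trials contain
    at most x - 1 successes.
    """
    return sum(
        comb(n - 1, k) * theta0**k * (1 - theta0) ** (n - 1 - k)
        for k in range(x)
    )


# Hypothetical data (assumption): 3 successes, 12 trials, theta0 = 0.5.
x, n, theta0 = 3, 12, 0.5
p_fixed_n = binomial_lower_tail_p(x, n, theta0)
p_fixed_x = negbin_upper_tail_p(n, x, theta0)
print(f"fixed n (binomial):          {p_fixed_n:.4f}")
print(f"fixed x (negative binomial): {p_fixed_x:.4f}")
```

The same counts give two different p-values, purely because the stopping rule changes which unobserved outcomes count as "more extreme".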
The development of an evidence measure which does satisfy the likelihood principle proceeds as follows. Let $L(\theta)$ denote the likelihood as a function of an unknown parameter $\theta$. (For simplicity, we take the single parameter case. Nuisance parameters and parameter vectors can be handled with slight adjustments to the development.) Let $\hat\theta$ denote the maximum likelihood estimate. We consider the problem of testing the null hypothesis $H_0: \theta = \theta_0$ under the likelihood inference framework. Define the likelihood ratio as $LR = L(\theta_0)/L(\hat\theta)$. Then $0 \le LR \le 1$. As $LR$ decreases, data evidence against the null hypothesis increases. In this sense, $LR$ provides a measure of evidence against the null hypothesis in the same spirit as a p-value.
We return once more to the binomial data. The likelihood ratio is invariant to sampling scheme. So, the measure of evidence is the same whether the data comes from a binomial or negative binomial. Write
$$LR = \frac{\theta_0^{x} (1 - \theta_0)^{n-x}}{\hat\theta^{x} (1 - \hat\theta)^{n-x}},$$
where the sample proportion $\hat\theta = x/n$ is the maximum likelihood estimate. For testing $H_0: \theta = \theta_0$ with the observed data $(n, x)$, we compute $\hat\theta$ and $LR$. We can say the data supports the null value at about 20% of the level of support given to the maximum likelihood estimate. But while we are successful in creating a measure of evidence which satisfies the likelihood principle, we have lost the familiarity of working with a probability scale.
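A minimal sketch of the invariance claim: the binomial and negative binomial likelihoods differ only by a constant that does not involve the parameter, so the constant cancels in the ratio. The counts (3 successes in 12 trials, null value 0.5) and the function names are illustrative assumptions.

```python
from math import comb


def binomial_likelihood(theta, x, n):
    """Likelihood when n is fixed and the success count x is random."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


def negbin_likelihood(theta, x, n):
    """Likelihood when x is fixed and the trial count n is random."""
    return comb(n - 1, x - 1) * theta**x * (1 - theta) ** (n - x)


# Hypothetical data (assumption): 3 successes, 12 trials, theta0 = 0.5.
x, n, theta0 = 3, 12, 0.5
theta_hat = x / n  # maximum likelihood estimate under both schemes

lr_binom = binomial_likelihood(theta0, x, n) / binomial_likelihood(theta_hat, x, n)
lr_negbin = negbin_likelihood(theta0, x, n) / negbin_likelihood(theta_hat, x, n)
print(f"LR (binomial):          {lr_binom:.4f}")
print(f"LR (negative binomial): {lr_negbin:.4f}")
```

With these assumed counts both ratios come out near 0.21, in line with the "about 20%" reading of relative support.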
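It would be desirable to calibrate a likelihood scale for evidence with the more familiar p-value scale. We can achieve this goal directly when the likelihood function is of the regular case. Let $\ell(\theta) = \log L(\theta)$ denote the log-likelihood, and write its Taylor expansion about $\hat\theta$ as
$$\ell(\theta) = \ell(\hat\theta) + \ell'(\hat\theta)(\theta - \hat\theta) + \tfrac{1}{2}\,\ell''(\hat\theta)(\theta - \hat\theta)^2 + \cdots$$
The regular case occurs when the log-likelihood can be approximated by a quadratic function. Asymptotics for maximum likelihood estimators are derived under the conditions leading to the regular case. Since $\ell'(\hat\theta) = 0$, then
$$\ell(\theta) \approx \ell(\hat\theta) - \frac{(\theta - \hat\theta)^2}{2\,\mathrm{se}^2},$$
where $\mathrm{se}^2$ is the reciprocal of the observed Fisher information $I(\hat\theta) = -\ell''(\hat\theta)$. We can then write the likelihood function at the null value as
$$L(\theta_0) \approx L(\hat\theta)\exp\{-z^2/2\},$$
where $z = (\hat\theta - \theta_0)/\mathrm{se}$ is the Wald statistic for testing $H_0: \theta = \theta_0$. The likelihood ratio statistic becomes
$$LR \approx \exp\{-z^2/2\}. \quad (1)$$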
Let’s introduce a second example to demonstrate the approximation in (1). In a well-known example of data collection (MacKenzie, 2002), a statistics class experimented with spinning the newly minted Belgian Euro. Spinning instead of tossing a coin is more sensitive to unequal weighting of the sides. In $n$ spins, $x$ landed heads side up. Now, the intended sampling scheme is not at all clear from the summary provided. But quantifying evidence through the likelihood ratio statistic renders the question of experimenter intention unimportant. We have $\hat\theta = x/n$ and $\mathrm{se} = \sqrt{\hat\theta(1 - \hat\theta)/n}$. For testing $H_0: \theta = 0.5$ we get $z = (\hat\theta - 0.5)/\mathrm{se}$. From (1), we compute the approximation $LR \approx \exp\{-z^2/2\}$. The exact value of the likelihood ratio statistic is computed as
$$LR = \frac{0.5^{n}}{\hat\theta^{x}(1 - \hat\theta)^{n-x}}.$$
The use of $z$ in approximating $LR$ connects the Wald statistic to the likelihood ratio statistic. A $z$ statistic also leads directly to the calculation of a p-value. Since $LR$ depends on the data through the test statistic $z$ alone, $LR$ is a function of the corresponding p-value. Therefore, in the regular case, one can define a p-value which does satisfy the likelihood principle. No matter the intended sampling scheme in our example, the p-value for a two-sided alternative is seen from the computed Wald statistic to be
$$p = 2\{1 - \Phi(|z|)\},$$
where $\Phi$ is the standard normal distribution function.
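The coin-spinning calculation can be sketched in Python as follows. The counts used (250 spins, 140 heads) are those widely reported for the Belgian Euro experiment and are treated here as an assumption rather than as values quoted from the text; the variable names are likewise illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Assumed counts for the coin-spinning experiment (widely reported values).
n, x, theta0 = 250, 140, 0.5

theta_hat = x / n                               # maximum likelihood estimate
se = sqrt(theta_hat * (1 - theta_hat) / n)      # from observed information
z = (theta_hat - theta0) / se                   # Wald statistic

lr_approx = exp(-z**2 / 2)                      # quadratic approximation
lr_exact = exp(
    x * log(theta0 / theta_hat)
    + (n - x) * log((1 - theta0) / (1 - theta_hat))
)                                               # exact likelihood ratio
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.3f}")
print(f"LR (approx) = {lr_approx:.3f}, LR (exact) = {lr_exact:.3f}")
print(f"two-sided p-value = {p_two_sided:.3f}")
```

Under these assumed counts the approximate and exact likelihood ratios agree to about two decimal places, which is what one would expect when the log-likelihood is close to quadratic.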
We will extend the connection between a likelihood ratio statistic and a p-value to a more general case. Before that, let’s think about some consequences of the regular case. We note that the development could proceed from the asymptotics of the likelihood ratio statistic directly. The Wald statistic appears naturally in the regular case, so no extra difficulty is caused by its consideration. Since the likelihood function is invariant to sampling scheme, so is the Wald statistic. Specifically, the standard error $\mathrm{se}$ does not depend on the underlying sampling distribution. Let’s demonstrate this by comparing the binomial and negative binomial sampling distributions. In both cases, $\hat\theta = x/n$. In the binomial setting, $X$ is the random variable with $\mathrm{Var}(X/n) = \theta(1 - \theta)/n$. The estimated variance becomes
$$\mathrm{se}^2 = \frac{\hat\theta(1 - \hat\theta)}{n}.$$
In the negative binomial setting, $N$ is the random variable. Applying the delta method leads to the asymptotic variance $\mathrm{Var}(x/N) \approx \theta^2(1 - \theta)/x$. The estimated variance here becomes
$$\mathrm{se}^2 = \frac{\hat\theta^2(1 - \hat\theta)}{x} = \frac{\hat\theta(1 - \hat\theta)}{n}.$$
Thus, $\mathrm{se}$ is identical across sampling schemes. This property holds true whenever the likelihood belongs to the regular case. It is interesting to see that the variance parameter does depend on the sampling distribution. Test statistics based on evaluating the variance parameter at the null value are not invariant to the sampling scheme. An example of such a test statistic is the score statistic. Some prefer the score statistic in hypothesis testing because its error rate properties better approximate the stated levels. However, a score statistic does not satisfy the likelihood principle. Under the Fisher viewpoint, the goal of hypothesis testing is to provide a statistical measure of evidence for the case at hand. Error rates for (hypothetical) repeated trials hold no sway under this philosophy. The Wald statistic would thus be preferable under the evidentiary view.
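The contrast between the two kinds of variance can be sketched numerically. The code below evaluates each variance formula at the maximum likelihood estimate (as a Wald statistic does) and at the null value (as a score statistic does). The counts (3 successes, 12 trials, null value 0.5) and all function names are assumptions for illustration.

```python
def binomial_variance(theta, n):
    """Variance of X/n when n is fixed: theta(1 - theta)/n."""
    return theta * (1 - theta) / n


def negbin_variance(theta, x):
    """Delta-method variance of x/N when x is fixed: theta^2 (1 - theta)/x."""
    return theta**2 * (1 - theta) / x


# Hypothetical data (assumption): 3 successes, 12 trials, theta0 = 0.5.
x, n, theta0 = 3, 12, 0.5
theta_hat = x / n

# Evaluated at the MLE: the two sampling schemes agree.
wald_var_binom = binomial_variance(theta_hat, n)
wald_var_negbin = negbin_variance(theta_hat, x)

# Evaluated at the null value: the two sampling schemes disagree.
score_var_binom = binomial_variance(theta0, n)
score_var_negbin = negbin_variance(theta0, x)

print(f"at MLE:  {wald_var_binom:.6f} vs {wald_var_negbin:.6f}")
print(f"at null: {score_var_binom:.6f} vs {score_var_negbin:.6f}")
```

The agreement at the estimate follows from substituting $x = n\hat\theta$; no such cancellation is available when the variance is evaluated at the null value.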
The arrangement which binds a p-value with the likelihood principle is beneficial to both schools of thought. As mentioned previously, the likelihood ratio scale for evidence lacks the familiarity of the p-value scale. The approximation in (1) allows one to more easily interpret a likelihood ratio. Translating $p$ to $z$ to $LR$ leads to an evidential equivalence displayed in Table 2.
A likelihood ratio near 0.15 is the evidential equivalent of a two-sided p-value near 0.05. The 1 in 20 rule applied to the likelihood ratio would translate to a more stringent rule than the rule prevalent throughout much of statistical practice. Table 2 is our link between two seemingly disparate approaches to quantifying evidence.
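The translation between the two scales can be sketched with the standard normal quantile function. The helper names and the list of p-values below are illustrative assumptions and are not claimed to match the rows of Table 2.

```python
from math import exp, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p_to_lr(p):
    """Map a two-sided p-value to z, then to LR = exp(-z^2 / 2)."""
    z = STD_NORMAL.inv_cdf(1 - p / 2)
    return exp(-z**2 / 2)


def lr_to_two_sided_p(lr):
    """Invert the mapping: |z| = sqrt(-2 log LR), then the two-sided tail."""
    z = sqrt(-2 * log(lr))
    return 2 * (1 - STD_NORMAL.cdf(z))


# Illustrative p-values (assumption), printed with their LR equivalents.
for p in (0.10, 0.05, 0.025, 0.01, 0.001):
    print(f"p = {p:<6} ->  LR = {two_sided_p_to_lr(p):.4f}")

# A 1-in-20 cutoff applied on the LR scale, expressed as a p-value.
p_for_lr_cutoff = lr_to_two_sided_p(0.05)
print(f"LR = 0.05 ->  p = {p_for_lr_cutoff:.4f}")
```

Reading the last line, a likelihood ratio of 0.05 corresponds to a two-sided p-value well below 0.05, which is the sense in which the rule is more stringent on the likelihood scale.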
We still need a way to unambiguously connect the p-value to the likelihood ratio for problems outside of the regular case. Evidence measured on the likelihood ratio scale is interpreted the same, whether from the regular case or not. Thus, we have an unambiguous measure of evidence against a null hypothesis on the likelihood ratio scale. We can read this in Table 2 as the rightmost column. The answer we are looking for can be found by reading Table 2 from right to left. For any likelihood ratio statistic, there exists a translated $z$ statistic. Note that such a $z$ statistic need not actually exist as a function of the data. We are only interested in the equivalence to some value on the $z$ evidence scale. We can then translate this into a p-value measure of evidence. In other words, any likelihood ratio can be uniquely translated into a p-value. We thus have a p-value, or at least an evidential measure on the p-value scale, which satisfies the likelihood principle.
Table 2. LR scale of evidence.
Let’s demonstrate the computation of a likelihood based p-value by returning one last time to the simple binomial example. The likelihood function is not of the regular case, but that does not matter. Earlier in this section, we computed $LR$ for these data. We can connect a likelihood ratio to a $z$ statistic by solving (1) as
$$|z| = \sqrt{-2 \log LR}.$$
For our problem, we get $z$ directly from the computed $LR$. We can easily connect a $z$ statistic to a p-value. Since the alternative hypothesis is one-sided, we can compute $p = 1 - \Phi(|z|)$, with $\Phi$ the standard normal distribution function. No matter the frequentist intention for the experiment, the calculations for $LR$ remain the same. The result is an unambiguous p-value calculation. One can use a p-value measure of evidence while adhering to the likelihood principle.
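The full chain from counts to a likelihood based p-value can be sketched in a few lines. The counts (3 successes, 12 trials, null value 0.5) are illustrative assumptions, as are the variable names.

```python
from math import log, sqrt
from statistics import NormalDist

# Hypothetical data (assumption): 3 successes, 12 trials, theta0 = 0.5.
x, n, theta0 = 3, 12, 0.5
theta_hat = x / n

# Likelihood ratio; the combinatorial constant cancels, so the
# stopping rule plays no role here.
lr = (theta0 / theta_hat) ** x * ((1 - theta0) / (1 - theta_hat)) ** (n - x)

# Translate LR to the z scale, then to a one-sided p-value.
z_equiv = sqrt(-2 * log(lr))
p_likelihood = 1 - NormalDist().cdf(z_equiv)

print(f"LR = {lr:.4f}, equivalent z = {z_equiv:.3f}")
print(f"likelihood based one-sided p-value = {p_likelihood:.4f}")
```

For these assumed counts the resulting value sits between the two tail probabilities obtained from the binomial and negative binomial sampling schemes, but unlike them it does not change with the stopping rule.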
Any testing problem where evidence can be quantified through the likelihood function can also be quantified through a uniquely defined measure on the p-value scale. We can think of this measure as defining a p-value which does indeed satisfy the likelihood principle.
4. Bayes Factors
A second deficiency to be addressed is that the p-value as an assessment of evidence accounts only for the direction of the alternative hypothesis, and not for a specified alternative value. We will explore this issue further by studying the Bayes factor. Let’s demonstrate the use of Bayes factors for quantifying evidence through a development analogous to the regular case of the likelihood function. Consider a test statistic $Z$ distributed conditional on parameter $\mu$ as $Z \mid \mu \sim N(\mu, 1)$. Under this parameterization, $\mu$ is interpreted as $\sqrt{n}$ times a dimensionless measure of effect size. So, the mean of $Z$ grows with the sample size when the effect size is nonzero.
We are interested in testing the null hypothesis of no effect, $H_0: \mu = 0$. Taking a Bayesian approach, define $\pi_0$ as the prior probability on the null. For now, we take a point prior on the alternative as well: $H_1: \mu = \mu_1$, where $\mu_1 \ne 0$. The posterior probability on the null conditional on observing $Z = z$ becomes
$$P(H_0 \mid z) = \frac{\pi_0 f_0(z)}{\pi_0 f_0(z) + (1 - \pi_0) f_1(z)},$$
where $f_0$ and $f_1$ are the null and alternative densities, respectively, for $Z$. The Bayes factor, $BF$, is introduced by writing the posterior odds as
$$\frac{P(H_0 \mid z)}{1 - P(H_0 \mid z)} = \frac{\pi_0}{1 - \pi_0} \times BF,$$
where $BF = f_0(z)/f_1(z)$. The Bayes factor is an evidence measure closely related to likelihood inference. A goal of quantifying evidence is to isolate the data information from any prior scientific knowledge on the nature of the phenomenon under investigation. Within the Bayesian framework, it is reasonable to quantify evidence by the change in probability from the prior (before data) to the posterior (after data). The Bayes factor measures this change as the ratio of the posterior odds to the prior odds. If $BF < 1$, the odds in favor of the null have decreased, so the data evidence is against the null hypothesis. If $BF > 1$, the odds in favor of the null have increased after data observation, and the evidence from the data is then in favor of the null. (Bayes factors, unlike p-values and likelihood ratio statistics, can be used to determine when evidence favors a null hypothesis.)
Let’s consider a simple example to show why the specification of an alternative hypothesis matters when defining a measure of evidence. Suppose we are testing $H_0: \mu = 0$ against the specific alternative $H_1: \mu = \mu_1$. Suppose further that we observe $Z = z$. The one-sided p-value of $p = 1 - \Phi(z)$ represents moderate evidence against the null according to the Fisher scale. The Bayes factor is computed to be $BF = f_0(z)/f_1(z)$. As $BF < 1$, the data evidence is against the null, consistent with the information from the p-value. Now, consider testing $H_0: \mu = 0$ against the specific alternative $H_1: \mu = \mu_2$, with $\mu_2 > \mu_1$. Recall that $\mu$ depends on the sample size. We can imagine a test here similar to the first, but with a larger sample size. Suppose we again observe $Z = z$. On the surface, it would appear that we have replicated the result from the first experiment. Once again, there is evidence against the null as judged by the p-value. The Bayes factor tells us a different story. For the replicated data, we compute $BF$ with $\mu_2$ in place of $\mu_1$. Since $BF > 1$, the data evidence is actually in favor of the null hypothesis. Neither hypothesis is particularly compatible with the observed data, but the null model provides a better fit than the specified alternative. A small p-value is only an indication of the null fit. To properly quantify evidence, one needs an assessment under the alternative hypothesis as well. This idea is summarized in  : “The clear message is that knowing the data are rare under $H_0$ is of little use unless one determines whether or not they are also rare under $H_1$.”
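The reversal described above can be reproduced with a short sketch, assuming a test statistic that is normal with unit variance and mean equal to the hypothesized value. The observed statistic (1.645) and the two alternative means (2 and 4) are hypothetical values chosen for illustration, not numbers quoted from the text.

```python
from statistics import NormalDist


def point_bayes_factor(z, mu_alt):
    """Bayes factor f0(z) / f1(z) for a point null at 0 and a point
    alternative at mu_alt, with a unit-variance normal test statistic."""
    null_density = NormalDist(0.0, 1.0).pdf(z)
    alt_density = NormalDist(mu_alt, 1.0).pdf(z)
    return null_density / alt_density


# Hypothetical values (assumptions): same observed z in both studies,
# but the second study has a larger sample and so a larger alternative mean.
z_obs = 1.645
one_sided_p = 1 - NormalDist().cdf(z_obs)
bf_first = point_bayes_factor(z_obs, mu_alt=2.0)
bf_second = point_bayes_factor(z_obs, mu_alt=4.0)

print(f"one-sided p-value (both studies): {one_sided_p:.3f}")
print(f"BF, alternative mean 2: {bf_first:.3f}")
print(f"BF, alternative mean 4: {bf_second:.3f}")
```

The p-value is identical in the two hypothetical studies, while the Bayes factor moves from below 1 to above 1 once the alternative sits far from the observed statistic.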
Let’s make the problem more general by taking a continuous prior over the alternative values for $\mu$. Write $H_1: \mu \sim \pi_1(\mu)$. The Bayes factor is now written as
$$BF = \frac{\exp\{-z^2/2\}}{\int \exp\{-(z - \mu)^2/2\}\, \pi_1(\mu)\, d\mu}. \quad (2)$$
It is worth noting that $\pi_1$ must be a proper prior in order for the integral in (2) to exist. A Bayesian cannot fall back upon objective or noninformative priors for testing problems involving a precise null hypothesis. That one must specify a proper alternative prior also follows intuitively from our discussion on how the Bayes factor as a measure of evidence requires a characterization of the alternative.
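To see why a diffuse alternative prior is problematic, consider a sketch in which the alternative prior is normal with mean 0 and standard deviation `tau`. This choice of prior is an assumption made here for illustration; it is not a prior specified in the text. With a unit-variance normal test statistic, the Bayes factor then has a closed form, which the code checks against a simple midpoint-rule integration.

```python
from math import exp, pi, sqrt


def bf_normal_prior(z, tau):
    """Closed-form Bayes factor when the alternative prior is N(0, tau^2)
    and the test statistic is N(mu, 1): the marginal is N(0, 1 + tau^2)."""
    return sqrt(1 + tau**2) * exp(-(z**2) * tau**2 / (2 * (1 + tau**2)))


def bf_normal_prior_numeric(z, tau, half_width=40.0, steps=80000):
    """Same quantity, with the denominator integrated by the midpoint rule."""
    h = 2 * half_width / steps
    marginal = 0.0
    for i in range(steps):
        mu = -half_width + (i + 0.5) * h
        likelihood = exp(-((z - mu) ** 2) / 2) / sqrt(2 * pi)
        prior = exp(-(mu**2) / (2 * tau**2)) / (tau * sqrt(2 * pi))
        marginal += likelihood * prior * h
    null_density = exp(-(z**2) / 2) / sqrt(2 * pi)
    return null_density / marginal


z_obs = 1.96  # hypothetical observed statistic (assumption)
for tau in (1.0, 2.0, 10.0, 1000.0):
    print(f"tau = {tau:>7}: BF = {bf_normal_prior(z_obs, tau):.3f}")
```

As `tau` grows, the prior approaches an improper flat prior and the Bayes factor grows without bound in favor of the null, whatever the data; this is one way to see that the alternative prior has to be chosen with care.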
The requirement that one must know something specific about the alternative hypothesis is not just a consequence of Bayesian testing. An analysis of type II error probabilities is also based on a specified alternative. But suppose we resist setting a specific alternative hypothesis. This idea is not new, nor is it without merit. Fisher did not consider the specification of an alternative to be an important aspect of a testing problem. The feud between Fisher and Neyman was in part over the Neyman-Pearson reliance on error rates. One could argue, as Fisher did, that the goal of a testing problem should be to identify any evidence which contradicts a null hypothesis. Where does this leave us in our attempt to link these contrasting philosophies for quantifying evidence? While we will not find a complete success in this regard, it does happen that one can partially bridge the gap between p-values and Bayes factors.
As with p-values and likelihood ratio statistics, smaller values of the Bayes factor represent greater evidence against the null hypothesis. An evidence scale for interpreting a Bayes factor was initially proposed by Jeffreys, then modified slightly (see Table 3).
Let’s continue with our discussion on the regular case. The Bayes factor in (2) requires specification of an alternative; a p-value does not. The p-value philosophy is consistent with the idea of searching for evidence across the entirety of the alternative space. This idea can be put into play by determining the specific alternative best supported by the observed data. In our problem, that idea translates into an alternative prior placing full weight at $\mu = z$. The maximum denominator value of 1 is achieved for this use of a prior. The result is a lower bound for the Bayes factor, given by
$$BF \ge \exp\{-z^2/2\}. \quad (3)$$
We further recognize the right hand side of the inequality in (3) as the likelihood ratio statistic from our discussion in Section 3. In fact, one can show that the inequality holds for the general problem of testing $H_0: \theta = \theta_0$. We showed how to link the p-value, and $z$ statistic, to the likelihood ratio statistic. We then extend this link to the (minimum) Bayes factor through inequality (3).
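The lower bound can be checked numerically by scanning over point alternatives. The sketch below assumes a unit-variance normal test statistic, for which the point-alternative Bayes factor has a simple closed form; the observed statistic (1.96), the grid of alternative means, and the function names are illustrative assumptions.

```python
from math import exp


def point_bayes_factor(z, mu_alt):
    """f0(z) / f1(z) for N(0, 1) versus N(mu_alt, 1), simplified algebraically."""
    return exp(-mu_alt * z + mu_alt**2 / 2)


def min_bayes_factor(z):
    """Lower bound exp(-z^2 / 2), attained when the alternative sits at z."""
    return exp(-(z**2) / 2)


z_obs = 1.96  # hypothetical observed statistic (assumption)
grid = [i / 100 for i in range(1, 601)]  # alternative means 0.01, ..., 6.00
bayes_factors = [point_bayes_factor(z_obs, mu) for mu in grid]

smallest = min(bayes_factors)
best_mu = grid[bayes_factors.index(smallest)]
print(f"smallest BF on grid: {smallest:.4f} at alternative mean {best_mu}")
print(f"bound exp(-z^2/2):   {min_bayes_factor(z_obs):.4f}")
```

Every alternative on the grid gives a Bayes factor at or above the bound, and the bound is reached only when the alternative is placed exactly at the observed statistic, which is the best-case scenario for the alternative.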
An extended discussion of the positive implications which may arise from a shift to thinking about p-values in conjunction with Bayes factors is provided in  . We add to this discussion by noting how the relationship between a p-value and a Bayes factor puts the results from Section 3 in a different light. Let’s return to the coin spinning example. The p-value for testing the hypothesis of a fair coin was computed from the Wald statistic $z$. The corresponding likelihood ratio statistic, $LR$, represents a lower bound on $BF$. Since small values indicate stronger evidence, the likelihood ratio serves as an upper bound on the strength of evidence against the null. A Bayes factor equal to this bound would constitute positive evidence against the null on the Jeffreys scale, similar to a description of moderate evidence as determined from the p-value on the Fisher scale. But since the actual Bayes factor cannot be reconstructed precisely, a bound on the evidence measure is the best one can achieve from the p-value. An interpretation of a p-value as a measure of evidence must be tempered by the realization that such a measure is computed under the best case scenario for the alternative. This is a good reminder for an investigator to be careful about overstating evidence summarized through a p-value.
Table 3. BF scale of evidence.
5. Concluding Remarks
An understanding of what can be implied from hypothesis testing results is a necessary obligation for a conscientious scientist. There is much debate as to the role of the p-value in scientific reasoning and discussion. Criticism over the use of the p-value tends to focus on its deficiencies in comparison to more rigorously defined evidential measures. We have seen, however, that a p-value measure of evidence can be defined under the likelihood principle. Furthermore, we have seen that the information from a p-value is related to a measure of evidence provided by a Bayes factor. The connection between p-values and likelihood-based measures of evidence broadens the use of the p-value in statistical hypothesis testing. Even if one desires a quantification of evidence through the likelihood principle, or through a Bayes factor, the p-value can still be a useful instrument.
 MacKenzie, D. (2002) Euro Coin Accused of Unfair Flipping. New Scientist.