Bayes Factor with Lindley Paradox and Two Standard Methods in Model Selection
Abstract: Model selection is a necessary step in almost any statistical analysis, and the Bayes factor is one of its basic building blocks. For the one-sided hypothesis testing problem, we extend the agreement between frequentist and Bayesian evidence to the generalized p-value, and study the agreement between the generalized p-value and the posterior probability of the null hypothesis. For the point-null hypothesis testing problem, the Bayesian evidence under the traditional Bayesian testing method, that is, the Bayes factor or the posterior probability of the point null, can conflict sharply with the classical frequentist evidence given by the p-value; this phenomenon is known as the Lindley paradox. Many statisticians have worked on this problem from both the frequentist and the Bayesian perspective. In this paper I focus on the Bayesian approach to model selection, starting from Bayes factors and working through the Lindley paradox, with brief remarks on the partial and fractional Bayes factors; my aim is to present this paradox in a simple way. A detailed derivation of BIC and AIC is given in Section 4. The guiding principle in selecting an optimal model balances two aspects: maximizing the likelihood function and minimizing the number of unknown parameters in the model. A larger likelihood value means a better fit, but fitting accuracy alone cannot be the criterion, since it drives the model toward more and more unknown parameters, and an increasingly complex model overfits the data. A good model therefore trades off fitting accuracy against the number of unknown parameters.

1. Introduction

Many statisticians naturally face the question of model selection. To define the “best model” for real data, different approaches have been proposed since the last century, among them the F-test, AIC, BIC, and Bayesian model averaging. We focus on the Bayesian approach: we analyze data under several candidate models ${M}_{1},\cdots ,{M}_{k}$. Denote by $\theta \in \Theta$ the parameter of a model, by $\pi_k$ the prior probability of model $M_k$, by $f_k(Data|\theta_k)$ its likelihood, and by $g_k(\theta_k)$ the prior density of its parameters. The posterior of model $M_k$ with parameter $\theta_k$ is proportional to $\pi_k\int_{\Omega_k}f_k(Data|\theta_k)g_k(\theta_k)\,\mathrm{d}\theta_k$, and normalizing gives the posterior probability

$P(M_k|Data)=\frac{\pi_k\int_{\Omega_k} f_k(Data|\theta_k)g_k(\theta_k)\,\mathrm{d}\theta_k}{\sum_{j=1}^{K}\pi_j\int_{\Omega_j} f_j(Data|\theta_j)g_j(\theta_j)\,\mathrm{d}\theta_j}\propto \pi_k\int_{\Omega_k} f_k(Data|\theta_k)g_k(\theta_k)\,\mathrm{d}\theta_k$

In a Bayesian analysis, the priors $\pi_k$ on each model and $g_k(\theta_k)$ on the parameters of model $k$ are proper and subjective, and the Bayesian solution is to compute the posterior probability $P(M_k|Data)$ of each model. For model selection, we choose the model that maximizes $P(M_k|Data)$.

However, the Bayes factor has its own limitations: by itself it can only express how strongly the data favor one model over another, for example a hypothesized model against a null model, not an absolute measure of fit. It is also closely tied to the priors: changing the width of the prior changes the Bayes factor. At this point, we need to consider the Lindley paradox.

In Section 2, we give a simple and general explanation of the Bayes factor. In Section 3, we discuss Lindley's paradox. Section 4 contains the main theoretical part, the derivations of AIC and BIC, together with a simple example of using AIC and BIC.

2. Bayes Factor

Before anything else, we construct one of the most important quantities of Bayesian methods: the Bayes factor.

Suppose we have data $D$, a parameter $\theta$ with a prior distribution, and two candidate models ${M}_{1},{M}_{2}$. By the rule of conditional probability, we have:

$P\left({M}_{1}|D\right)=\frac{P\left(D|{M}_{1}\right)P\left({M}_{1}\right)}{P\left(D\right)}$

Since the two models are assumed exhaustive, $P(M_1)=1-P(M_2)$. The term $P(D|M_1)$ is the marginal likelihood, $P(D|M_1)=\int P(D|\theta,M_1)P(\theta|M_1)\,\mathrm{d}\theta$, where $P(\theta|M_1)$ is the prior of the parameter $\theta$ under $M_1$. Then, by Bayes' rule,

$P(M_1|D)=\frac{P(D|M_1)P(M_1)}{P(D|M_1)P(M_1)+P(D|M_2)P(M_2)}$

$\frac{P(M_1|D)}{P(M_2|D)}=\frac{P(D|M_1)}{P(D|M_2)}\cdot\frac{P(M_1)}{P(M_2)}$

where $\frac{P(D|M_1)}{P(D|M_2)}$ is defined as the Bayes factor; notice that it is also the ratio of the marginal likelihoods. We denote the Bayes factor by:

${B}_{1,2}\left(y\right)=\frac{P\left(y|{M}_{1}\right)}{P\left(y|{M}_{2}\right)}$
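To make the definition concrete, here is a minimal sketch (the observation x and the prior variances are hypothetical, not taken from the paper) that computes a Bayes factor by numerically integrating the two marginal likelihoods, for a normal likelihood $N(x|\theta,1)$ under zero-mean normal priors of different variances:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def marginal_likelihood(x, prior_var, half_width=50.0, grid=20001):
    """P(x|M) = integral of N(x|theta,1) N(theta|0,prior_var) d theta, by trapezoidal quadrature."""
    h = 2 * half_width / (grid - 1)
    total = 0.0
    for i in range(grid):
        theta = -half_width + i * h
        w = 0.5 if i in (0, grid - 1) else 1.0
        total += w * normal_pdf(x, theta, 1.0) * normal_pdf(theta, 0.0, prior_var)
    return total * h

x = 1.5  # hypothetical observation
b_12 = marginal_likelihood(x, 1.0) / marginal_likelihood(x, 100.0)
# sanity check against the closed form: integral of N(x|theta,1)N(theta|0,v) = N(x|0,1+v)
print(b_12, normal_pdf(x, 0.0, 2.0) / normal_pdf(x, 0.0, 101.0))
```

Here the two "models" share a likelihood and differ only in prior variance, so the printed Bayes factor also illustrates the prior sensitivity mentioned in the introduction: widening the prior alone changes $B_{1,2}$.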

Bayesian methods fit many testing problems because they can quantify decisively the evidence in favor of the null model, in contrast with p-values, which are usually regarded only as a measure of evidence against the null. The Bayes factor (Jeffreys, 1961) is the standard tool of Bayesian hypothesis testing. Assume that ${p}_{i}\left(D|{\theta }_{i}\right),i=1,2$, are the likelihoods of $D$ under two competing hypotheses ${H}_{1}$ and ${H}_{2}$ with parameters ${\theta }_{i}$, and let ${\pi }_{i}\left({\theta }_{i}\right)$ be their prior distributions. The Bayes factor for ${H}_{2}$ against ${H}_{1}$ is:

${B}_{2,1}=\frac{\int \text{ }{P}_{2}\left(D|{\theta }_{2}\right){\pi }_{2}\left({\theta }_{2}\right)\text{d}{\theta }_{2}}{\int \text{ }{P}_{1}\left(D|{\theta }_{1}\right){\pi }_{1}\left({\theta }_{1}\right)\text{d}{\theta }_{1}}$

A large value of $B_{2,1}$ means the evidence from the data favors ${H}_{2}$ against ${H}_{1}$. The Bayes factor can therefore avoid many limitations of p-value testing, and its development in statistical model testing has found application in many areas of research.

3. Lindley's Paradox

3.1. Improper Priors

Lindley's paradox shows how a value (or a number of standard deviations) that a frequentist test treats as decisive evidence against the null can lead to a completely different inference in a Bayesian hypothesis test.

Improper priors (priors that cannot be normalized to integrate to one) are acceptable for some purposes, such as estimation, but they cause problems in null hypothesis testing and model selection. To see this, consider testing the hypotheses:

${H}_{0}:\theta \in {\Theta }_{0}\text{\hspace{0.17em}}\text{vs}\text{\hspace{0.17em}}{H}_{1}:\theta \in {\Theta }_{1}$

Writing the marginal density of $\theta$ as a mixture over the two hypotheses, we can use the following model:

$p(\theta)=p(\theta|H_0)p(H_0)+p(\theta|H_1)p(H_1)$

Assuming that $p(\theta|H_0)$ and $p(\theta|H_1)$ are proper density functions, the posterior is given by:

$\begin{array}{l}p(H_0|D)\\ =\frac{p(H_0)p(D|H_0)}{p(H_0)p(D|H_0)+p(H_1)p(D|H_1)}\\ =\frac{p(H_0)\int_{\Theta_0}p(D|\theta)p(\theta|H_0)\,\mathrm{d}\theta}{p(H_0)\int_{\Theta_0}p(D|\theta)p(\theta|H_0)\,\mathrm{d}\theta+p(H_1)\int_{\Theta_1}p(D|\theta)p(\theta|H_1)\,\mathrm{d}\theta}\end{array}$

Now suppose we use improper priors, with $p(\theta|H_0)\propto z_0$ and $p(\theta|H_1)\propto z_1$. Then:

$\begin{array}{c}p\left({H}_{0}|D\right)=\frac{p\left({H}_{0}\right){z}_{0}{\int }_{{\Theta }_{0}}p\left(D|\theta \right)\text{d}\theta }{p\left({H}_{0}\right){z}_{0}{\int }_{{\Theta }_{0}}\text{ }p\left(D|\theta \right)\text{d}\theta +p\left({H}_{1}\right){z}_{1}{\int }_{{\Theta }_{1}}\text{ }p\left(D|\theta \right)\text{d}\theta }\\ =\frac{p\left({H}_{0}\right){z}_{0}{s}_{0}}{p\left({H}_{0}\right){z}_{0}{s}_{0}+p\left({H}_{1}\right){z}_{1}{s}_{1}}\end{array}$

where $s_i=\int_{\Theta_i}p(D|\theta)\,\mathrm{d}\theta$ is the (unnormalized) integrated likelihood of model $i$. Assume further that $p(H_0)=p(H_1)=\frac{1}{2}$.

Then we obtain:

$p\left({H}_{0}|D\right)=\frac{{z}_{0}{s}_{0}}{{z}_{0}{s}_{0}+{z}_{1}{s}_{1}}=\frac{{s}_{0}}{{s}_{0}+\left(\frac{{z}_{1}}{{z}_{0}}\right){s}_{1}}$

Since $z_0$ and $z_1$ are arbitrary constants, we can change the posterior arbitrarily by choosing different values of $z_1/z_0$. Proper but very diffuse priors cause a similar problem: the marginal probability of the data under a complex model with a diffuse prior becomes very small, so the simpler, sharper model is favored almost regardless of the data. This is what is called the Lindley paradox.
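The arbitrariness can be shown in a few lines. In this sketch the integrated likelihoods $s_0,s_1$ are hypothetical placeholder numbers; only the unidentified constant $z_1/z_0$ is varied:

```python
s0, s1 = 0.3, 0.7  # hypothetical integrated likelihoods for H0 and H1

def posterior_h0(z_ratio):
    # p(H0|D) = s0 / (s0 + (z1/z0) * s1) depends on the arbitrary ratio z1/z0
    return s0 / (s0 + z_ratio * s1)

for r in (0.01, 1.0, 100.0):
    print(f"z1/z0 = {r}: p(H0|D) = {posterior_h0(r):.4f}")
```

Any posterior probability strictly between 0 and 1 can be produced by a suitable choice of the ratio, which is exactly why improper priors are unusable here.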

3.2. A Simple Model in Lindley Paradox

Many authors have discussed this so-called paradox in different ways; here I want to present it in a simple setting. The usual point-null hypothesis testing problem is to test:

${H}_{0}:\theta =0\text{\hspace{0.17em}}\text{vs}\text{\hspace{0.17em}}{H}_{1}:\theta \ne 0$

in the normal model $N(x|\theta,1)$, with prior probability $p(H_0)=\rho_0$ assigned to the null.

Let $\pi(\theta)=N(\theta|0,\sigma^2)$ ($\sigma>0$) be the prior distribution of the unknown parameter $\theta$ in the model.

The Bayes factor is given by:

$B=\frac{N\left(x|0,1\right)}{\int \text{ }N\left(x|\theta ,1\right)\pi \left(\theta \right)\text{d}\theta }$

In order to consider the paradox, we can formalise it and compare the two following normal models:

${M}_{0}=\left\{N\left(x|0,1\right)={\left(2\pi \right)}^{-\frac{1}{2}}\mathrm{exp}\left(-\frac{1}{2}{x}^{2}\right)\right\}$

${M}_{1}=\left\{N\left(x|\theta ,1\right)={\left(2\pi \right)}^{-\frac{1}{2}}\mathrm{exp}\left\{-\frac{1}{2}{\left(x-\theta \right)}^{2}\right\}\right\}$

Consider a physical system in which a quantity $X$ can be measured. We use $\sigma$ to define both priors: the prior distribution of $\theta$ under the alternative, and the prior probability $\rho_0$ of the null hypothesis, which we allow to depend on $\sigma$.

The Bayes factor representing the odds in favor of the null hypothesis $H_0$ is:

$B_0=\frac{N(x|0,1)}{\int N(x|\theta,1)N(\theta|0,\sigma^2)\,\mathrm{d}\theta}=\frac{N(x|0,1)}{N(x|0,\sigma^2+1)}=\sqrt{\sigma^2+1}\,\exp\left(-\frac{x^2}{2}+\frac{x^2}{2(\sigma^2+1)}\right)$
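A quick numerical illustration (a sketch, with $x$ fixed at the frequentist 5% boundary): as the prior width $\sigma$ grows, $B_0$ turns from mild evidence against $H_0$ into strong evidence for $H_0$, even though the p-value stays at 0.05:

```python
import math

def bayes_factor_null(x, sigma):
    # B0 = N(x|0,1) / N(x|0, 1 + sigma^2)
    return math.sqrt(sigma ** 2 + 1) * math.exp(-x ** 2 / 2 + x ** 2 / (2 * (sigma ** 2 + 1)))

x = 1.96  # just significant at the 5% level for the frequentist z-test
for sigma in (1.0, 10.0, 100.0):
    print(f"sigma = {sigma}: B0 = {bayes_factor_null(x, sigma):.3f}")
```

The $\sqrt{\sigma^2+1}$ factor grows without bound, so for a diffuse enough prior the Bayes factor favors $H_0$ at exactly the observation a frequentist would call significant.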

In this case, prior probabilities $p(H_0)$ and $p(H_1)=1-p(H_0)$ can be assigned to the two hypotheses. Given the observed result $x$, Bayes' theorem states:

$p(H_m|x)p(x)=p(x|H_m)p(H_m)$

for $m=0,1$, where $p(H_m)$ are the prior probabilities and $p(x|H_m)$ is the conditional distribution of $x$ under $H_m$; $p(x)=p(x|H_0)p(H_0)+p(x|H_1)p(H_1)$ is the overall (marginal) distribution. The posterior probability of hypothesis $H_m$ is $p(H_m|x)$. By Bayes' theorem we can evaluate the posterior probabilities; $p(H_0|x)$ is given by:

$\begin{array}{c}p\left({H}_{0}|x\right)=\frac{p\left(x|{H}_{0}\right)p\left({H}_{0}\right)}{p\left(x\right)}\\ =\frac{p\left(x|{H}_{0}\right)p\left({H}_{0}\right)}{p\left(x|{H}_{0}\right)p\left({H}_{0}\right)+p\left(x|{H}_{1}\right)p\left({H}_{1}\right)}\\ ={\left[1+\frac{p\left(x|{H}_{1}\right)p\left({H}_{1}\right)}{p\left(x|{H}_{0}\right)p\left({H}_{0}\right)}\right]}^{-1}\\ ={\left[1+\frac{1-p\left({H}_{0}\right)}{p\left({H}_{0}\right)}\frac{p\left(x|{H}_{1}\right)}{p\left(x|{H}_{0}\right)}\right]}^{-1}\end{array}$

Then we center the prior distributions ${\pi}_{m}(\theta)\equiv p(\theta|H_m)$ at zero, and take the prior under the alternative to be a normal distribution with standard deviation $\tau$:

${\pi }_{1}\left(\theta \right)={\left(2\pi {\tau }^{2}\right)}^{-\frac{1}{2}}\mathrm{exp}\left\{\frac{-{\theta }^{2}}{2{\tau }^{2}}\right\}$

Evaluating the conditional probabilities:

$p\left(x|{H}_{m}\right)=\int \text{ }{\pi }_{m}\left(\theta \right)p\left(x|\theta \right)\text{d}\theta$

We can evaluate $p(x|H_0)$ and $p(x|H_1)$; taking $\tau=\sigma$, overall:

$p\left({H}_{0}|x\right)={\left[1+\frac{1-{\rho }_{0}}{{\rho }_{0}}\frac{{\text{e}}^{-\frac{1}{2}{x}^{2}/\left({\sigma }^{2}+1\right)}}{{\text{e}}^{-\frac{1}{2}{x}^{2}}}\frac{1}{\sqrt{{\sigma }^{2}+1}}\right]}^{-1}={\left[1+\frac{1-{\rho }_{0}}{{\rho }_{0}}\frac{1}{{B}_{0}}\right]}^{-1}$

So we have an equation of the same form as before, and we can now discuss the prior $\rho_0(\sigma)$. Our approach is to measure what is lost by asserting the null value zero when the alternative holds. By the asymptotic behavior of Bayesian inference, if the model is incorrectly specified, the posterior accumulates on the model closest to the true one in Kullback-Leibler divergence. The divergence $D_{KL}(N(x|\theta,1)\,\|\,N(x|0,1))$ therefore represents the loss incurred by the null model. Since the prior on $\theta$ is known in advance, the expected loss is:

$\int_{\Theta}D_{KL}\left(N(x|\theta,1)\,\|\,N(x|0,1)\right)\pi(\theta)\,\mathrm{d}\theta=\int\frac{1}{2}\theta^2\pi(\theta)\,\mathrm{d}\theta=\frac{1}{2}\sigma^2$

The model prior converts this expected loss into a probability statement through the self-information loss function, so the prior weight on the alternative model is:

$1-\rho_0(\sigma)\propto \mathrm{e}^{\frac{1}{2}\sigma^2}$

The prior weight of the null hypothesis is $\rho_0(\sigma)\propto 1$, so we get:

${\rho }_{0}\left(\sigma \right)=\frac{1}{1+\mathrm{exp}\left\{\frac{1}{2}{\sigma }^{2}\right\}}$

Then, for large $\sigma$, the prior $\rho_0(\sigma)$, and with it $p(H_0|x,\sigma)$, goes to zero, so $p(H_0)\to 0$. This choice makes the method consistent, and it explains why we do not advocate choosing a large $\sigma$.

4. BIC and AIC

4.1. BIC

4.1.1. Notation (Table 1)

Table 1. Notation 1.

4.1.2. Derivation of BIC

In this section we discuss the basic idea behind BIC (the Bayesian information criterion) and give the derivation of BIC.

As shown in Section 2, ${B}_{1,2}(y)=\frac{P(y|M_1)}{P(y|M_2)}$ is the Bayes factor for two models. Now consider more models $M_i$, $i\in\{1,\cdots,n\}$, with

$P(y|M_i)=\int f(y|\theta_i)g_i(\theta_i)\,\mathrm{d}\theta_i=\int\exp\left(\log\left(f(y|\theta_i)g_i(\theta_i)\right)\right)\mathrm{d}\theta_i$

where $\theta_i$ is the vector of parameters of model $M_i$, $f(y|\theta_i)$ is the likelihood function, and $g_i(\theta_i)$ is the probability density of the parameters $\theta_i$.

Denoting by $\tilde\theta_i$ the posterior mode, we use a Taylor expansion. Let $Q(\theta_i)=\log(f(y|\theta_i)g_i(\theta_i))$; then $Q(\theta_i)\approx Q(\tilde\theta_i)+(\theta_i-\tilde\theta_i)^{\mathrm{T}}\nabla_{\theta_i}Q|_{\tilde\theta_i}+\frac{1}{2}(\theta_i-\tilde\theta_i)^{\mathrm{T}}H_{\theta_i}(\theta_i-\tilde\theta_i)$, where $Q(\tilde\theta_i)=\log(f(y|\tilde\theta_i)g_i(\tilde\theta_i))$.

where $H_{\theta_i}$ is the $|\theta_i|\times|\theta_i|$ matrix with entries $H_{mn}=\frac{\partial^2 Q}{\partial\theta_m\partial\theta_n}\big|_{\tilde\theta_i}$, and $|\theta_i|=d_i=\dim(\theta_i)$. Since $Q$ attains its maximum at $\tilde\theta_i$, the Hessian matrix $H_{\theta_i}$ is negative definite. Let us denote $\tilde H_{\theta_i}=-H_{\theta_i}$, and then approximate $P(y|M_i)$:

$P\left(y|{M}_{i}\right)\approx \int \mathrm{exp}\left\{{Q|}_{{\stackrel{˜}{\theta }}_{i}}+\left({\theta }_{i}-{\stackrel{˜}{\theta }}_{i}\right){{\nabla }_{{\theta }_{i}}Q|}_{{\stackrel{˜}{\theta }}_{i}}+\frac{1}{2}{\left({\theta }_{i}-{\stackrel{˜}{\theta }}_{i}\right)}^{\text{T}}{H}_{{\theta }_{i}}\left({\theta }_{i}-{\stackrel{˜}{\theta }}_{i}\right)\right\}\text{d}{\theta }_{i}$

Then, using the multivariate normal density,

$\because \int \frac{1}{(2\pi)^{\frac{d_i}{2}}|\tilde H_{\theta_i}^{-1}|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(\theta_i-\tilde\theta_i)^{\mathrm{T}}\tilde H_{\theta_i}(\theta_i-\tilde\theta_i)\right)\mathrm{d}\theta_i=1$

$\Rightarrow\int \exp\left(-\frac{1}{2}(\theta_i-\tilde\theta_i)^{\mathrm{T}}\tilde H_{\theta_i}(\theta_i-\tilde\theta_i)\right)\mathrm{d}\theta_i=(2\pi)^{\frac{d_i}{2}}|\tilde H_{\theta_i}^{-1}|^{\frac{1}{2}}$

$\Rightarrow P(y|M_i)\approx f(y|\tilde\theta_i)g_i(\tilde\theta_i)(2\pi)^{\frac{d_i}{2}}|\tilde H_{\theta_i}^{-1}|^{\frac{1}{2}}$

$\Rightarrow \log P(y|M_i)\approx \log f(y|\tilde\theta_i)+\log g_i(\tilde\theta_i)+\frac{d_i}{2}\log(2\pi)+\frac{1}{2}\log|\tilde H_{\theta_i}^{-1}|$

Furthermore, let us invoke the Weak Law of Large Numbers. For given data $y$, $f(y|\theta_i)$ is the likelihood $L(\theta_i|y)$, and $L$ attains its maximum at the maximum likelihood estimate $\hat\theta_i$; for large samples the posterior mode $\tilde\theta_i$ coincides approximately with $\hat\theta_i$.

We set $g_i(\theta_i)=\begin{cases}1,&\theta_i\in\left[\tilde\theta_i-\frac{1}{2},\tilde\theta_i+\frac{1}{2}\right]\\ 0,&\text{else}\end{cases}$ (a flat unit-box prior, so that $\log g_i(\tilde\theta_i)=0$); then each element of the matrix $\tilde H_{\theta_i}$ can be expressed as:

${\stackrel{˜}{H}}_{mn}=-{\frac{{\partial }^{2}\mathrm{log}L\left({\theta }_{i}|y\right)}{\partial {\theta }_{m}\partial {\theta }_{n}}|}_{{\theta }_{i}={\stackrel{^}{\theta }}_{i}}$

Then $\tilde H_{\theta_i}$ behaves as a Fisher information matrix:

$\begin{array}{c}{\stackrel{˜}{H}}_{mn}=-{\frac{{\partial }^{2}\mathrm{log}\left({\prod }_{j=1}^{n}L\left({\theta }_{i}|{y}_{j}\right)\right)}{\partial {\theta }_{m}\partial {\theta }_{n}}|}_{{\theta }_{i}={\stackrel{^}{\theta }}_{i}}\\ =-{\frac{{\partial }^{2}{\sum }_{j=1}^{n}\mathrm{log}L\left({\theta }_{i}|{y}_{j}\right)}{\partial {\theta }_{m}\partial {\theta }_{n}}|}_{{\theta }_{i}={\stackrel{^}{\theta }}_{i}}\\ =-{\frac{{\partial }^{2}\left(\frac{1}{n}{\sum }_{j=1}^{n}n\mathrm{log}L\left({\theta }_{i}|{y}_{j}\right)\right)}{\partial {\theta }_{m}\partial {\theta }_{n}}|}_{{\theta }_{i}={\stackrel{^}{\theta }}_{i}}\end{array}$

In this case, since the data $y_1,\cdots,y_n$ are IID and $n$ is large, we can apply the Weak Law of Large Numbers: with random variables $X_j=n\log L(\theta_i|y_j)$ we have $\frac{1}{n}\sum_{j=1}^{n}n\log L(\theta_i|y_j)\stackrel{P}{\to}E\left(n\log L(\theta_i|y_1)\right)$. Moreover, for the Fisher information matrix:

$\begin{array}{c}{\stackrel{˜}{H}}_{mn}=-{\frac{{\partial }^{2}E\left[n\mathrm{log}L\left({\theta }_{i}|{y}_{j}\right)\right]}{\partial {\theta }_{m}\partial {\theta }_{n}}|}_{{\theta }_{i}={\stackrel{^}{\theta }}_{i}}\\ =-{n\frac{{\partial }^{2}E\left[\mathrm{log}L\left({\theta }_{i}|{y}_{j}\right)\right]}{\partial {\theta }_{m}\partial {\theta }_{n}}|}_{{\theta }_{i}={\stackrel{^}{\theta }}_{i}}\\ =-{n\frac{{\partial }^{2}E\left[\mathrm{log}L\left({\theta }_{i}|{y}_{1}\right)\right]}{\partial {\theta }_{m}\partial {\theta }_{n}}|}_{{\theta }_{i}={\stackrel{^}{\theta }}_{i}}\\ =n{I}_{mn}\end{array}$

$\Rightarrow|\tilde H_{\theta_i}|=n^{|\theta_i|}|I_{\theta_i}|$

where $I_{\theta_i}$ is the Fisher information matrix of a single data point $y_1$. Substituting back, we finally obtain for BIC:

$2\mathrm{log}P\left(y|{M}_{i}\right)=2\mathrm{log}L\left({\stackrel{^}{\theta }}_{i}|y\right)+2\mathrm{log}{g}_{i}\left({\stackrel{˜}{\theta }}_{i}\right)+|{\theta }_{i}|\mathrm{log}\left(2\pi \right)-|{\theta }_{i}|\mathrm{log}n-\mathrm{log}|{I}_{{\theta }_{i}}|$
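The quality of this approximation can be checked numerically. The sketch below (with simulated data, not from the paper) compares the exact log marginal likelihood under the unit-box prior with the approximation just derived, for the normal-mean model, where $d_i=1$, $|I_{\theta_i}|=1$, and $\log g_i(\tilde\theta_i)=0$:

```python
import math, random

random.seed(1)
n = 200
y = [random.gauss(0.3, 1.0) for _ in range(n)]  # simulated N(theta, 1) data
ybar = sum(y) / n                               # MLE; also the posterior mode here

def log_lik(theta):
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((v - theta) ** 2 for v in y)

# Exact log marginal likelihood over the unit-box prior, by trapezoidal quadrature
grid = 4001
h = 1.0 / (grid - 1)
terms = [log_lik(ybar - 0.5 + i * h) for i in range(grid)]
m = max(terms)
log_marg = m + math.log(h * sum(math.exp(t - m) * (0.5 if i in (0, grid - 1) else 1.0)
                                for i, t in enumerate(terms)))

# Approximation from the derivation: log P(y|M) ~ log L + (d/2) log(2*pi) - (1/2) log(n|I|),
# with d = 1 and Fisher information |I| = 1 for the normal mean
log_laplace = log_lik(ybar) + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(n)
print(log_marg, log_laplace)
```

With $n=200$ the unit box covers about $\pm 7$ posterior standard deviations, so the two numbers agree to high accuracy.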

4.2. AIC

4.2.1. Notation (Table 2)

Table 2. Notation 2.

4.2.2. Derivation of AIC

We can measure the quality of $\hat p_j(y)$ (as an estimate of the true density $p$) by the Kullback-Leibler distance:

$\begin{array}{c}K\left(p,{\stackrel{^}{p}}_{j}\right)=\int \text{ }p\left(y\right)\mathrm{log}\left(\frac{p\left(y\right)}{{\stackrel{^}{p}}_{j}\left(y\right)}\right)\text{d}y\\ =\int \text{ }p\left(y\right)\mathrm{log}p\left(y\right)\text{d}y-\int \text{ }p\left(y\right)\mathrm{log}{\stackrel{^}{p}}_{j}\left(y\right)\text{d}y\end{array}$

So we want to minimize $K(p,\hat p_j)$ over $j$; since the first term does not depend on $j$, this is the same as maximizing

${K}_{j}=\int \text{ }p\left(y\right)\mathrm{log}p\left(y|{\stackrel{^}{\theta }}_{j}\right)\text{d}y$

Since $K_j$ involves the unknown density $p$, we use the Monte Carlo method to estimate it:

${\stackrel{¯}{K}}_{j}=\frac{1}{n}\underset{i=1}{\overset{n}{\sum }}\mathrm{log}p\left({Y}_{i}|{\stackrel{^}{\theta }}_{j}\right)=\frac{{\mathcal{l}}_{j}\left({\stackrel{^}{\theta }}_{j}\right)}{n}$

However, this estimate is biased because the data are used twice: first to obtain the MLE and then to estimate the integral by the Monte Carlo method. The bias is approximately $\frac{d_j}{n}$; that is, we should prove

$\bar K_j-\frac{d_j}{n}\approx K_j$

Choose ${\theta }_{{j}_{0}}$, s.t. $p\left(y|{\theta }_{{j}_{0}}\right)={\mathrm{max}}_{{\theta }_{j}\in {\Theta }_{j}}p\left(y|{\theta }_{j}\right)$, and let

$s\left(y,{\theta }_{j}\right)=\frac{\partial \mathrm{log}p\left(y|{\theta }_{j}\right)}{\partial {\theta }_{j}},\text{ }H\left(y,{\theta }_{j}\right)=\frac{{\partial }^{2}\mathrm{log}p\left(y|{\theta }_{j}\right)}{\partial {\theta }_{j}^{2}}$

So $s(y,\theta_j)$ is the gradient (score) of $\log p(y|\theta_j)$, and $H(y,\theta_j)$ is the Hessian matrix of $\log p(y|\theta_j)$.

$\Rightarrow K_j\approx\int p(y)\left(\log p(y|\theta_{j_0})+(\hat\theta-\theta_{j_0})^{\mathrm{T}}s(y,\theta_{j_0})+\frac{1}{2}(\hat\theta-\theta_{j_0})^{\mathrm{T}}H(y,\theta_{j_0})(\hat\theta-\theta_{j_0})\right)\mathrm{d}y=K_0-\frac{1}{2n}Z_n^{\mathrm{T}}JZ_n$

where

${K}_{0}=\int p\left(y\right)\mathrm{log}p\left(y|{\theta }_{{j}_{0}}\right)\text{d}y,\text{\hspace{0.17em}}\text{\hspace{0.17em}}{Z}_{n}=\sqrt{n}\left({\stackrel{^}{\theta }}_{j}-{\theta }_{{j}_{0}}\right),\text{\hspace{0.17em}}\text{\hspace{0.17em}}J=-E\left[H\left(y,{\theta }_{{j}_{0}}\right)\right].$

$\Rightarrow \bar K_j\approx\frac{1}{n}\sum_{i=1}^{n}\left(\ell(Y_i,\theta_{j_0})+(\hat\theta-\theta_{j_0})^{\mathrm{T}}s(Y_i,\theta_{j_0})+\frac{1}{2}(\hat\theta-\theta_{j_0})^{\mathrm{T}}H(Y_i,\theta_{j_0})(\hat\theta-\theta_{j_0})\right)=K_0+A_n+(\hat\theta-\theta_{j_0})^{\mathrm{T}}S_n-\frac{1}{2n}Z_n^{\mathrm{T}}J_nZ_n=K_0+A_n+\frac{Z_n^{\mathrm{T}}S_n}{\sqrt{n}}-\frac{1}{2n}Z_n^{\mathrm{T}}JZ_n$

where

${A}_{n}=\frac{1}{n}\underset{i=1}{\overset{n}{\sum }}\left(\mathcal{l}\left({Y}_{i},{\theta }_{0}\right)-{K}_{0}\right),\text{ }{S}_{n}=\frac{1}{n}\underset{i=1}{\overset{n}{\sum }}\text{ }\text{ }s\left({Y}_{i},{\theta }_{0}\right)$

and

${J}_{n}=-\frac{1}{n}\underset{i=1}{\overset{n}{\sum }}\text{ }\text{ }H\left({Y}_{i},{\theta }_{0}\right)\stackrel{P}{\to }J$

Subtracting the two expansions gives

$\bar K_j-K_j\approx A_n+\frac{\sqrt{n}Z_n^{\mathrm{T}}S_n}{n}\approx A_n+\frac{Z_n^{\mathrm{T}}JZ_n}{n}$

From the theory of asymptotic distributions, we have three claims:

Claim 4.1 $Z_n=\sqrt{n}(\hat\theta-\theta_{j_0})\to N(0,J^{-1}VJ^{-1})$, where $V=Var(s(Y,\theta_{j_0}))=J$ (so that $J^{-1}VJ^{-1}=J^{-1}$).

Claim 4.2 $\sqrt{n}{S}_{n}=\frac{\sqrt{n}}{n}{\sum }_{i=1}^{n}\text{ }\text{ }s\left({Y}_{i},{\theta }_{{j}_{0}}\right)\to N\left(0,V\right)$

Claim 4.3 Let $\epsilon$ be a random vector with mean $\mu$ and covariance $\Sigma$, and let $Q=\epsilon^{\mathrm{T}}A\epsilon$; then

$E\left(Q\right)=trace\left(A\Sigma \right)+{\mu }^{\text{T}}A\mu$

So, with these claims above,

$\Rightarrow E(\bar K-K)\approx E(A_n)+E\left(\frac{Z_n^{\mathrm{T}}JZ_n}{n}\right)=0+\frac{E(Z_n^{\mathrm{T}}JZ_n)}{n}=\frac{\mathrm{trace}(JJ^{-1}VJ^{-1})}{n}+0^{\mathrm{T}}J0=\frac{\mathrm{trace}(VJ^{-1})}{n}=\frac{\mathrm{trace}(I)}{n}=\frac{d_j}{n}$

$⇒{\stackrel{^}{K}}_{j}=\frac{{\mathcal{l}}_{j}\left({\stackrel{^}{\theta }}_{j}\right)}{n}-\frac{{d}_{j}}{n}={\stackrel{¯}{K}}_{j}-\frac{{d}_{j}}{n}$

So, we define

$AIC\left(j\right)=2n{\stackrel{^}{K}}_{j}=2{\mathcal{l}}_{j}\left({\stackrel{^}{\theta }}_{j}\right)-2{d}_{j}$
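As a small sketch with hypothetical simulated data, the definition can be applied directly to the point-null problem of the next subsection; constants common to both models are dropped from $\ell$, since only differences in AIC matter:

```python
import math, random

def aic(loglik, d):
    # paper's convention: AIC(j) = 2 l_j(theta_hat_j) - 2 d_j, larger is better
    return 2 * loglik - 2 * d

random.seed(0)
n = 50
y = [random.gauss(0.1, 1.0) for _ in range(n)]  # weak hypothetical signal
ybar = sum(y) / n

ll_m0 = -0.5 * sum(v * v for v in y)             # l(0), common constant dropped
ll_m1 = -0.5 * sum((v - ybar) ** 2 for v in y)   # l(theta_hat), same constant dropped
print(aic(ll_m0, 0), aic(ll_m1, 1))              # prefer the larger value
```

The difference works out to $n\bar Y^2-2$, so $M_1$ is preferred exactly when $|\bar Y|>\sqrt{2/n}$.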

4.3. Example of Simple Model

Let us consider again the example of Section 3: take data $Y_1,\cdots,Y_n\sim N(\theta,1)$ and compare the two models $M_0: N(0,1)$ and $M_1: N(\theta,1)$. With the same hypotheses as in Section 3.2, we test:

${H}_{0}:\theta =0\text{\hspace{0.17em}}\text{vs}\text{\hspace{0.17em}}{H}_{1}:\theta \ne 0$

By standard normal distribution we have,

$Z=\frac{\stackrel{¯}{Y}-0}{\sqrt{Var\left(\stackrel{¯}{Y}\right)}}=\sqrt{n}\stackrel{¯}{Y}$

To control the Type I error of our test at $\alpha=0.05$, using the Z table we reject $H_0$ if $|Z|>z_{\alpha/2}\approx 1.96$ (we take $|Z|>2$ for simplicity), which implies that we reject $H_0$ if $|\bar Y|>\frac{2}{\sqrt{n}}$.

Case 1: BIC

As shown in Section 4.1, $2\log P(y|M_i)\approx 2\log L(\hat\theta_i|y)+2\log g_i(\tilde\theta_i)+|\theta_i|\log(2\pi)-|\theta_i|\log n-\log|I_{\theta_i}|$. For comparing the two models, we can discard the terms that stay bounded as $n$ grows and take $BIC=\log L(\hat\theta_i|y)-\frac{|\theta_i|}{2}\log n$. Thus,

For ${H}_{0}$,

$BIC=\mathrm{log}L\left(0\right)-\frac{0}{2}\mathrm{log}n=-\frac{n{\stackrel{¯}{Y}}^{2}}{2}-\frac{n{S}^{2}}{2}$

and ${H}_{1}$,

$BIC=\mathrm{log}L\left(\stackrel{^}{\theta }\right)-\frac{1}{2}\mathrm{log}n=-\frac{n{S}^{2}}{2}-\frac{1}{2}\mathrm{log}n$

where $S^2=\frac{1}{n}\sum_i(Y_i-\bar Y)^2$. If we want to choose $M_1$ as the better model, we need $-\frac{n\bar Y^2}{2}-\frac{nS^2}{2}<-\frac{nS^2}{2}-\frac{1}{2}\log n$, in other words $|\bar Y|>\sqrt{\frac{\log n}{n}}$. BIC is thus an estimate of (a monotone function of) the posterior probability of a model in the Bayesian setup.

Case 2: AIC

From Section 4.2, $AIC=2\ell_j(\hat\theta_j)-2d_j$. With $S^2=\frac{1}{n}\sum_i(Y_i-\bar Y)^2$, the log-likelihood (dropping constants) is $\ell(\theta)=-\frac{n(\bar Y-\theta)^2}{2}-\frac{nS^2}{2}$, and for comparison we may equivalently use the halved form $AIC=\ell_j(\hat\theta_j)-d_j$. Thus,

For ${H}_{0}$,

$AIC=\mathcal{l}\left(0\right)-0=-\frac{n{\stackrel{¯}{Y}}^{2}}{2}-\frac{n{S}^{2}}{2}$

and ${H}_{1}$,

$AIC=\mathcal{l}\left(\theta \right)-1=-\frac{n{S}^{2}}{2}-1$

If we want to choose $M_1$ as the better model at this point, we need $-\frac{n\bar Y^2}{2}-\frac{nS^2}{2}<-\frac{nS^2}{2}-1$, which implies $|\bar Y|>\frac{\sqrt{2}}{\sqrt{n}}$. AIC thus estimates, up to a constant, the Kullback-Leibler distance between the fitted model and the unknown true density.
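The three rejection rules derived above can be tabulated side by side; this sketch prints the thresholds on $|\bar Y|$: $2/\sqrt{n}$ for the z-test, $\sqrt{\log n/n}$ for BIC, and $\sqrt{2/n}$ for AIC:

```python
import math

print(f"{'n':>8} {'z-test':>8} {'BIC':>8} {'AIC':>8}")
for n in (10, 100, 1000, 100000):
    z_thr = 2 / math.sqrt(n)
    bic_thr = math.sqrt(math.log(n) / n)
    aic_thr = math.sqrt(2 / n)
    print(f"{n:>8} {z_thr:>8.4f} {bic_thr:>8.4f} {aic_thr:>8.4f}")
```

For large $n$ the BIC threshold dominates the other two, so BIC demands relatively stronger evidence to abandon the null; this is the same consistency behavior discussed for the Lindley paradox.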

5. Conclusion

The question of how to choose a best model, and indeed what a best model is, is hard to define; more precisely, the controversy has existed for a long time and will no doubt continue. In this paper we have discussed the Bayes factor in hypothesis testing. The Bayes factor is clearly being used in more and more fields of statistical research, and together with the two standard methods, AIC and BIC, it can be used for model selection. However, we should also notice that all of these methods have their own limitations, such as the sensitivity to priors revealed by Lindley's paradox. Even though both frequentist and Bayesian statisticians have come up with new ideas, these are still hard to implement or to understand universally, and from a statistical point of view a method also needs to be general enough to apply widely. For Lindley's paradox, the partial Bayes factor avoids the sensitivity to priors by taking a minimal training sample from the data set to construct the prior and then applying it to the rest of the data. The partial Bayes factor does reduce the influence of prior sensitivity, but how to find the minimal training sample can itself be a hard problem. The same holds for the fractional Bayes factor: even though it provides a rule for choosing the data fraction used by the partial Bayes factor, it still has many limitations we need to consider.

Cite this paper: Nie, X. (2020) Bayes Factor with Lindley Paradox and Tow Standard Methods in Model. Open Journal of Statistics, 10, 74-86. doi: 10.4236/ojs.2020.101006.
References

   Vehtari, A., Gelman, A. and Gabry, J. (2017) Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC. Statistics and Computing, 27, 1413-1432.

   Hartman, B.M. and Groendyke, C. (2013) Model Selection and Averaging in Financial Risk Management. North American Actuarial Journal, 17, 216-228.
https://doi.org/10.1080/10920277.2013.824374

   Bayarri, M.J., Berger, J.O., Forte, A. and Garca-Donato, G. (2012) Criteria for Bayesian Model Choice with Application to Variable Selection. The Annals of Statistics, 40, 1550-1577.
https://doi.org/10.1214/12-AOS1013

   Wasserman, L. (2000) Bayesian Model Selection and Model Averaging. Journal of Mathematical Psychology, 44, 92-107.
https://doi.org/10.1006/jmps.1999.1278

   Ardila, F., et al. (2008) Root Polytopes and Growth Series of Root Lattices. SIAM Journal on Discrete Mathematics, 25, 1-17.

   Kass, R.E. and Raftery, A.E. (1995) Bayes Factors. Journal of the American Statistical Association, 90, 773-795.

   Bayarri, M.J. and Berger, J.O. (2004) The Interplay of Bayesian and Frequentist Analysis. Statistical Science, 19, 58-80.

   Spezzaferri, F., Verdinelli, I. and Zeppieri, M. (2007) Bayes Factors for Goodness of Fit Testing. Journal of Statistical Planning and Inference, 137, 43-56.
https://doi.org/10.1016/j.jspi.2005.09.002

   Jeffreys, H. (1961) Theory of Probability. 3rd Edition, Oxford University Press, Oxford.

   Bhat, H.S. and Kumar, N. (2010) On the Derivation of the Bayesian Information Criterion.

   Kadane, J.B. and Lazar, N.A. (2004) Methods and Criteria for Model Selection. Journal of the American Statistical Association, 99, 279-290.

   Merhav, N. and Feder, M. (1998) Universal Prediction. IEEE Transactions on Information Theory, 44, 2124-2147.
https://doi.org/10.1109/18.720534

   Lindley, D.V. (1957) A Statistical Paradox. Biometrika, 44, 187-192.
https://doi.org/10.1093/biomet/44.1-2.187

   Bartlett, M.S. (1957) A Comment on D. V. Lindley’s Statistical Paradox. Biometrika, 44, 533-534.
https://doi.org/10.2307/2332888

   Robert, C.P. (1993) A Note on Jeffreys-Lindley Paradox. Statistica Sinica, 3, 601-608.

   Villa, C. and Walker, S. (2015) On the Mathematics of the Jeffreys-Lindley Paradox. Communications in Statistics—Theory and Methods, 46, 12290-12298.

   Cousins, R.D. (2014) The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics. Synthese, 194, 395-432.
https://doi.org/10.1007/s11229-015-0687-3

   Shively, T. and Walker, S.G. (2013) On the Equivalence between Bayesian and Classical Hypothesis Testing.

   Vardeman, S.B. (1987) Testing a Point Null Hypothesis: The Irreconcilability of p-Values and Evidence Comment. Journal of the American Statistical Association, 82, 130-131.
http://www.jstor.org/stable/2289136
https://doi.org/10.2307/2289136

   Penny, W.D. (2012) Comparing Dynamic Causal Models Using AIC, BIC and Free Energy. NeuroImage, 59, 319-330.
https://doi.org/10.1016/j.neuroimage.2011.07.039

   Zellner, A. and Siow, A. (1980) Posterior Odds Ratios for Selected Regression Hypotheses. Trabajos de Estadistica Y de Investigacion Operativa, 31, 585-603.
https://doi.org/10.1007/BF02888369
