Many statisticians are concerned with the question of model selection, that is, how to identify the “best model” for real data. Different approaches have been proposed since the last century, including well-known methods such as the F-test, AIC, BIC, and Bayesian model averaging. We focus on the Bayesian approach, in which the data are regarded as arising from one of several candidate models. Denoting the parameter of model M_k by θ_k and its prior probability by π(M_k), with likelihood f(y | θ_k, M_k) and parameter prior π(θ_k | M_k), the posterior for model M_k with parameter θ_k is proportional to the prior model probability times the marginal likelihood, and we obtain the posterior model probability shown below.
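In this notation, with m_k(y) = ∫ f(y | θ_k, M_k) π(θ_k | M_k) dθ_k denoting the marginal likelihood of model M_k, the posterior model probability takes the standard form
\[
P(M_k \mid y) \;=\; \frac{\pi(M_k)\, m_k(y)}{\sum_{j} \pi(M_j)\, m_j(y)}.
\]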
In a Bayesian analysis, the priors on each model and on the parameters of model k are proper and subjective, and the Bayesian solution to such questions is to compute the posterior probability of each model. For model selection, we would choose the model M_k that maximizes the posterior probability P(M_k | y).
However, the Bayes factor has its own limitations: by itself it can only quantify how strongly the data favour a hypothesized model over a null model. The Bayes factor is also closely tied to the priors; changing the width of the prior changes the Bayes factor. At this point, we need to consider the Lindley paradox.
In Section 2, we give a simple and general explanation of the Bayes factor. In Section 3, we discuss Lindley’s paradox. Section 4 contains the main theoretical treatment of AIC and BIC, for which we give derivations, together with a simple example of using AIC and BIC.
2. Bayes Factor
Before anything else, we first construct one of the most important quantities in Bayesian methods: the Bayes factor.
Suppose we have data D and two competing models, M_1 and M_2, each assigned a prior probability. The posterior probability of each model follows from the rules of conditional probability.
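For each model M_i (i = 1, 2), conditioning on the data D gives
\[
P(M_i \mid D) \;=\; \frac{P(D \mid M_i)\, P(M_i)}{P(D)}.
\]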
Recall that the posterior odds of M_1 against M_2 are the ratio P(M_1 | D)/P(M_2 | D), that P(D | M_i) is the marginal likelihood of model M_i, and that P(M_i) denotes its prior probability. Then, by Bayes’ rule, the posterior odds factor into the likelihood ratio times the prior odds.
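In this notation, the relation reads
\[
\frac{P(M_1 \mid D)}{P(M_2 \mid D)} \;=\; \frac{P(D \mid M_1)}{P(D \mid M_2)} \times \frac{P(M_1)}{P(M_2)}.
\]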
The likelihood ratio P(D | M_1)/P(D | M_2) in this expression is defined as the Bayes factor, and it is also the ratio of the marginal likelihoods. Furthermore, we denote the Bayes factor as follows.
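Writing m_i(D) = ∫ f(D | θ_i, M_i) π(θ_i | M_i) dθ_i for the marginal likelihood of model M_i (with θ_i its parameter vector and π(θ_i | M_i) its prior),
\[
B_{12} \;=\; \frac{P(D \mid M_1)}{P(D \mid M_2)} \;=\; \frac{m_1(D)}{m_2(D)}.
\]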
The Bayesian method fits many testing problems because it can quantify the strength of the evidence in favour of the null model, in contrast to p-values, which are usually regarded only as a measure of evidence against the null. The Bayes factor (Jeffreys, 1961) is the standard tool in Bayesian hypothesis testing. Assume that f(D | θ_0, H_0) and f(D | θ_1, H_1) are the likelihoods for D under two competing hypotheses H_0 and H_1 with parameters θ_0 and θ_1, and let π_0(θ_0) and π_1(θ_1) be their prior distributions. The Bayes factor for H_0 against H_1 is then the ratio of the corresponding marginal likelihoods.
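In this notation,
\[
B_{01} \;=\; \frac{\int f(D \mid \theta_0, H_0)\, \pi_0(\theta_0)\, d\theta_0}{\int f(D \mid \theta_1, H_1)\, \pi_1(\theta_1)\, d\theta_1}.
\]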
When B_01 is greater than one, the evidence from the data favours H_0 against H_1. The Bayes factor can therefore avoid many of the limitations of p-value testing, and its development for testing statistical models can be applied in many areas of research.
3. Priors and Lindley Paradox
3.1. Introduction to Lindley Paradox
Lindley’s paradox shows how a p-value (or the number of standard deviations) used in a frequentist test can lead to a completely different inference from a Bayesian hypothesis test.
When we are faced with improper priors (priors that do not integrate to one) in null hypothesis testing and model selection, problems arise. Such priors can be acceptable for other purposes, such as estimation, but here they cause difficulties. So we consider testing a null hypothesis H_0 against an alternative H_1.
Defining m_i(x) as the marginal density of the data x under model i, we can set up the problem as follows.
When the parameter priors π_0(θ_0) and π_1(θ_1) are proper density functions, the posterior probability of the null is given below.
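One natural form, writing P(H_0) and P(H_1) for the prior model probabilities and m_i(x) = ∫ f_i(x | θ_i) π_i(θ_i) dθ_i for the marginal density under model i, is
\[
P(H_0 \mid x) \;=\; \frac{P(H_0)\, m_0(x)}{P(H_0)\, m_0(x) + P(H_1)\, m_1(x)}.
\]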
Then suppose that we use improper priors, so that each prior is specified only up to an arbitrary positive constant. For model i, the marginal likelihood m_i(x) is the integrated likelihood, and rescaling the improper prior rescales m_i(x) by the same arbitrary constant. An equation of the following form is then obtained.
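Writing the improper prior under model i as c_i h_i(θ_i), with c_i > 0 arbitrary (this notation is assumed here for concreteness), the posterior probability of the null becomes
\[
P(H_0 \mid x) \;=\; \frac{P(H_0)\, c_0 \int f_0(x \mid \theta_0)\, h_0(\theta_0)\, d\theta_0}{P(H_0)\, c_0 \int f_0(x \mid \theta_0)\, h_0(\theta_0)\, d\theta_0 \;+\; P(H_1)\, c_1 \int f_1(x \mid \theta_1)\, h_1(\theta_1)\, d\theta_1},
\]
which depends on the arbitrary ratio c_0/c_1.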
So, by choosing these arbitrary constants differently, we can change the posterior arbitrarily. Proper but very diffuse priors can cause similar problems, because the marginal probability of the data under a complex model with a diffuse prior will be very small. So one thing we must keep in mind when working with Bayes factors is that a clearly specified, simpler model is favoured. This phenomenon is called the Lindley paradox.
3.2. A Simple Model in Lindley Paradox
Many authors have discussed this so-called paradox in different ways. Here we look for a simple way to consider the problem. The usual point-null hypothesis testing problem is to test the following.
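In the standard formulation, with θ_0 a fixed value of interest, the hypotheses are
\[
H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0.
\]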
The test is carried out in a normal model with known variance σ². The prior probability of H_0 is denoted p.
Let g(θ) be the prior distribution for the unknown parameter θ under the alternative hypothesis.
The Bayes factor is then given below.
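With f(x | θ) the sampling density and g the prior of θ under H_1, the Bayes factor in favour of the null is
\[
B_{01} \;=\; \frac{f(x \mid \theta_0)}{\int f(x \mid \theta)\, g(\theta)\, d\theta}.
\]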
In order to exhibit the paradox, we can formalize it by comparing two normal models. Consider a physical system in which a quantity X can be measured, and assume the measurement error is normal with known standard deviation σ. We use σ in defining both priors: the prior under the null hypothesis is concentrated at a single value, while the width of the prior under the alternative is allowed to depend on σ.
The Bayes factor representing the odds in favour of the null hypothesis can then be computed as follows.
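Assuming, for concreteness, the models M_0: X ~ N(0, σ²) and M_1: X ~ N(θ, σ²) with prior θ ~ N(0, τ²), so that the marginal distribution of X under M_1 is N(0, σ² + τ²), the Bayes factor is
\[
B_{01} \;=\; \sqrt{\frac{\sigma^2 + \tau^2}{\sigma^2}}\;
\exp\!\left(-\frac{x^2\, \tau^2}{2\,\sigma^2\,(\sigma^2 + \tau^2)}\right).
\]
For a fixed observation x, this tends to infinity as τ grows, so a very diffuse prior under the alternative pushes the evidence toward the null.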
In this case, prior probabilities P(H_0) and P(H_1) can be assigned to the two hypotheses. Given the observed result x, Bayes’ theorem combines these prior probabilities with the conditional distribution of the data under each hypothesis to give the overall (marginal) distribution of the data, and hence the posterior probability of each hypothesis, as shown below.
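Writing f(x | H_i) for the density of the data under hypothesis H_i, the posterior probabilities are
\[
P(H_i \mid x) \;=\; \frac{P(H_i)\, f(x \mid H_i)}{P(H_0)\, f(x \mid H_0) + P(H_1)\, f(x \mid H_1)}, \qquad i = 0, 1.
\]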
Then we can put the prior mass for the null at the hypothesized mean value and spread the rest of the prior probability as a normal distribution centred there, with some variance, say τ². Evaluating the conditional densities of the data under each hypothesis, we can compute P(H_0 | x) and P(H_1 | x); the overall result is given below.
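Under these assumptions (data summarized by a sample mean x̄ ~ N(θ, σ²/n), null value θ_0, and alternative prior θ ~ N(θ_0, τ²), so that the marginal distribution of x̄ under H_1 is N(θ_0, σ²/n + τ²)), the posterior probability of the null is
\[
P(H_0 \mid \bar{x}) \;=\; \frac{P(H_0)\, \phi(\bar{x};\, \theta_0,\, \sigma^2/n)}{P(H_0)\, \phi(\bar{x};\, \theta_0,\, \sigma^2/n) \;+\; P(H_1)\, \phi(\bar{x};\, \theta_0,\, \sigma^2/n + \tau^2)},
\]
where φ(·; μ, v) denotes the normal density with mean μ and variance v. For a result that is only just significant at a fixed frequentist level, this posterior probability tends to one as n grows, which is the paradox.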
So we have an equation like the one before, and we can now discuss the prior. Our approach is to measure the value of the alternative assumptions relative to the null. Asymptotically, in a Bayesian analysis, if the model is incorrectly specified, the posterior accumulates on the model that is closest to the true model in Kullback-Leibler divergence. As a result, this divergence represents the loss incurred, and because the prior is specified in advance, the expected loss can be written down.
The model prior represents the loss associated with a probability statement; it is determined by the self-information loss function, and it induces a corresponding prior on the alternative model. Combining this with the prior of the null hypothesis gives the resulting posterior. For large samples this quantity goes to zero, so the method is consistent, and we do not advocate choosing a very large prior variance.
4. BIC and AIC
4.1.1. Notation (Table 1)
Table 1. Notation 1.
4.1.2. Derivation of BIC
In this section we discuss the basic idea of how BIC (the Bayesian information criterion) is constructed and give a derivation of BIC.
As shown earlier, the Bayes factor compares two models through their marginal likelihoods; we now consider several candidate models, for each of which the marginal likelihood is the integral below.
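In the notation explained next, the marginal likelihood of model M_k is
\[
m_k(y) \;=\; \int L(\theta_k \mid y)\, g(\theta_k)\, d\theta_k,
\]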
where θ_k is the vector of parameters of model M_k, L is the likelihood function, and g is the p.d.f. of the prior distribution of the parameters.
Denote by θ̃ the posterior mode and set Q(θ) = log(L(θ | y) g(θ)), so that (dropping the model index for brevity) m(y) = ∫ exp{Q(θ)} dθ. We then use a second-order Taylor expansion of Q about θ̃.
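Since the gradient of Q vanishes at the mode θ̃, the expansion reads
\[
Q(\theta) \;\approx\; Q(\tilde{\theta}) \;+\; \tfrac{1}{2}\,(\theta - \tilde{\theta})^{\mathsf T}\, \tilde{A}\,(\theta - \tilde{\theta}),
\]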
where Ã is the matrix of second derivatives, with entries Ã_{ij} = ∂²Q(θ)/∂θ_i∂θ_j evaluated at θ = θ̃. Since Q attains its maximum at θ̃, this Hessian matrix is negative definite. We then approximate m(y) as follows.
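Substituting the expansion into the integral (assuming the quadratic approximation is adequate where the integrand is non-negligible),
\[
m(y) \;\approx\; \exp\{Q(\tilde{\theta})\} \int \exp\!\left\{\tfrac{1}{2}\,(\theta - \tilde{\theta})^{\mathsf T}\, \tilde{A}\,(\theta - \tilde{\theta})\right\} d\theta.
\]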
Then, recognizing the integrand as proportional to a multivariate normal density, we obtain the following.
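With d denoting the dimension of θ, the Gaussian integral evaluates to
\[
m(y) \;\approx\; \exp\{Q(\tilde{\theta})\}\, (2\pi)^{d/2}\, \lvert -\tilde{A} \rvert^{-1/2}.
\]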
Furthermore, let us bring in the weak law of large numbers. For the given data y, L(θ | y) is the likelihood, and L attains its maximum at the maximum likelihood estimate θ̂.
We now write the data as a sample of n observations; then each element of the matrix Ã can be expressed as a sum over the observations, as shown below.
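Assuming independent observations y_1, …, y_n with common density f(· | θ), and noting that the prior contributes only a single term,
\[
\tilde{A}_{ij} \;=\; \left.\frac{\partial^2 Q(\theta)}{\partial \theta_i\, \partial \theta_j}\right|_{\theta=\tilde{\theta}}
\;=\; \sum_{t=1}^{n} \left.\frac{\partial^2 \log f(y_t \mid \theta)}{\partial \theta_i\, \partial \theta_j}\right|_{\theta=\tilde{\theta}}
\;+\; \left.\frac{\partial^2 \log g(\theta)}{\partial \theta_i\, \partial \theta_j}\right|_{\theta=\tilde{\theta}}.
\]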
Then, let I_1(θ) denote the Fisher information matrix of a single observation.
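Under the usual regularity conditions, its entries are
\[
\bigl[I_1(\theta)\bigr]_{ij} \;=\; -\,\mathbb{E}\!\left[\frac{\partial^2 \log f(Y \mid \theta)}{\partial \theta_i\, \partial \theta_j}\right].
\]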
In this case, since the data are i.i.d. and n is large, we can apply the weak law of large numbers to the averaged second derivatives; their limit is expressed through the Fisher information matrix, as follows.
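By the weak law of large numbers, as n grows,
\[
\frac{1}{n}\sum_{t=1}^{n} \frac{\partial^2 \log f(y_t \mid \theta)}{\partial \theta_i\, \partial \theta_j}
\;\xrightarrow{P}\;
\mathbb{E}\!\left[\frac{\partial^2 \log f(Y \mid \theta)}{\partial \theta_i\, \partial \theta_j}\right]
\;=\; -\,\bigl[I_1(\theta)\bigr]_{ij},
\qquad \text{so} \qquad
\tilde{A} \;\approx\; -\,n\, I_1(\hat{\theta}).
\]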
Here I_1(θ̂) is the Fisher information matrix for a single data point, and after substituting back into the approximation of m(y) we finally obtain the BIC.
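Taking logarithms, substituting Ã ≈ −n I_1(θ̂) together with θ̃ ≈ θ̂, and discarding terms that remain bounded as n grows, we arrive at
\[
\log m(y) \;\approx\; \log L(\hat{\theta} \mid y) \;-\; \frac{d}{2}\,\log n \;=\; \mathrm{BIC},
\]
where d is the number of free parameters. The commonly tabulated form, −2 log L(θ̂ | y) + d log n, is simply −2 times this quantity.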
4.2.1. Notation (Table 2)
Table 2. Notation 2.
4.2.2. Derivation of AIC
We can measure the quality of p̂_j (as an estimate of the true density p) by the Kullback-Leibler distance.
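Writing p for the true density of the data and p̂_j(y) = f(y; θ̂_j) for the density of model j with its maximum likelihood estimate plugged in (notation assumed here), the distance is
\[
K(p, \hat{p}_j) \;=\; \int p(y)\, \log\!\left(\frac{p(y)}{\hat{p}_j(y)}\right) dy.
\]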
So we want to minimize K(p, p̂_j) over j, which is the same as maximizing the quantity below.
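Since ∫ p(y) log p(y) dy does not depend on j, minimizing K(p, p̂_j) is equivalent to maximizing
\[
K_j \;=\; \int p(y)\, \log \hat{p}_j(y)\, dy.
\]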
To calculate K_j, we can use the Monte Carlo method to form an estimate.
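Using the observed data Y_1, …, Y_n themselves as the Monte Carlo sample, a natural estimate is
\[
\bar{K}_j \;=\; \frac{1}{n}\sum_{i=1}^{n} \log \hat{p}_j(Y_i) \;=\; \frac{\ell_j(\hat{\theta}_j)}{n},
\]
where ℓ_j denotes the log-likelihood of model j.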
However, this estimate is biased because the data are being used twice: first to obtain the MLE and then to estimate the integral by the Monte Carlo method. The bias is approximately d_j/n, where d_j is the number of parameters in model j, and this is what we should prove.
Choose , s.t. , and let
So, is the Jacobian matrix of , and is the Hessian matrix of .
From the theory of asymptotic distributions, we have three claims:
Claim 4.1 , where .
Claim 4.3 Let X be a random vector with mean μ and covariance matrix Σ, and let A be a symmetric matrix; then the expectation of the quadratic form XᵀAX is as given below.
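In this notation, the standard identity is
\[
\mathbb{E}\bigl[X^{\mathsf T} A X\bigr] \;=\; \mu^{\mathsf T} A\, \mu \;+\; \operatorname{tr}(A\Sigma).
\]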
So, combining the claims above, the bias of the Monte Carlo estimate can be evaluated.
So, correcting for this bias, we define the AIC of model j as follows.
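One common convention, and the one that matches the comparisons in Section 4.3, is
\[
\mathrm{AIC}_j \;=\; \ell_j(\hat{\theta}_j) \;-\; d_j,
\]
where d_j is the number of parameters in model j; the familiar form −2 ℓ_j(θ̂_j) + 2 d_j is −2 times this quantity.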
4.3. Example of Simple Model
Let us consider again the example from Section 3. Take data X_1, …, X_n i.i.d. N(θ, σ²) with σ² known, and compare two models: M_0, in which θ = 0, and M_1, in which θ is unrestricted. Taking the same hypotheses as in Section 3.2, we test the following.
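With these assumed models, the test is
\[
H_0: \theta = 0 \quad \text{versus} \quad H_1: \theta \neq 0.
\]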
By the standard normal distribution, we have the statistic below.
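With x̄ the sample mean,
\[
Z \;=\; \frac{\sqrt{n}\,\bar{x}}{\sigma} \;\sim\; N(0, 1) \quad \text{under } H_0.
\]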
To control the Type I error of our test at level α (we take α = 0.05), the Z table gives the critical value 1.96, so we would reject H_0 if |Z| > 1.96; that is, if |x̄| > 1.96 σ/√n, we reject H_0.
Case 1: BIC
From what we showed in Section 4.1, log m(y) ≈ log L(θ̂ | y) − (d/2) log n. However, when comparing two models, the parts common to both can be dropped, so we take BIC_k = ℓ_k(θ̂_k) − (d_k/2) log n for model M_k. Thus, for this example, the comparison is as follows.
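Writing ℓ_k for the log-likelihood of model M_k and using Z = √n x̄/σ as above, the terms of the log-likelihood common to both models cancel, leaving
\[
\mathrm{BIC}_1 - \mathrm{BIC}_0 \;=\; \bigl[\ell_1(\hat{\theta}_1) - \ell_0\bigr] - \tfrac{1}{2}\log n
\;=\; \frac{n\,\bar{x}^2}{2\sigma^2} - \frac{1}{2}\log n
\;=\; \frac{Z^2}{2} - \frac{\log n}{2},
\]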
where d_0 = 0 and d_1 = 1. If we want to choose M_1 as the better model, then we need BIC_1 > BIC_0, in other words |Z| > √(log n). Recall that BIC is an estimate of a function of the posterior probability of a model under the Bayesian setup.
Case 2: AIC
From Section 4.2, AIC_k = ℓ_k(θ̂_k) − d_k, with ℓ_k and d_k defined as above. Deducing further for this example, we thus have the following.
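Using the same cancellation of the common log-likelihood terms as in the BIC case,
\[
\mathrm{AIC}_1 - \mathrm{AIC}_0 \;=\; \frac{n\,\bar{x}^2}{2\sigma^2} - 1 \;=\; \frac{Z^2}{2} - 1.
\]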
If we want to choose M_1 as the better model at this point, we would need AIC_1 > AIC_0, which implies |Z| > √2 ≈ 1.41. Recall that AIC estimates a constant plus the relative Kullback-Leibler distance between each fitted model and the unknown true density.
The questions of how to choose a best model, and indeed what a best model is, are hard to settle. More precisely, the controversy has existed for a long time, and no doubt it will continue. In this paper we have discussed the Bayes factor in hypothesis testing; the Bayes factor is clearly being used more and more widely in many fields of statistical research. Alongside the Bayes factor, the standard methods AIC and BIC can be considered for model selection. However, we should also notice that all of these methods have their own limitations, such as the sensitivity to priors in Lindley’s paradox. Even though both frequentist and Bayesian statisticians have come up with new ideas, these are still hard for everyone else to implement or understand. Moreover, from a statistical point of view, a method also needs to be general enough to apply widely. For example, to avoid the prior sensitivity in Lindley’s paradox, the partial Bayes factor takes a minimal training sample from the data set to construct the prior and then applies it to the rest of the data. The partial Bayes factor does, to some extent, reduce the influence of prior sensitivity, but finding the minimal training sample can itself be a hard problem. The same holds for the fractional Bayes factor: even though it improves the way the data are chosen for the partial Bayes factor, it still has many limitations that we need to consider.