In association studies, to avoid false positive results caused by population stratification, family-based tests of association are often used for fine mapping of a disease susceptibility locus. Transmission disequilibrium test (TDT) proposed by Spielman is an association test for case-parents triad data   . The classic TDT is McNemar test on a 2-way transmission/non-transmission table.
Multiple linked markers can provide more polymorphism information than single marker (especially SNPs). However, haplotype phase is often uncertain for multi-locus genotype. There may be several haplotype pairs compatible with observed genotype. Many haplotype reconstruction algorithms are developed, e.g. expectation-maximization (EM) algorithm  , pseudo MCMC  , Bayesian haplotype inference  for population data of unrelated individuals.
TDT has been extended for tightly linked marker loci. Zhao et al. (2000)  performed TDT via two steps: constructing the underlying haplotype first, and then constructing the 2-way transmission/non-transmission table in TDT by assigning a weight to each possible phase. Based on conditional likelihood, Clayton (1999) proposed a score test with program TRANSMIT  . The TDT-type methods were widely used in medical studies and GWAS studies   .
In this paper, based on full likelihood, we proposed a likelihood ratio test integrating haplotype construction for case-parent or case-parents data.
The key idea is from classical case-control association study. Association between trait and marker yields some allele or haplotypes are found more often in case group than control group. In case-parents data, the allele or haplotypes of the case are transmitted from the parents. And so, we can test whether a particular allele or haplotype exists more often in transmitted (case) than in non-transmitted (control).
For tightly linked marker loci, we treat haplotypes as extended alleles and use the transmission information to reduce the phase uncertainty.
Let H denote the number of haplotypes for l tightly linked loci. The set of all possible genotypes (haplotype pairs) is .
For a case-parents trios, the genotypes for father, mother and affected child are denoted by respectively. Let and denote haplotype frequency for transmitted and non-transmitted group, and denote relative risk for genotype i/j reference to H/H. Under multiplicative haplotype risk , where is relative risk for haplotype i reference to H. Under assumption of Hardy-Weinberg equilibrium, the probability of the father transmitted haplotype i and non-transmitted j, whereas the mother transmitted haplotype k and non-transmitted j to the child conditional on that the child is affected, is
where A means the child is affected, and
is regarded as frequency of haplotype i for disease population.
However, we only observed the genotype of each locus. The haplotype phase of the l tightly linked loci is often uncertainty, especial for missing parental genotype data. So there exist ambiguities to decide which haplotype is transmitted or not transmitted from the parent. Then
where is the set of haplotype groups which haplotype pairs , , and are compatible with and , respectively. Here, missing parental genotype is allowable.
Suppose there are N case-parents trios, and then there are 2N parents in all. The genotypes for the r-th trios are . The log-likelihood
It is difficult to find the maximum likelihood estimate (MLE) directly. We employ expectation-maximization (EM) algorithm to estimate the haplotype frequencies , by treating underlying haplotype pairs as “missing data”. The complete-data log-likelihood after adding missing data is
We can show that the expected complete-data log-likelihood in E (expectation) step (see Appendix)
An iterative procedure can be used to find the MLE via EM algorithm. Given the current estimates , the estimates in the next step:
. Under the null hypothesis , we have
Let denote the Likelihood under the null hypothesis or . The likelihood ratio statistic
follows an asymptotic c2 distribution with H-1 degrees of freedom (df) when the null hypothesis is true.
According to the relationship between and
We can get
The maximum likelihood estimator (MLE) of the relative risk for haplotype i relative to haplotype H is therefore given as
We apply our method to the published data which was used for family-based association analysis for immunoglobulin A nephropathy (IgAN)  . Two tightly linked loci C2093T and C2180T, located in the 3' untranslated region (UTR) in Megsin gene, were studied. This dataset contains 232 families with an affected child were entered into analysis (Table 1). There are missing data since the genotyping information of some parents are not available. Out of 232 nuclear families, only 125 (53.9%) families are complete case-parents trios for both loci, and 107 (46.1%) families are case-parent families for the locus C2093T or C2180T, which including 25 C2180T single parent families (SPF), 26 C2093T SPF, and 56 C2093T & C2180T SPF.
There exist 4 haplotypes for Megsin C2093T-C2180T, CC, CT, TC and TT, coded as 1, 2, 3 and 4. Give initial value and precision . The haplotype frequency estimates after the iterative procedure stops
The log-likelihood . For the null model, the haplotype frequency estimation
and log-likelihood . So likelihood ratio statistic
, df = 3, P = 0.0027. The results show the significant difference between F and F*, the transmitted haplotype frequencies
Table 1. 232 families entered into analysis.
SPF: single parent family.
and non-transmitted haplotype frequencies. is much higher than . That means haplotype 2 (2093C-2180T) is over-transmitted from parents to cases. In addition, reference to haplotype 4 (2093T-2180T), the estimated relative haplotype risk .
In our simulations, two tightly linked single nucleotide polymorphism (SNP) marker genotype data for 100 case-parents trios are generated. In a similar way in Morris et al. (1997)  , samples are generated for simulation using the transmission model provided by Bickeboller et al. (1995)  .
where is penetrance for genotype , and is prevalence.
The frequencies of mutant disease allele D and normal allele d in a disease locus are denoted by and. Under Hardy-Weinberg Equilibrium, the population prevalence is therefore. We assume that there is no recombination between disease and marker locus. Linkage diseqilibrium parameter to detect association between disease locus and marker locus is defined as in Sham (1995) 
where is frequency of disease-marker haplotype si. The satisfies
and. All implies linkage
equilibrium. means marker haplotype i is positively associated with the disease, and means marker haplotype i is negatively associated. In our simulations, the LD pattern is given as
i.e. marker haplotype 1 is positively associated with the disease, and the other haplotypes are equally negatively associated, if there is association. In addition, the marker haplotypes are assumed to be equally frequent, and,
. Three classical genetic models, recessive, dominant
and common models, specified heredity modes are considered. These models are shown in Table 2.
For each genetic model, 5000 replicated samples were generated to evaluate the distribution of test statistics in the case of no association, i.e. e = 1. The quantile-quantile (QQ) plots for test statistic from 5000 samples under the null hypothesis for the three genetic models are showed in Figures 1(a)-(c). The
Table 2. Genetic models for simulation study.
(a) (b) (c)
Figure 1. QQ plots for test statistic from 5000 samples. Red: expected quantiles of distribution versus observed quantiles; Green: the line y = x. (a) Recessive Model; (b) Dominant Model; (c) Common Model.
plots show that the scatters of the observed quantiles and the expected quantiles of distribution are very close to the line y = x.
In addition, 1000 replicated samples were generated for statistical power analysis. The statistical significance level a = 0.05 or 0.01. The empirical power for our proposed method, comparison with TRANSMIT, are summarized in Table 3, under different association level e for three disease model.
Table 3. Comparisons of empirical power.
Haplotype frequencies are usually estimated when haplotypes are reconstructed or linkage disequilibrium is tested. For tightly linked loci, the likelihood as a function of haplotype frequencies in transmitted and non-transmitted group was given for case-parents data. We estimate the MLEs of the haplotype frequencies via an EM algorithm. The results showed that haplotype frequencies could be estimated using a simple iterative procedure. The likelihood ratio test to compare haplotype frequencies in transmitted and non-transmitted group was used to detect association. When the information of parents is not available in the nuclear family, classical TDT is no longer suitable. However, as you can see in the application, missing parental genotypes are allowed in our method. In addition, when there are siblings available, the information can be used to reduce the uncertainty of phase, and the likelihood can be given similarly.
Under different simulated conditions where heredity mode, linkage disequilibrium coefficient are specified, 5000 and 1000 replicated samples were generated to evaluate the distribution of test statistics and statistical power respectively. Our method is more powerful than TRANSMIT.
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.
Appendix (E Step in EM Algorithm)
Give the current estimate, the distribution of missing data
And then the expected complete-data log-likelihood