A Comparison of Three Models in Multivariate Binary Longitudinal Data Analysis:

Show more

1. Introduction

Many of the recent medical experiments and social research are characterized by multiple outcomes. In this research, we focus on the multivariate binary outcomes in longitudinal data. It is a general approach of the univariate longitudinal data when we have more than one outcome. Each individual i has a vector of responses for different outcomes,
$k=1,2,\cdots ,K$ ,

${Y}_{i}={\left[{Y}_{i11},{Y}_{i21},\cdots ,{Y}_{iT1},{Y}_{i12},{Y}_{i22},{Y}_{i32},\cdots ,{Y}_{iT2},\cdots ,{Y}_{i1K},{Y}_{i2K},\cdots ,{Y}_{iTK}\right]}^{\text{T}}$

There are multiple methods for analyzing multivariate binary longitudinal data, which are different depending on the research aims. Let’s start by referring “outcome” to the longitudinal readings within the same dependent variable and “response” for any longitudinal reading of any dependent variables without specification. If the outcomes are uncorrelated, then we can analyze them as univariate longitudinal data separately. This independence assumption rarely happens in real applications because the outcomes, which are taken from the same observation, are highly probably to be correlated. The best statistical analysis of multiple outcomes should account for the correlation among the outcomes and the occasions since it is longitudinal data [1] [2] [3] . The three methods of analyzing longitudinal data are:

1) Reducing many outcomes into one summary outcome which results in estimating a unique set of regression coefficients [3] .

2) Analyzing each outcome separately with account for the correlations, which results in estimating a set of regression coefficients for each outcome [4] .

3) Analyzing the outcomes jointly and accounting for the correlations, which results in estimating a set of regression coefficients for each combination case of them (the proposed method).

Our investigation method of the longitudinal data via the marginal models is using GEE approach. The GEE approach of longitudinal data analysis is constructed for discrete and continuous outcomes^{1}. Then, applied the alternating logistic regression in multivariate binary model, which is dependent on modeling the association between responses by the odds ratio. The specification of the odds ratio matrix R, is the most important point on the GEE analysis of the

marginal models. We need to estimate $\frac{\left(KT\right)\left(KT-1\right)}{2}=\left(\frac{KT}{2}\right)$ parameters. It is

a large correlation matrix and it will rapidly increase when we have more parameters in R (correlation matrix). Since our interests are in binary multivariate longitudinal outcomes, then it is more natural to use the odds ratio to describe the association between the binary responses [3] . There are two applied methods to reduce the parameter estimations of the within subject association matrix R. The two methods are Kronecker product and regression model for pairwise odds ratio. The two methods are equivalent to each other and have approximately the same results. We will focus on the regression model for pairwise odds ratio to be method 1 in the comparison part.

Also, a SAS macro is built to analyze multivariate binary longitudinal outcomes using Kronecker product [4] . In this paper, we will consider the regression model for pairwise odds ratio to be the statistical tool of the first aim (method 1) [3] , and Kronecker product for the second aim (method 2) [4] . In the next section is an overview of the three methods. The application is in Section 3. The last section is about the discussion.

2. Methods

2.1. Method 1: Regression Modeling for Odds Ratio

The GEE model is $g\left({\mu}_{i}\right)={X}_{i}\beta $ where $E\left({Y}_{i}\right)={\mu}_{i}$ , and ${X}_{i}$ is the covariate matrix, assuming the vector of responses for subject i is related to the covariates via logit link function. The primary advantage of using GEE in multivariate binary data is working in two-way marginal relationships between the pairs of the responses for any dimension of outcomes and occasions without high order associations. The penalty of this advantage is the cluster size which is T × K. It will cause a larger correlation matrix R since the dimension of R matrix is TK × TK. R

is a diagonal and symmetric matrix, thus we need to estimate $\left(\begin{array}{c}TK\\ 2\end{array}\right)$ . For example,

if we have 5 outcomes obtained at 5 occasions, then we need to estimate 300 parameters, because that we need to obtain a parsimonious model of working

correlation R to be less than $\left(\begin{array}{c}TK\\ 2\end{array}\right)$ .

As we mentioned previously, we can use the odds ratio instead of the correlation since the responses are binary. Then, the matrix R will consist of pairwise log odds ratios [3] . It is based on fitting a regression model for marginal pairwise

odds ratio to estimate less parameters than $\left(\begin{array}{c}TK\\ 2\end{array}\right)$ [3] . Let $\gamma $ be a vector of size $\left(\begin{array}{c}TK\\ 2\end{array}\right)$ . of all non-redundant pairwise odds ratio in R. Here the goal is reducing

the length of the vector $\gamma $ . Consider the matrix of the odds ratio R consist of three odds ratio types:

1) Let ${v}_{tk.t{k}^{\prime}}$ be the inter-outcome odds ratio which compares the outcome k with the outcome ${k}^{\prime}$ at time t:

${v}_{tk.t{k}^{\prime}}=\frac{P\left({Y}_{tk}=1,{Y}_{t{k}^{\prime}}=1\right)P\left({Y}_{tk}=0,{Y}_{t{k}^{\prime}}=0\right)}{P\left({Y}_{tk}=1,{Y}_{t{k}^{\prime}}=0\right)P\left({Y}_{tk}=0,{Y}_{t{k}^{\prime}}=1\right)}$ (1)

2) Let a ${\alpha}_{tk,{t}^{\prime}k}$ be the intra-outcome odds ratio which compares outcome k at time t with the same outcome at time ${t}^{\prime}$ :

${\alpha}_{tk,{t}^{\prime}k}=\frac{Pr\left({Y}_{tk}=1,{Y}_{{t}^{\prime}k}=1\right)Pr\left({Y}_{tk}=0,{Y}_{{t}^{\prime}k}=0\right)}{Pr\left({Y}_{tk}=1,{Y}_{{t}^{\prime}k}=0\right)Pr\left({Y}_{tk}=0,{Y}_{{t}^{\prime}k}=1\right)}$ (2)

3) Let ${T}_{tk,{t}^{\prime}k}$ be the cross odds ratio which compares the outcome k at time t with outcome k at time ${t}^{\prime}$ :

${T}_{tk,{t}^{\prime}{k}^{\prime}}=\frac{Pr\left({Y}_{tk}=1,{Y}_{{t}^{\prime}{k}^{\prime}}=1\right)Pr\left({Y}_{tk}=0,{Y}_{t{k}^{\prime}}=0\right)}{Pr\left({Y}_{tk}=1,{Y}_{{t}^{\prime}{k}^{\prime}}=0\right)Pr\left({Y}_{tk}=0,{Y}_{{t}^{\prime}{k}^{\prime}}=1\right)}$ (3)

The general regression model that relates the inter-outcome (υ), the intra-outcome (α) and the cross odds ratio (τ) to matrix z via log link function [2] , is:

$\mathrm{log}\left(OR\right)=Z\gamma $ (4)

where is γ a parsimonious vector of the odds ratio parameters that relates the $\left(\begin{array}{c}KT\\ 2\end{array}\right)$ pairwise log odds ratio to Z. Here Z is a fixed indicator matrix that

specifies the outcomes, times and interactions between them which are under consideration. Using the regression model of (4) helps us to model the three types of association in matrix Z.

2.2. Method 2: Kronecker Product

To apply method 2 we used Kronecker Product approach [4] , they built a SAS macro to estimate separate sets of regression coefficients for each binary outcome. This method depends on the GEE estimation of the marginal models. The strength of this macro is creating a covariate design to allow for separate regression coefficients for each outcome. The GEE model is $\text{logit}\left({\mu}_{i}\right)={X}_{i}\beta $ where $E\left({Y}_{i}\right)={\mu}_{i}$ and ${X}_{i}$ is the covariate matrix and assuming the vector of responses for subject i is related to the covariate via logit link function.

Using Kronecker product of X matrix and K dimensional identity matrix be used to generate regression coefficients for each outcome. This method is beneficial to fit more than one longitudinal outcome using GEE approach, and accounts for the correlation among the outcomes. The results are separated estimation coefficients for each outcome, which is helpful for the second aim.

2.3. Method 3: The Proposed Approach

Let
${Y}_{itk}$ denotes the value of k^{th} binary outcome measured at t occasion or visit number for individual i, where
$i=1,2,\cdots ,N$ ,
$k=1,2,\cdots ,K$ , and
$t=1,2,\cdots ,T$ . Let
${X}_{itc}$ denotes the value of the covariate C that is measured at occasion t for individual i where
$c=1,2,\cdots ,C$ (C is the number of covariate variables).

Let ${Y}_{i}~\text{Bernolli}\left({\pi}_{1}\right),{Y}_{2}~\text{Bernolli}\left({\pi}_{2}\right),\cdots ,{Y}_{k}~\text{Bernolli}\left({\pi}_{K}\right)$ . Assuming the outcomes are obtained in the same time for same C covariates. We will collapse K outcomes into one outcome Z. Since Multinomial trial process is a simple generation of Bernoulli trial processes, then the new variable Z has multinomial distribution with $J=2K$ possible outcomes, ${Z}_{1},{Z}_{2},\cdots ,{Z}_{J}$ Suppose each possible outcome can occur with probability ${p}_{1},{p}_{2},\cdots ,{p}_{J}$ , then the probability of ${Z}_{1}$ occurs ${m}_{1}$ times, ${Z}_{2}$ occurs ${m}_{2}$ times, $\cdots $ , ${Z}_{J}$ occurs ${m}_{J}$ times is following the probability mass function:

$f\left({m}_{1},{m}_{2},\cdots ,{m}_{J}\right)=\left(\begin{array}{c}N\\ {m}_{1},{m}_{2},\cdots ,{m}_{J}\end{array}\right){p}_{1}^{{m}_{1}}{p}_{2}^{{m}_{2}}\cdots {p}_{J}^{{m}_{J}}$ (5)

where ${\sum}_{j-1}^{J}{p}_{j}}=1$ , ${\sum}_{j-1}^{J}{m}_{j}}=N$ . Then, $Z={\left({Z}_{1},{Z}_{2},\cdots ,{Z}_{J}\right)}^{\text{T}}$ has a multinomial distribution with $p=\left({p}_{1},{p}_{2},\cdots ,{p}_{J}\right)$ parameters. To collapse many binary vectors into one multinomial vector, we suggest to do transformation from a binary coding system to a decimal coding system. Then we analyze the new outcome Z instead of K binary vectors. Since the data is longitudinal, we consider a correlation among the new variable ${Z}_{it}$ . Then we use an appropriate exciting method for analyzing correlated nominal multinomial responses.

For correlated multinomial responses, there are many methods using random effects model and marginal models. For random effect models, Hedeker adopts a method in Bock’s model [5] . For marginal models, the GEE approach avoids specification of the distribution of the multinomial outcome by adopting a “working” correlation matrix. However, a GEE approach using a local odds ratio parameterization is one appropriate method to analyze the correlated data [6] . In the multinomial responses, we set up one of the possible outcomes to be a reference level and estimate the parameter in term of $J-1$ remaining levels compared to the reference level.

After estimating the parameters of nominal multinomial responses ${Z}_{it}$ , we must move to the “decoding” step. The decoding step is to transform each category in ${Z}_{itj}$ response to its original definition. For example, the parameter estimation ${\stackrel{^}{\beta}}_{1}$ for category z = 3 is for the original code “011”. Then, expresses the estimated parameter of the logit probabilities of accruing outcome 1 and outcome 2 simultaneously. In conclusion, using the GEE approach for nominal multinomial correlated responses could be helpful to analyze the joint distribution of many binary longitudinal outcomes.

3. Application

The Florida Dental Care Study (FDCS) is a longitudinal study of oral health and dental service utilization. It is conducted in four centers in North Florida in the United States [7] . The sample size is 873 subjects who had baseline interview exams and four clinical examination interviews at 6, 12, 18 and 24 months past baseline visit. By the end of 24 months, 87.5% of 873 remained in the study. 3.3% refused to participate, 1.1% unable to participate due to medical issues, 3.3% were deceased and 4% were dropout. The issue of bias was discussed by comparing the characteristics of the patients who remained in the study at 24 months, with those who did not for any reason [8] . Also, there are three binary outcomes that were measured over four time intervals (0 - 6, 6 - 12, 12 - 18, 18 - 24 months). In this application, a subset of five covariates were chosen in Table 1. The three binary outcomes measured in this study are “problem oriented visit”, “dental cleaning” and “dental check up”. At the end of each interval, each subject was asked whether he had visited a dentist within the past 6 months “problem oriented visit” and whether this dental visit was for a “dental cleaning” or “check up”. The three binary outcomes are coded as (0 = “No”, 1 = “Yes”). First, we consider the following three models.

Model 1:

$\text{logit}\left({\mu}_{i}\right)={\beta}_{0}+{\beta}_{1}{X}_{ira}+{\beta}_{2}{X}_{gender}+{\beta}_{3}{X}_{cavit}+{\beta}_{4}{X}_{loose}+{\beta}_{5}{X}_{able}$

Model 2:

$\text{logit}\left({\mu}_{i}\right)={\displaystyle {\sum}_{k=1}^{3}\left({\beta}_{k0}+{\beta}_{k1}{X}_{ira}+{\beta}_{k2}{X}_{gender}+{\beta}_{k3}{X}_{cavit}+{\beta}_{k4}{X}_{loose}+{\beta}_{k5}{X}_{able}\right)}$

Model 3:

$\text{logit}\left({\mu}_{i}\right)={\displaystyle \underset{j=1}{\overset{7}{\sum}}\left({\beta}_{j0}+{\beta}_{j1}{X}_{ira}+{\beta}_{j2}{X}_{gender}+{\beta}_{j3}{X}_{cavit}+{\beta}_{j4}{X}_{loose}+{\beta}_{j5}{X}_{able}\right)}$

$j=\{\begin{array}{c}\begin{array}{ll}\begin{array}{cc}1& \text{when}\end{array}\hfill & \left({Y}_{i1}=0\right)\&\left({Y}_{i2}=0\right)\&\left({Y}_{i3}=1\right)\hfill \end{array}\\ \begin{array}{cc}\begin{array}{cc}2& \text{when}\end{array}& \left({Y}_{i1}=0\right)\&\left({Y}_{i2}=1\right)\&\left({Y}_{i3}=0\right)\end{array}\\ \begin{array}{cc}\begin{array}{cc}3& \text{when}\end{array}& \left({Y}_{i1}=0\right)\&\left({Y}_{i2}=1\right)\&\left({Y}_{i3}=1\right)\end{array}\\ \begin{array}{cc}\begin{array}{cc}4& \text{when}\end{array}& \left({Y}_{i1}=1\right)\&\left({Y}_{i2}=0\right)\&\left({Y}_{i3}=0\right)\end{array}\\ \begin{array}{cc}\begin{array}{cc}5& \text{when}\end{array}& \left({Y}_{i1}=1\right)\&\left({Y}_{i2}=0\right)\&\left({Y}_{i3}=1\right)\end{array}\\ \begin{array}{cc}\begin{array}{cc}6& \text{when}\end{array}& \left({Y}_{i1}=1\right)\&\left({Y}_{i2}=1\right)\&\left({Y}_{i3}=0\right)\end{array}\\ \begin{array}{cc}\begin{array}{cc}7& \text{when}\end{array}& \left({Y}_{i1}=1\right)\&\left({Y}_{i2}=1\right)\&\left({Y}_{i3}=1\right)\end{array}\end{array}$

The reference category is 0 when ( ${Y}_{i1}=0$ ) & ( ${Y}_{i2}=0$ ) & ( ${Y}_{i3}=0$ ). Before running the three models, we measured the association between the binary responses using the odds ratio instead of the correlation. The odds ratio and the 95% confidence intervals are in Table 2. The association parameters are larger than one and significant for all the odds ratios because the 95% confidence intervals exclude the value one. Then, we expect a large differences between the three models. The strongest association is between “cleaning” and “check up” outcomes.

We ran the three models with no constraints in the log odds ratio, then we got 66 unique estimated odds ratio (#α = 12, #υ = 18, #τ = 36) in models 1 and 2. The results of the three models are in Tables 3-5. The covariates Cavit and Able are significant in model 1 while Gender and IRA are close and Loose is not. In model 2, the covariates are estimated for each outcome separately. The Gender covariate still not significant for all. However, The Loose parameter estimation is changed completely from model 1 to model 2 to be significant for all. This change

Table 1. Covariates variables.

Table 2. Estimated odds ratio.

Table 3. Model 1 results.

Table 4. Model 2 Results.

Table 5. Model 3 results.

due to separate the effect of the covariate for each outcome, when they are highly correlated. In model 3, we find Gender parameter estimation still not significant for the all. Loose parameter estimation becomes significant only for outcome ${Y}_{1},{Y}_{1\&2},{Y}_{1\&3}$ and ${Y}_{1,2\&3}$ which means the high association between the “problem visit” outcome and the others hides the effect of Loose covariate in the outcome in model 1. To get a closer look at the parameters’ differences in the three models, we aggregate the parameter estimations of all models for each covariate in Figure 1.

The dotplot in Figure 1 indicates many points. First, the variation of the parameter estimation between model 1, model 2 and model 3 are relatively. The variation in IRA and Gender parameter estimations over the three models is limited due to insignificant results. IRA parameter estimation is close to being significant in model 1, then model 2 explains the source of the effect which is the second outcome. The third model clarifies the effect after breaking up the responses for each joint case and finding that there is no significant estimations. In a contrary case, the parameter estimation of Able covariate starts significant in model 1 and 2 expect for the first outcome, then model 3 illustrates some significant parameters related to the significant association between outcomes 1 and 3. The Loose variable begins as not significant in model 1 but it is significant in model 2 for all outcomes while model 3 clarifies the sources of effect which is the associations between the three outcomes. The case of the intercept is clearly starting significant in model 1 and model 2, but model 3 explains the cause of the significance which are outcomes 1 and 3. The parameter of Cavit covariate shows how the third model changes the sign of the parameter to negative after using the coding method.

Figure 1. The parameter estimations of each covariant through the three models.

In summary, The FSCS study of three outcomes “problem visit”, “cleaning” and “check up” and four intervals is a good example of highly correlated responses that are used to investigate parameter estimation differences of the three models. Over this example, we see that the effect of each covariate on the responses varies over the three models. The most significant covariates over the three models are Cavity, Loose tooth, and the ability to pay an unexpected US$500 dental bill, but with difficulty.

4. Discussion

In this article, we review two existing methods for analyzing binary multivariate longitudinal data and proposed a third method. In most longitudinal experiments, there are more than one outcomes are obtained from the subjects over many occasions. Usually, the researcher is interested in investigating the effect of many independent variables on these outcomes. We applied three different methods to be beneficial in this case. The aim of this article is to give a more general overview of analyzing binary multivariate longitudinal data in case of expecting a correlation among the outcomes other than the correlation among the repeated measurements.

If the researcher is interested in estimating one group the estimated effects of covariates for all outcomes and accounting of the multivariate structure, then the first method is the most appropriate one. However, if the researcher is eager to separate the effects of covariates on the outcomes, then the second method is appropriate. The joint analysis of all of the outcomes in the third method is beneficial to investigate the relationship between the correlated outcomes and how the covariates are affecting on them. The third method consists of encoding and decoding stages. The encoding stage is a step of defining each case of associations between the outcomes into simple number. It is an idea of store of all the possible combinations of the outcomes into one outcome. We converted three binary outcomes into one multinomial outcome. The decoding stage is the last step of returning each code to its original definition.

Then, as a result, the joint analysis in method 3 is limited to small number of outcomes because the number of parameters of estimation is rapidly increasing when the number of outcomes is increased. It would be beneficial for future work to improve the third method to reduce the dimension of the parameters estimations by adding a filtering stage between the encoding and decoding steps. We presented a simulation part for many scenarios and application. The FCSD example is applied to the three methods and we saw how the parameter estimations of the three methods were changed over covariates. Each method explains more details starting from model 1, and moving to models 2 and 3; each method answers a different question. In conclusion, the ultimate chosen model depends on the research goal.

References

[1] Liang, K.Y. and Zeger, S.L. (1986) Longitudinal Data Analysis Using Generalized Linear Models. Biometrika, 73, 13-22. https://doi.org/10.1093/biomet/73.1.13

[2] Carey, V., Zeger, S. and Diggle, P. (1993) Modelling Multivariate Binary Data with Alternating Logistic Regressions. Biometrika, 80, 517-526.

https://doi.org/10.1093/biomet/80.3.517

[3] O’Brien, L. and Fitzmaurice, G. (2004) Analysis of Longitudinal Multiple-Source Binary Data Using Generalized Estimating Equations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 53, 177-193.

https://doi.org/10.1046/j.0035-9254.2003.05296.x

[4] Shelton, B., Gilbert, G., Liu, B. and Fisher, M. (2004) A SAS Macro for the Analysis of Multivariate Longitudinal Binary Outcomes. Computer Methods and Programs in Biomedicine, 76, 163-175. https://doi.org/10.1016/j.cmpb.2004.05.005

[5] Hedeker, D. (2003) A Mixed-Effects Multinomial Logistic Regression Model. Statistics in Medicine, 22, 1433-1446. https://doi.org/10.1002/sim.1522

[6] Touloumis, A., Agresti, A. and Kateri, M. (2013) GEE for Multinomial Responses Using a Local Odds Ratios Parameterization. Biometrics, 69, 633-640.

https://doi.org/10.1111/biom.12054

[7] Gilbert, G., Duncan, R.P., Kulley, A., Coward, T. and Heft, M. (1997) Evaluation of Bias and Logistics in a Survey of Adults at Increased Risk for Oral Health Decrements. Journal of Public Health Dentistry, 57, 48-58.

https://doi.org/10.1111/j.1752-7325.1997.tb02472.x

[8] Gilbert, G., Duncan, R. and Vogel, W. (1998) Determinants of Dental Care Use in Dentate Adults: Six-Monthly Use during a 24-Month Period in the Florida Dental Care Study. Social Science U Medicine, 47, 727-737.

https://doi.org/10.1016/S0277-9536(98)00148-8