In longitudinal studies, subjects who are likely to progress to a new state during the study are monitored over time. For example, in clinical trials, subjects who are at high risk of a certain disease are monitored through follow-up visits. Some subjects complete all of their follow-up visits, and their failure times are recorded exactly. Others miss some follow-up visits and may learn, when they return, that the event of interest has already occurred. The event times for these subjects are censored within the corresponding person-specific time intervals. Although each subject has multiple follow-up intervals, researchers often use only the one interval that contains the true unknown failure time, unless the failure time has been determined exactly. Such data are known as “partly interval-censored failure time data”. Several studies have analyzed partly interval-censored data, including Gao et al. (2017), Zhao et al. (2008), and Kim (2003), among others.
Another commonly available data type in longitudinal studies is called pooled repeated observations. Subjects have multiple follow-up visits as usual. At every visit, a binary outcome for the event of interest is recorded for each subject. All those repeated binary outcomes are pooled together to develop a model that analyzes the effects of time-dependent covariates on the occurrence of the event. Cupples et al. (1988) and D’Agostino et al. (1990) pooled such repeated observations with binary outcomes for the event of interest into a single sample. They then used a logistic regression model to estimate the effects of the risk factors on the occurrence of the event. Each observation interval is considered a mini follow-up study in which the current risk factors are updated to predict events in the interval. Once an individual has an event in a particular interval, all subsequent intervals from that individual are excluded from the analysis.
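The pooling step described above can be illustrated with a minimal sketch. The function name `pool_person_intervals` and the record layout (one `(covariate, event)` pair per interval) are our own assumptions for illustration and are not taken from the cited studies.

```python
from typing import Dict, List, Tuple


def pool_person_intervals(
    visits: Dict[str, List[Tuple[float, int]]]
) -> List[Tuple[str, int, float, int]]:
    """Stack per-interval records from all subjects into one pooled sample.

    `visits` maps a subject id to (covariate, event) pairs, one pair per
    follow-up interval, in time order. Each returned row is
    (subject, interval_index, covariate, event).
    """
    pooled = []
    for subject, records in visits.items():
        for j, (x, y) in enumerate(records, start=1):
            pooled.append((subject, j, x, y))
            if y == 1:
                # Intervals after the first event are excluded.
                break
    return pooled


# Hypothetical data: subject "a" has the event in interval 3.
example = {
    "a": [(1.0, 0), (1.5, 0), (2.0, 1), (2.5, 1)],
    "b": [(0.2, 0), (0.4, 0)],
}
rows = pool_person_intervals(example)
```

Each row of `rows` is then treated as one observation in the logistic regression, so a subject contributes as many rows as the number of intervals in which the subject was at risk.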
Now, we define pooled repeated partly interval-censored data. We have pooled repeated observations, but some binary outcomes and covariates are incomplete. They can only be determined with certain unknown probabilities within the corresponding specific follow-up visits. In this case, the analysis of such data requires a new method that combines a model that handles pooled repeated observations without censoring and a method that deals with partly or completely interval-censored data.
The main goal of this study is to estimate the effects of the time-dependent covariates on the occurrence of the event of interest (e.g., progression to a disease or becoming a frequent smoker). We extend the work of Finkelstein et al. (2010), who employed the conditional expected score test (CEST) to detect an association between a longitudinal marker and an event with missing binary outcomes, to the estimation problem in which the event of interest has a single progression state and the response is pooled, repeated, and partly interval-censored. We assume that the missing data are missing at random (MAR). Under MAR, there may be systematic differences between the observed and missing data, but these differences can be explained by the observed data. The EM algorithm was originally developed to handle MAR data.
The organization of this paper is as follows. In Section 2, we present a logistic regression model for pooled repeated partly interval-censored data. In Section 3, we provide the details of the computation of the MLEs of the regression parameters via the EM algorithm and the variance estimation through the missing information principle. Section 4 presents the simulation study results. Section 5 illustrates an application to a real data set. Finally, Section 6 briefly summarizes what we have achieved and discusses potential extensions of our work.
We consider longitudinal studies in which subjects are at risk of an event of interest and have follow-up visits. Some subjects complete all of their follow-up visits, but others miss some of their follow-up appointments and come back after the event of interest has occurred. Whenever they miss a visit, both their binary outcome for the event of interest and their covariates are missing. Our proposed model estimates the effects of time-dependent covariates on the event of interest.
Let be the time subject i experiences the event of interest, . At the beginning of the study, every subject is assigned to the same follow-up visits, , . Let be the indicator of whether or not subject i has had the event of interest in the jth interval given a subject was event-free through and , the covariate at time . Since we are interested in modeling a binary outcome, we use a logit link to model the probability of event as in .
We construct the full (complete) log-likelihood, assuming as if there were no missing visits while subjects are in the study.
where is the index of the last time subject i was in the study.
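As a minimal sketch of the complete-data log-likelihood described above, consider a single covariate and a logit link. The parameter names `alpha` (intercept) and `beta` (slope) and the function names are assumptions made for illustration; they are not the notation of the model equations in this section.

```python
import math
from typing import List, Tuple


def event_prob(alpha: float, beta: float, x: float) -> float:
    """Inverse logit of the linear predictor alpha + beta * x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))


def complete_loglik(
    alpha: float, beta: float, subjects: List[List[Tuple[float, int]]]
) -> float:
    """Sum Bernoulli log-likelihood terms over every interval at risk.

    `subjects` holds, for each subject, the (covariate, outcome) pairs
    from the first interval up to the last interval the subject was in
    the study, assuming no visit was missed.
    """
    total = 0.0
    for records in subjects:
        for x, y in records:
            p = event_prob(alpha, beta, x)
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical data: three person-intervals in total.
subjects = [[(0.0, 0), (0.0, 1)], [(0.0, 0)]]
ll = complete_loglik(0.0, 0.0, subjects)
```

With `alpha = beta = 0` every interval has event probability 0.5, so the three person-intervals contribute `3 * log(0.5)` in total.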
3.1. Parameter Estimation
Assume that the ith subject missed visits after time and came back at . is the index of the last time subject i made the visit and was event-free. is the index of the first time subject i was observed with the event of interest. Then , , and is missing for . For the subjects who do not miss visits, . Whenever subjects miss visits, their covariate value, , is also missing. We use the EM algorithm (Dempster et al., 1977) to estimate the parameters.
E-step: For individuals whose failure times are interval-censored, we need to estimate both and in the expression (3) for .
could be continuous or categorical (Masyn et al., 2014). We assume that has a linear growth curve with fixed effects to accommodate a real data set, NLSY97. That is,
where , . We estimate by for , where and are least squares estimators.
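The subject-specific imputation can be sketched as follows, assuming a straight-line growth curve fitted by ordinary least squares to one subject's observed visits. The function names and the dictionary layout (visit index mapped to covariate value) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def fit_line(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares intercept and slope for one subject."""
    n = len(times)
    t_bar = sum(times) / n
    x_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("need at least two distinct visit times")
    sxy = sum((t - t_bar) * (x - x_bar) for t, x in zip(times, values))
    slope = sxy / sxx
    return x_bar - slope * t_bar, slope


def impute_missing(observed: Dict[int, float], missing: List[int]) -> Dict[int, float]:
    """Predict the covariate at missed visits from the fitted line."""
    times = sorted(observed)
    intercept, slope = fit_line(times, [observed[t] for t in times])
    return {t: intercept + slope * t for t in missing}


# Hypothetical subject observed at visits 1, 2 and 5; visits 3 and 4 missed.
observed = {1: 2.0, 2: 3.0, 5: 6.0}
imputed = impute_missing(observed, [3, 4])
```

Because the line is fitted separately for each subject, two subjects who miss the same visit generally receive different imputed values.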
If is ordinal, we assign numbers to the corresponding categories. Then we again assume a linear growth curve with fixed effects to estimate the missing ’s. Let be the number of categories for this ordinal variable. For each individual i, the observed ’s are used in model (4) to compute and . Then we compute as usual.
Next, we create thresholds in order to uniquely assign into one of the categories. Note that and . We use the quantiles of this normal distribution to define the thresholds. Since we need to compute to define the thresholds, we need at least three distinct observed covariate values, ’s, for each subject; otherwise, would be undefined because of the zero degrees of freedom.
The observed ordinal covariates for some subjects do not cover all of the ordinal categories. Therefore, the ordinal logistic regression model does not work for estimating ordinal covariates. In Appendix 2, we provide a detailed rationale for choosing the fixed effects model, its extension to a general setting, and the challenges with the random effects model.
This is an extension of a geometric-type experiment, where the probability of success (progression) changes at each follow-up visit, , .
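To illustrate the geometric-type structure, the sketch below computes, for one interval-censored subject, the conditional probability that the event occurred in each missed interval, given that it occurred somewhere between the last event-free visit and the first visit with the event. This is the standard truncated form under changing success probabilities; it is our own illustration under that assumption, and the function name `interval_event_weights` is hypothetical.

```python
from typing import List, Sequence


def interval_event_weights(probs: Sequence[float]) -> List[float]:
    """Conditional probabilities of first event in each candidate interval.

    `probs` holds the per-interval event probabilities for the intervals
    between the last event-free visit and the first visit at which the
    event was observed. The weights are normalized by the probability
    that the event occurs in at least one of these intervals.
    """
    weights = []
    still_event_free = 1.0
    for p in probs:
        weights.append(still_event_free * p)
        still_event_free *= 1.0 - p
    total = 1.0 - still_event_free
    return [w / total for w in weights]


# Two missed intervals, each with event probability 0.5.
weights = interval_event_weights([0.5, 0.5])
```

When all per-interval probabilities are equal, the weights reduce to a truncated geometric distribution, which is the special case the text generalizes.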
M-step: We find the values of and that maximize the expected value of log-likelihood in Equation (3), conditioned on the observed data. Therefore, we have
where , if uncensored.
Expressions (5)-(7) are repeated until convergence. As there are no closed forms for and , we used the optimization function optim in R to obtain .
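A maximization step of this kind can be sketched in a few lines. The analysis in the text uses optim in R; the sketch below instead uses Newton-Raphson for a single covariate, which is a substitution made only to keep the example self-contained. Each row carries an expected event indicator `e` and a probability `r` of still being at risk, both assumed to come from the preceding E-step; these names and the function `m_step` are hypothetical.

```python
import math
from typing import Sequence, Tuple


def m_step(
    rows: Sequence[Tuple[float, float, float]],
    alpha: float = 0.0,
    beta: float = 0.0,
    max_iter: int = 50,
    tol: float = 1e-10,
) -> Tuple[float, float]:
    """Maximize sum of e*log(p) + (r - e)*log(1 - p) over (alpha, beta).

    Each row is (x, e, r): covariate, expected event indicator, and
    probability of being at risk in that interval.
    """
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, e, r in rows:
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
            resid = e - r * p          # score contribution
            w = r * p * (1.0 - p)      # information contribution
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("information matrix is singular")
        # Newton step: solve the 2x2 system information * delta = score.
        d_alpha = (h11 * g0 - h01 * g1) / det
        d_beta = (h00 * g1 - h01 * g0) / det
        alpha += d_alpha
        beta += d_beta
        if max(abs(d_alpha), abs(d_beta)) < tol:
            break
    return alpha, beta


# Hypothetical expected data: event fraction 0.25 at x = 0 and 0.5 at x = 1.
alpha_hat, beta_hat = m_step([(0.0, 0.25, 1.0), (1.0, 0.5, 1.0)])
```

In this small example the fitted probabilities reproduce the expected event fractions exactly, so `alpha_hat` equals the logit of 0.25 and `alpha_hat + beta_hat` equals the logit of 0.5.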
3.2. Variance Estimation
We apply Louis’ method for variance estimation using the notation in . Following the missing information principle, we compute the observed information by subtracting the missing information from the complete information.
where W is the observed data, i.e., the partly interval-censored pooled repeated observations, and V is the latent data, the true unknown counterpart of the interval-censored portion of W. is the observed posterior and is the augmented posterior.
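For orientation, Louis' identity is commonly written in the following form, where $\ell_c(\theta; W, V)$ denotes the complete-data log-likelihood. The symbols $\ell_c$ and $I_{\mathrm{obs}}$ are our own assumed notation and may differ from that of expression (8).

```latex
I_{\mathrm{obs}}(\theta)
  = \mathrm{E}\!\left[ -\frac{\partial^{2} \ell_c(\theta; W, V)}
                             {\partial\theta\,\partial\theta^{\top}} \,\middle|\, W \right]
  - \mathrm{Var}\!\left[ \frac{\partial \ell_c(\theta; W, V)}{\partial\theta} \,\middle|\, W \right]
```

The first term is the complete information and the second is the missing information. The estimated variance-covariance matrix is the inverse of $I_{\mathrm{obs}}$ evaluated at the MLE.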
The details of the expression (8) are provided in Appendix 3.
4. Simulation Study
4.1. Data Simulation
We considered subjects who have follow-up visits each. We generated covariates as follows:
represents a continuous covariate with larger values and faster growth rate over time, while represents one with smaller values and slower growth rate over time.
First, we generate subjects who have complete follow-up visits. This makes the original complete data (OC), pooled repeated data. We randomly choose subjects out of these. This makes the exact data (E), a proper subset of the OC. For the remaining subjects, we randomly designate some of their follow-up visits missing. This makes the pooled repeated interval-censored observations. The observed data (O), which is the pooled repeated partly interval-censored data, is the mix of pooled repeated data (E) and pooled repeated interval-censored data. We considered several values for and to cover different proportions of exact data.
We randomly sampled and for each patient. Note that for the exact data, we have and for the pooled repeated interval-censored data, . Then for , we have and for , we have . is missing for in the pooled repeated interval-censored data. is 1 when the ith subject at risk at the jth visit experiences the event of interest in the jth interval.
We computed the bias and variance for original complete data, exact data, and observed data based on replications. In addition, we investigated the power of our test.
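The replication summaries can be sketched as follows, assuming each replication yields a point estimate and a standard error and that the test is a two-sided Wald test at the 5% level. The function name `summarize` and its arguments are hypothetical, and the input values are made up purely to show the arithmetic.

```python
from statistics import mean, variance
from typing import Dict, Sequence


def summarize(
    estimates: Sequence[float],
    std_errors: Sequence[float],
    true_value: float,
    null_value: float = 0.0,
    z_crit: float = 1.959964,
) -> Dict[str, float]:
    """Empirical bias, variance, 95% coverage and power over replications."""
    pairs = list(zip(estimates, std_errors))
    # Coverage: the Wald interval contains the true parameter value.
    covered = sum(abs(b - true_value) <= z_crit * s for b, s in pairs)
    # Power: the Wald statistic against the null value exceeds the cutoff.
    rejected = sum(abs(b - null_value) / s > z_crit for b, s in pairs)
    return {
        "bias": mean(estimates) - true_value,
        "variance": variance(estimates),
        "coverage": covered / len(pairs),
        "power": rejected / len(pairs),
    }


# Three made-up replications with true parameter value 0.5.
result = summarize([0.4, 0.5, 0.6], [0.1, 0.1, 0.1], true_value=0.5)
```

Here `variance` is the sample variance of the estimates across replications, which is the quantity compared with the model-based variance estimates.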
We first considered the case where there was only one attribute in the model. The EM algorithm (Section 3.1) was used for the parameter estimation. The variance of the parameter estimator was calculated using Louis’ method (Section 3.2).
The results are shown in Table 1. For all the different combinations of and , the proposed estimator based on the observed data produces a smaller bias and a smaller variance than that based on the exact data alone. In particular, for the case of (250, 50), in which E makes up approximately 83% (250) of the subjects and the pooled repeated interval-censored data only about 17% (50), the proposed estimator produces a smaller bias and a smaller variance than that based on E alone. We also notice that the more exact data we have, the smaller the bias and variance become. These results follow a pattern quite similar to those in Kim (2003), who employed a proportional hazards model with partly interval-censored data. D’Agostino et al. (1990) note that pooled repeated observations logistic regression is close to the time-dependent covariate Cox regression analysis. Therefore, this simulation result coincides with what we expected. In order to see whether the bootstrap would help, we also ran simulations with various pairs of and to compare the bootstrap variance with the variances for O, E, and OC. We considered two covariates; one is continuous and the other is ordinal. Table 2 shows the results. For all pairs of and , the bootstrap variance for O is smaller than that for OC, which is supposed to be the smallest. That is, bootstrapping suffers from substantial underestimation. Therefore, we do not recommend it for this setting. Another issue is that it is time-consuming.
Next, we computed the power of the test vs. . We considered both a one-dimensional covariate and two-dimensional covariates. We considered 3 sample sizes (100, 200, and 300), and for each of these sample sizes we ran replications of the test. The power was calculated as the proportion of times the null hypothesis was rejected at the 5% level of significance. Figure 1 and Figure 2 show the powers for different values of and different sample sizes. The power curves are symmetric for all the different sample sizes. As the sample size increases or the parameter values move farther from the true parameter value (i.e., the effect size increases), the corresponding power increases. From Figure 1, with a sample of size 300, one can achieve 80% power for the effect size of 0.45. Moreover, for the effect size of 0.55, a sample of size 200 is enough to achieve 80% power. Mongoué-Tchokoté and Kim (2008) achieved approximately 80% power in detecting the effect size of 0.75 for the proportional hazards model with a sample of size 300 using current status data. Considering that pooled repeated partly interval-censored data carry more information than current status data, this better power is what we would expect. The 95% coverage probabilities for different proportions of pooled repeated partly interval-censored data are shown in Table 3.
Table 1. Results for 1-dimensional , , B: Bias, : variance.
Table 2. Estimated variance, boot: bootstrap.
Figure 1. Power of the test for one-dimensional .
Figure 2. Power of the test for multidimensional .
Table 3. The 95% coverage probabilities.
In summary, even a small amount of pooled repeated interval-censored data within O does make our statistical inference more accurate and more powerful.
5. Analysis of NLSY97 Data
For more than four decades, the National Longitudinal Surveys (NLS) data have served as an important tool for economists, sociologists, and other researchers. The NLSY97 is a nationally representative sample of approximately 9000 youths who were 12 to 16 years old as of December 31, 1996. The NLSY97 is designed to document the transition from school to work and into adulthood. It collects extensive information about youths’ labor market behavior and educational experiences over time. In addition to educational and labor market experiences, the NLSY97 contains detailed information on many other topics, including criminal behavior and alcohol and drug use. To illustrate our methods, we use the NLSY97 data from 1997 to 2013 (Bureau of Labor Statistics, 2015). We illustrate how to analyze the effects of covariates that may affect an adolescent’s smoking behavior.
There are 8984 subjects in the data set. We analyze the 1822 subjects who did not smoke at the beginning of the study in 1997 but had become frequent smokers (smoking on more than 10 days in a month) by the end of 2013. These subjects constitute O. The response variable is defined as
Exact observations (E) are available for approximately 87.5% of the subjects analyzed. The first covariate, , is the number of days an individual drank alcohol in the last 30 days. The second covariate, , is an individual’s self-evaluation of “general state of health”. is defined as: 1 = excellent, 2 = very good, 3 = good, 4 = fair, and 5 = poor. The covariate effects are estimated by the EM algorithm in Section 3.1. The standard errors of these estimators are computed by Louis’ method in Section 3.2. The results are shown in Table 4. Holding an individual’s self-evaluated health level fixed, each additional day of drinking during the past 30 days increases the log odds of becoming a frequent smoker by 0.1 (s.e. = 0.002). Furthermore, holding an individual’s number of drinking days fixed, a one-unit rise in the health level (i.e., worse health) increases the log odds of becoming a frequent smoker by 0.19 (s.e. = 0.015).
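The reported log odds can be translated into odds ratios and approximate Wald statistics with a short calculation. The sketch below uses only the rounded estimates and standard errors quoted above, so the resulting values are approximate; the dictionary keys are hypothetical labels for the two covariates.

```python
import math

# (estimate, standard error) as quoted in the text, rounded.
estimates = {
    "alcohol_days": (0.10, 0.002),
    "health_rating": (0.19, 0.015),
}

summary = {}
for name, (b, se) in estimates.items():
    summary[name] = {
        "odds_ratio": math.exp(b),  # multiplicative change in the odds
        "wald_z": b / se,           # estimate divided by its standard error
    }
```

Under these rounded inputs, one more drinking day multiplies the odds of becoming a frequent smoker by roughly 1.11, and a one-unit worsening of self-rated health multiplies them by roughly 1.21.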
Additionally, we analyzed only E from O in order to see how much even a small amount of pooled repeated interval-censored data can help make the analysis more accurate. Another rationale is that some practitioners analyze only E because suitable software is unavailable. The results are shown in Table 5. The parameter estimates are very close to those from O. However, the estimated standard errors are much larger than those from O. This is consistent with the simulation results in Section 4. The Wald test statistic for testing is quite large for both E alone and O. Therefore, the p-values are nearly 0. Though both tests tell us that the covariates have a statistically significant effect on adolescents’ smoking behavior, O provides us with much stronger evidence for the effect. Therefore, this data analysis reaffirms that even a small pooled repeated interval-censored portion of O increases the sensitivity of the test.
Table 4. The results of NLSY97 analysis using the observed data.
Table 5. The results of NLSY97 analysis using only the exact data.
We focused on developing a method to estimate the regression parameters and the variance-covariance matrix of those estimators for the pooled repeated partly interval-censored data logistic regression model. We employed the EM algorithm to estimate the parameters and the missing information principle to estimate the variance-covariance matrix of those estimators.
Monte Carlo simulation demonstrates acceptable levels of bias, standard error, and power. To our knowledge, this is the first extensive power study for the pooled repeated partly interval-censored data logistic regression model. The simulation results suggest that in practice, one needs a sample of size around 300 to achieve 80% power to detect a very small effect size (0.45) for the regression parameter of interest. However, one needs a much smaller sample, only around 200, for a slightly larger effect size (0.55).
There are several potential extensions of our methods. Our methods can also be used when the predetermined follow-up visits are person-dependent. Our methods can be extended to handle correlated covariates by employing a ridge regression model (  ), variable selection by lasso regression (Tibshirani, 1996), and multiple progression states, because the likelihood factors into a distinct term for each interval (  ).
Last but not least, we note that there are challenges in including either left-censoring or right-censoring. Refer to Appendix 1 for details.
The authors thank Dr. Alexis Dinno for introducing them to the data.
Appendix 1. Right and Left Censoring in the Model
In some special cases, the visiting time of some subjects in the data may have either right or left censoring. If a subject has not failed at the last visit ( ) and does not come back for the subsequent interview visits, then the subject’s time to the event of interest is right-censored. In this case and . As NLSY predetermined M for all subjects, M plays the role of .
One may want to impute the covariate, and response, according to the procedures in Section 3.1. Unfortunately, extrapolating the covariates for using the linear growth curve in Section 3.1 may well increase bias and variance.
If a subject’s first visit is at time k and the subject shows the symptoms of the event of interest, then both and are missing for , and . Therefore, the covariate, and response, should be estimated for at the E-step. We merely have , , and two observed covariate values and . Therefore, we cannot fit the subject-dependent growth curve to estimate the covariates at the missed visits.
In summary, there is no merit in including individuals whose event times are either left-censored or right-censored when fitting a logistic regression model with pooled repeated observations.
Appendix 2. Imputation of Covariates
In Section 3.1, we assumed that covariates have a linear growth curve with fixed effects. This was motivated by the NLSY97 data. In NLSY97, follow-up interviews were relatively far apart (1 year). Additionally, some individuals had no change in their covariate values, e.g., some individuals reported no drinking throughout the study. This motivated us to assume that for a given individual, the covariate values are uncorrelated at different follow-up visits, i.e., .
If the follow-up time intervals are relatively short and there are no constant covariate values for any individual over time, one may adopt a linear growth curve with fixed effects and autocorrelated errors. That is,
where is an autoregressive process with lag 1, AR (1), , .
 and  assumed random effects. In a linear growth curve with random effects, all subjects have the same growth curve distribution, which depends on time points and it is correlated within the same subject. The least squares estimators for this model are the same for all subjects. For example, assume that we get and for a random effect model. If two subjects and have missing covariates at a given time point j, then they will have the same estimated covariate for . This may cause a substantial amount of bias.
Appendix 3. Formulas for Computing the Variance in Section 3.2
The variance estimation is based on , the observed binary outcomes for subject i and the variability of missing response, conditioned on , where . Let , and . Then the complete information matrix in (8) can be computed by
The missing information in (8) is computed by Monte Carlo simulations.
 Gao, F., Zeng, D.L. and Lin, D.Y. (2017) Semiparametric Estimation of the Accelerated Failure Time Model with Partly Interval-Censored Data. Biometrics, 73, 1161-1168.
 Zhao, X.Q., Zhao, Q., Sun, J.G. and Kim, J.S. (2008) Generalized Log-Rank Tests for Partly Interval-Censored Failure Time Data. Biometrical Journal, 50, 375-385.
 Kim, J.S. (2003) Maximum Likelihood Estimation for the Proportional Hazards Model with Partly Interval-Censored Data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65, 489-502.
 Cupples, L.A., D’Agostino, R.B., Anderson, K. and Kannel, W.B. (1988) Comparison of Baseline and Repeated Measure Covariate Techniques in the Framingham Heart Study. Statistics in Medicine, 7, 205-218.
 D’Agostino, R., Lee, M.L., Belanger, A., Cupples, L.A., Anderson, K. and Kannel, W.B. (1990) Relation of Pooled Logistic Regression to Time Dependent Cox Regression Analysis: The Framingham Heart Study. Statistics in Medicine, 9, 1501-1515.
 Finkelstein, D.M., Wang, R., Ficociello, L.H. and Schoenfeld, D.A. (2010) A Score Test for Association of a Longitudinal Marker and an Event with Missing Data. Biometrics, 66, 726-732.
 Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39, 1-22.
 Masyn, K.E., Petras, H. and Liu, W. (2014) Growth Curve Models with Categorical Outcomes. In: Bruinsma, G. and Weisburd, D., Eds., Encyclopedia of Criminology and Criminal Justice, Springer, New York, 2013-2025.
 Mongoué-Tchokoté, S. and Kim, J.S. (2008) New Statistical Software for the Proportional Hazards Model with Current Status Data. Computational Statistics and Data Analysis, 52, 4272-4286.
 Bureau of Labor Statistics, U.S. Department of Labor (2015) National Longitudinal Survey of Youth 1997 Cohort, 1997-2013 (Rounds 1-16) Produced by the National Opinion Research Center, the University of Chicago and Distributed by the Center for Human Resource Research, The Ohio State University. Columbus.
 Tibshirani, R. (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, 267-288.