In survival analysis, it is often of interest to explore the relationship between the failure time and a collection of covariates. For this purpose, a large number of semiparametric regression models and estimation methods have been developed. Among them, the Cox proportional hazards (PH) model may be the most popular and widely used statistical tool for analyzing survival data, partly due to the efficient inference based on the partial likelihood and the availability of implementations in almost all existing software packages. However, the PH model requires that the hazard ratio between any two subjects with distinct covariates be constant over time. In some situations, this assumption is rather restrictive and hard to meet; see . Thus, alternatives to the PH model that allow non-proportional hazards or more flexible modelling of covariate effects are desirable.
The AFT model assumes that the logarithm of the survival time is linearly related to a vector of covariates of interest, which can be specified as

log T = β^T Z + ε, (1.1)
where T denotes the failure time; β is a p-vector of regression coefficients to be estimated; Z is a p-vector of covariates; ε is the random error with mean zero but unknown distribution; and the superscript "T" denotes the transpose of a column vector. In the presence of right-censored data, several semiparametric estimators have been proposed, such as the least squares estimator in  and the rank-based estimator in . Nevertheless, these estimators have not been used as widely in practice as they should be, owing to the lack of efficient and reliable computational algorithms for obtaining the parameter estimates and their standard errors. In particular, for the rank-based estimator, the computational challenge arises from two aspects: the estimating function used is non-smooth, i.e. a step function with respect to β; and the asymptotic slope matrix of the estimating function depends on the unknown hazard function of the error and its first derivative, which makes direct estimation from the observed data impossible. Thus, most existing covariance matrix estimation methods rely on the bootstrap approach, which is computationally demanding. The authors of  provided the first reliable and accurate estimation procedure, via linear programming (LP), for the Gehan estimator, a special case of the general weighted logrank estimator, which is implemented in the R package "lss". However, the merit of their LP strategy is greatly diminished for large, or even moderate, sample sizes .
In view of these limitations, more computationally efficient procedures for the Gehan estimator have been introduced by . Specifically, Brown and Wang developed a pseudo-Bayesian approach in  to derive a smoothed version that is asymptotically equivalent to the original discontinuous Gehan estimating function. They called this technique the induced smoothing (IS) method. A significant advantage is that the smoothed estimating function is differentiable, so that one can estimate the regression parameters and the covariance matrix simultaneously with common numerical methods, avoiding computationally intensive resampling. Later on, the theoretical justification for the IS method was provided by , who also extended the method to clustered failure time data. On the other hand, Heller  considered directly approximating the indicator function in the Gehan estimating function with a known distribution function, and showed that the resulting estimator converges in distribution to a normal random vector with mean zero and a covariance matrix that can be estimated straightforwardly. A detailed review is available in , which also suggested a polynomial-based smoothing method with better-behaved performance than the two counterparts mentioned above. This technique will also be applied to the problem considered in this paper.
Although the AFT model is useful, the assumption that each covariate has a linear effect on the log survival time is not appropriate in some situations. For example, in many clinical trials and biomedical studies, one is primarily concerned with identifying the effect of a treatment in the presence of a confounding factor of less interest. In such cases, it is reasonable and useful to treat the confounding factor as a nonparametric component without losing the easy interpretation of the treatment effect . In the literature, one usually characterizes these covariate effects through a model referred to as the partial linear (PL) model, which can be written as

log T = β^T Z + h(U) + ε, (1.2)
where U is a univariate covariate such as the confounding factor, h is an unknown smooth function playing the role of the nonparametric component, and the other notation is defined as above. In the linear regression setting, when the response variable T is completely observed, many researchers have studied the PL model; see . However, to the best of our knowledge, there has been little investigation of inference for model (1.2) with right-censored data, apart from a few authors; in this context, model (1.2) is also called the semiparametric accelerated failure time partial linear model (AFT-PLM). Orbe in  adapted 's method and proposed a penalized weighted least squares method with the unknown function h approximated by cubic splines, but the statistical properties of the resulting estimators are not well established (page 112 of ). Chen in  developed a strategy to eliminate the function h by a proper stratification, and thus proposed an estimation method that is a Gehan-type extension of the Wilcoxon-Mann-Whitney estimating function. They proved that the resulting rank estimator is consistent and asymptotically normal. However, due to the stratification, the nonparametric component is unlikely to be estimated appropriately. Recently, Zou in  incorporated a penalized spline into the Gehan-type estimating function arising in the rank-based inference for the AFT model and obtained estimates of the regression coefficients and the nonparametric component simultaneously. Nevertheless, all current methods either are computationally intensive, which is more severe for large sample sizes (>400) because they all rely on the bootstrap technique to estimate the covariance matrix, or fail to provide an estimate of the nonparametric component when the effect of U is also of interest. In particular, with regard to 's procedure, the estimating function is non-smooth and may bring numerical difficulties when more covariates are incorporated into model (1.2).
In view of the advances in dealing with non-smooth estimating functions summarized above, it is possible to develop a smoothed estimation method for the semiparametric AFT-PLM, and in this paper we consider this issue. Once the smoothed estimating function is derived, under certain regularity conditions the common inference techniques can be applied. Therefore, one can consistently estimate the covariance matrix by the plug-in rule without resorting to time-consuming resampling methods.
The rest of this paper is organized as follows. In Section 2, we introduce some notation and assumptions, and derive a smoothed Gehan estimating function through a polynomial-based smoothing method, given that the nonparametric component h in (1.2) is parametrically approximated by cubic B-splines. Under some regularity conditions, the resulting estimator is shown to be consistent and asymptotically normal. In particular, the asymptotic covariance matrix of the estimators of the regression coefficients can be estimated straightforwardly, avoiding the substantial computation needed by the resampling approach. In addition, by virtue of the fact that the smoothed Gehan estimating function can be written as the gradient of a smooth convex loss function, we apply the limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm to solve the numerical problem. The procedure is implemented in our simulation studies presented in Section 3. The results show that our method performs quite well, both for the estimation of the regression coefficients and for the unknown function h. Section 4 presents the results of the application to two real data sets and, finally, some discussions in Section 5 conclude the paper.
2.1. Notation, Model and Assumptions
Suppose that there are n independent subjects under study. For the ith subject, i = 1, …, n, let T_i denote the failure time and Z_i be the p-vector of covariates. In addition, one auxiliary covariate U_i is also measured. Since the failure time is subject to right censoring, we instead observe the i.i.d. vectors (X_i, δ_i, Z_i, U_i), where X_i = min(T_i, C_i) and δ_i = I(T_i ≤ C_i) is the censoring indicator, taking the value 1 if the failure time is observed and 0 otherwise. Here C_i denotes the right-censoring time. In this paper, we assume that (T_i, Z_i, U_i) satisfy the AFT-PLM defined in (1.2). As in most cases, we also assume that T and C are independent given (Z, U). For technical reasons, we assume that the covariates Z and U have bounded supports, and without loss of generality we take the support of U to be the unit interval [0, 1]. Furthermore, we also adopt an additional assumption that is required in the context of .
2.2. Estimation Procedures
If the nonparametric component h were known or fully parametric, the statistical inference problem would readily reduce to the usual rank-based problem for the AFT model, and under suitable regularity conditions it can be shown that the resulting estimator is consistent and asymptotically normal. However, as argued in the introduction, the effect of U on the survival time is not certain, and a misspecified form of h will lead to biased conclusions. To attain flexible modelling and reliable results, it is desirable not to impose an explicit form on h. Yet, owing to the incorporation of the unknown function h, the commonly used rank-based inference approach cannot be applied directly, because doing so may suffer from the so-called "curse of dimensionality".
A simple but useful method is to approximate the unknown function h by a spline. In the survival literature, the use of splines is common; see , among others. More details on splines can be found in . In this paper, we assume that the smooth function h can be expressed as a linear combination of B-splines, i.e.

h(u) = Σ_{k=1}^{K} γ_k B_k(u), (2.2)

where B_k(u), k = 1, …, K, are the B-spline basis functions of degree d associated with a sequence of knots. Let γ = (γ_1, …, γ_K)^T and B(u) = (B_1(u), …, B_K(u))^T. Then one can write the expression (2.2) as h(u) = γ^T B(u).
In our numerical studies and real data analysis, the cubic B-splines, i.e. d = 3, are used in the basis expansion of h. Generally, 3 - 10 internal knots are adequate in practice . In our implementation, we choose 3, 5, and 7 internal knots, respectively. As demonstrated in the following simulation studies, this strategy is appropriate and the results obtained are not sensitive to the choice among these numbers of internal knots. In addition, once the number is given, we place the knots equally spaced between the smallest and largest values of the U_i's.
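For concreteness, the clamped cubic B-spline design matrix with equally spaced internal knots can be sketched as follows. This is an illustrative Python/NumPy sketch (the paper's own implementation is in Matlab); the default of 5 internal knots and the use of the observed covariate range are assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(u, n_internal=5, degree=3):
    """Design matrix of clamped B-spline basis functions evaluated at u.

    With a clamped knot vector (boundary knots repeated degree+1 times),
    the number of basis functions is n_internal + degree + 1."""
    lo, hi = float(np.min(u)), float(np.max(u))
    internal = np.linspace(lo, hi, n_internal + 2)[1:-1]   # equally spaced
    knots = np.r_[[lo] * (degree + 1), internal, [hi] * (degree + 1)]
    n_basis = len(knots) - degree - 1
    B = np.zeros((len(u), n_basis))
    for k in range(n_basis):
        coef = np.zeros(n_basis)
        coef[k] = 1.0                       # isolate the kth basis function
        B[:, k] = BSpline(knots, coef, degree, extrapolate=False)(u)
    return np.nan_to_num(B)                 # zero outside [lo, hi]
```

Each row of the returned matrix plays the role of B(U_i)^T in the expansion h(U_i) = γ^T B(U_i); the basis functions are nonnegative and sum to one on the support.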
By virtue of the expansion of h defined in (2.2), the AFT-PLM in (1.2) can be rewritten as

log T_i = β^T Z_i + γ^T B(U_i) + ε_i, (2.3)
where θ = (β^T, γ^T)^T and W_i = (Z_i^T, B(U_i)^T)^T. Consequently, by applying the weighted logrank estimation method in  along with the Gehan weight function, we have the following estimating function

U_n(θ) = n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} δ_i (W_i − W_j) I{e_i(θ) ≤ e_j(θ)}, (2.4)

where e_i(θ) = log X_i − θ^T W_i, which is often referred to as the Gehan estimating function. On the other hand, the Gehan estimating function in (2.4) is the gradient of the following convex loss function

L_n(θ) = n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} δ_i {e_i(θ) − e_j(θ)}^{-}, (2.5)

with a^{-} = |a| I(a < 0),
which is called the Gehan loss function in . Naturally, one can define the Gehan estimator of θ as the minimizer of the objective function L_n(θ), and the optimization problem can be solved by linear programming as in  for small sample sizes. To derive the large sample properties of the proposed estimators, we assume that the smooth function h is a spline with pre-specified knots. This is due mainly to computational and theoretical considerations; the same idea is also employed in the problems investigated by , among others. Under regularity conditions C1 - C4 described in , it can be shown that the resulting estimator of θ is consistent and asymptotically normal with mean zero and a covariance matrix that cannot be estimated directly. Thus, to make inference for θ, one has to resort to the resampling approach, which is computationally intensive, especially for large, or even moderate, sample sizes.
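The pairwise structure of the Gehan loss and its gradient can be sketched in NumPy as follows. This is an illustrative sketch (not the paper's Matlab code) under the assumed displays above: W_i stacks the covariates Z_i and the spline basis B(U_i), and the gradient is the Gehan-type estimating function.

```python
import numpy as np

def gehan_loss(theta, logX, delta, W):
    """Non-smooth Gehan loss L_n(theta) = n^{-2} sum_{i,j} delta_i max(e_j - e_i, 0),
    with residuals e_i = log X_i - theta' W_i."""
    n = len(logX)
    e = logX - W @ theta
    diff = e[None, :] - e[:, None]            # diff[i, j] = e_j - e_i
    return float((delta[:, None] * np.maximum(diff, 0.0)).sum()) / n**2

def gehan_score(theta, logX, delta, W):
    """Gradient of the loss (the Gehan estimating function):
    n^{-2} sum_{i,j} delta_i (W_i - W_j) 1{e_i <= e_j}."""
    n = len(logX)
    e = logX - W @ theta
    ind = delta[:, None] * (e[None, :] - e[:, None] > 0.0)
    return (ind.sum(axis=1) - ind.sum(axis=0)) @ W / n**2
```

Because each pairwise term is a hinge of a linear function of θ, the loss is convex but only piecewise linear, which is exactly the non-smoothness that the next subsection removes.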
Note that the challenge encountered in the current setting also arises in the rank-based inference problem for the AFT model with censored data. As reviewed in the introduction, these difficulties stem from the non-smoothness of the Gehan estimating function. Building on the recent advances in smoothed rank-based methods, we can develop an easily implemented estimation method for both the regression coefficients and the possibly nonlinear function h.
Define the following smoothed approximation to the Gehan loss function in (2.5),

L_{n,ε}(θ) = n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} δ_i ρ_ε(e_j(θ) − e_i(θ)), (2.6)

where ρ_ε is a sufficiently smooth real-valued function with sufficiently small but strictly positive ε. Clearly, ρ_ε is identical to the positive-part function x ↦ max(x, 0) on the entire real line outside the interval [−ε, ε], and replaces it by a polynomial inside that interval. Through a simple calculation, it can be seen that ρ_ε has a continuous second-order derivative for any ε > 0 and, in particular, is convex. We then define the estimator θ̂ of θ as the minimizer of the smooth objective function in (2.6), i.e.

θ̂ = arg min_θ L_{n,ε}(θ).

Then h can be estimated by ĥ(u) = γ̂^T B(u), where γ̂ is the corresponding subvector of θ̂. In fact, if the minimizer exists, it is also a solution to the estimating equation U_{n,ε}(θ) = 0, where

U_{n,ε}(θ) = n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} δ_i ρ_ε′(e_j(θ) − e_i(θ)) (W_i − W_j), (2.7)

and this is exactly the smoothed version of the Gehan estimating function in (2.4).
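One concrete ρ_ε meeting the stated requirements (identical to max(x, 0) outside [−ε, ε], polynomial inside, continuous second derivative, convex) is the quartic below. This is an assumed illustrative choice, since the paper's exact polynomial is elided from the source.

```python
import numpy as np

def rho(x, eps=1e-4):
    """A C^2 polynomial smoothing of the positive part max(x, 0):
    identical to it outside [-eps, eps]; inside, the quartic
        p(x) = (x+eps)^3/(4 eps^2) - (x+eps)^4/(16 eps^3),
    whose value and first two derivatives match max(x, 0) at +/- eps."""
    x = np.asarray(x, dtype=float)
    t = x + eps
    mid = t**3 / (4 * eps**2) - t**4 / (16 * eps**3)
    return np.where(x <= -eps, 0.0, np.where(x >= eps, x, mid))

def rho_prime(x, eps=1e-4):
    """First derivative of rho; continuous and contained in [0, 1]."""
    x = np.asarray(x, dtype=float)
    t = x + eps
    mid = 3 * t**2 / (4 * eps**2) - t**3 / (4 * eps**3)
    return np.where(x <= -eps, 0.0, np.where(x >= eps, 1.0, mid))
```

The matching conditions p(−ε) = p′(−ε) = p″(−ε) = 0 and p(ε) = ε, p′(ε) = 1, p″(ε) = 0 determine the two coefficients, so ρ_ε has a continuous second derivative and is convex on the whole line.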
Remark 1. In the asymptotic analysis, the tuning parameter ε should decrease as the sample size n increases. In this paper, we set ε to 10^{-4} when it is used. This idea is common in statistical computing and is adopted by , among others.
Remark 2. As an alternative, we propose to carry out the computation by quasi-Newton methods, which avoid calculating the Hessian matrix. This advantage is more apparent when the dimension of the parameter to be estimated is high or the problem is ill-posed. Specifically, we recommend the limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method in . We terminate the iteration when the relative tolerance is smaller than 10^{-4} in our implementation. As we will see, this procedure works well.
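A minimal sketch of the fitting step: minimize the smoothed Gehan loss with scipy's L-BFGS-B, supplying the analytic gradient. The data-generating design, the least-squares initial value, and the larger smoothing parameter ε = 0.05 (chosen so the toy gradient stays informative; the paper uses 10^{-4}) are illustrative assumptions, and the quartic smoothing of the hinge is one concrete C^2 choice.

```python
import numpy as np
from scipy.optimize import minimize

def smoothed_loss_grad(theta, logX, delta, W, eps=0.05):
    """Smoothed Gehan loss and its analytic gradient, using a C^2
    quartic smoothing of max(x, 0) on [-eps, eps] (assumed form)."""
    n = len(logX)
    e = logX - W @ theta
    d = e[None, :] - e[:, None]                  # d[i, j] = e_j - e_i
    t = d + eps
    val = np.where(d <= -eps, 0.0,
          np.where(d >= eps, d, t**3 / (4*eps**2) - t**4 / (16*eps**3)))
    dp = np.where(d <= -eps, 0.0,
         np.where(d >= eps, 1.0, 3*t**2 / (4*eps**2) - t**3 / (4*eps**3)))
    M = delta[:, None] * dp
    loss = float((delta[:, None] * val).sum()) / n**2
    grad = (M.sum(axis=1) - M.sum(axis=0)) @ W / n**2
    return loss, grad

# toy right-censored AFT data (assumed design; no nonparametric part)
rng = np.random.default_rng(2)
n, beta_true = 300, np.array([1.0, -0.5])
W = rng.normal(size=(n, 2))
logT = W @ beta_true + rng.normal(scale=0.3, size=n)
logC = np.log(rng.exponential(8.0, size=n))      # censoring times
logX = np.minimum(logT, logC)
delta = (logT <= logC).astype(float)

theta0, *_ = np.linalg.lstsq(W, logX, rcond=None)   # crude initial value
res = minimize(smoothed_loss_grad, theta0, args=(logX, delta, W),
               jac=True, method="L-BFGS-B")
```

Because the smoothed loss is convex and differentiable, any reasonable initial value works; the least-squares start simply shortens the iteration.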
Under the aforementioned conditions and assumptions A1-A3 in , and in line with the arguments in the Appendix of , it can be shown that the smoothed estimating function defined in (2.7) is asymptotically equivalent to the non-smooth one in (2.4). Furthermore, applying the arguments in the Appendix of , we can show that θ̂ is consistent and converges in distribution to a normal vector with mean zero and a sandwich-form covariance matrix, where the two constituent matrices are the asymptotic slope of the estimating function and the asymptotic variance of the estimating function at the true value θ_0 of θ. Both can be consistently estimated by their empirical plug-in counterparts. With regard to the second matrix, we obtain it by virtue of the asymptotic equivalence between the smoothed and non-smoothed estimating functions; alternatively, one can derive it directly in terms of the smoothed estimating function, which is computationally convenient compared with the non-smoothed one. Therefore, one can make inference for θ via the estimated covariance matrix. This is implemented in our simulation studies and real data analysis.
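A schematic of the plug-in sandwich covariance: the outer matrix is a numerical derivative of the smoothed estimating function (the Hessian of the smoothed loss), and the inner matrix is the empirical covariance of per-subject projections of the pairwise score terms. The specific estimators below are standard choices assumed for illustration, since the paper's own displays are elided from the source.

```python
import numpy as np

def sandwich_cov(theta_hat, logX, delta, W, eps=0.05):
    """Plug-in sandwich covariance A^{-1} V A^{-T} / n (schematic).

    A: numerical Jacobian at theta_hat of the smoothed estimating function;
    V: empirical covariance of the projections psi_i of the pairwise
       score terms u_ij = delta_i rho'(e_j - e_i) (W_i - W_j)."""
    n, p = W.shape

    def score(th):
        e = logX - W @ th
        d = e[None, :] - e[:, None]
        t = d + eps
        dp = np.where(d <= -eps, 0.0,
             np.where(d >= eps, 1.0, 3*t**2 / (4*eps**2) - t**3 / (4*eps**3)))
        M = delta[:, None] * dp
        return M, (M.sum(axis=1) - M.sum(axis=0)) @ W / n**2

    # outer matrix A by central finite differences
    h = 1e-5
    A = np.empty((p, p))
    for k in range(p):
        step = np.zeros(p)
        step[k] = h
        A[:, k] = (score(theta_hat + step)[1]
                   - score(theta_hat - step)[1]) / (2 * h)

    # inner matrix V from the projections psi_i = n^{-1} sum_j (u_ij + u_ji)
    M, _ = score(theta_hat)
    row, col = M.sum(axis=1), M.sum(axis=0)
    psi = ((row - col)[:, None] * W - M @ W + M.T @ W) / n
    V = np.cov(psi, rowvar=False)

    Ainv = np.linalg.inv(A)
    return Ainv @ V @ Ainv.T / n
```

The square roots of the diagonal give the standard error estimates used for Wald-type inference, with no resampling involved.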
3. Numerical Studies
To assess the performance of our estimation method, we conduct an extensive simulation study under various scenarios. We independently generate the survival times from the following model
where , , , , and . In all simulations, the covariates and the error are generated independently. The censoring times are generated independently from an exponential distribution whose mean is varied to yield censoring rates of about 15%, 30%, and 50%, respectively. Two different functions for the nonparametric component h are considered. The first one, called Case I, is a sine curve with one peak and one valley in the domain [0, 1]. The second one, a logarithmic curve, is denoted as Case II. For the spline approximation to the nonparametric component, we select 3, 5, and 7 internal knots, respectively, and place them equally spaced between the smallest and largest observations of the U_i's. We set the tuning parameter ε to 10^{-4}. The sample sizes considered correspond to the small, middle, and large levels, respectively, as used in , and each scenario is examined with 1000 simulation runs. All simulations are implemented in Matlab, and the initial values are generated from a standard multivariate normal distribution. For the regression coefficient estimates, we record the empirical bias (Bias), the sample standard deviation (SD), the standard error estimate (SEE), and the empirical coverage probability of the 95% confidence interval. For the nonparametric part, we use the mean integrated squared error (IMSE),

IMSE = (1/N) Σ_{k=1}^{N} {ĥ(u_k) − h(u_k)}²,

averaged over the simulation runs, where the u_k are fixed grid points between zero and one with step 0.01 and N is the number of these grid points.
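The data-generating step above can be sketched as follows. The specific β, the covariate and error distributions, and the exact sine form of h are illustrative assumptions (the text describes Case I only as a sine curve with one peak and one valley on [0, 1]); the exponential censoring mean is tuned to hit a target censoring rate, as in the text.

```python
import numpy as np

def simulate_case1(n, cens_mean, rng):
    """One data set from the AFT-PLM log T = beta'Z + h(U) + err with
    h(u) = sin(2*pi*u); beta = (1, -1)', Z ~ N(0, I), U ~ Unif(0, 1)
    and N(0, 0.5^2) errors are assumed for illustration."""
    beta = np.array([1.0, -1.0])
    Z = rng.normal(size=(n, 2))
    U = rng.uniform(size=n)
    err = rng.normal(scale=0.5, size=n)
    T = np.exp(Z @ beta + np.sin(2 * np.pi * U) + err)
    C = rng.exponential(cens_mean, size=n)        # censoring times
    X = np.minimum(T, C)
    delta = (T <= C).astype(int)
    return np.log(X), delta, Z, U

def tune_cens_mean(target, n=20000, rng=None):
    """Bisect (on the log scale) the exponential mean so that the
    simulated censoring rate matches the target."""
    rng = rng or np.random.default_rng(0)
    lo, hi = 0.01, 1000.0                         # high-rate / low-rate ends
    for _ in range(40):
        mid = np.sqrt(lo * hi)
        rate = 1.0 - simulate_case1(n, mid, rng)[1].mean()
        lo, hi = (mid, hi) if rate > target else (lo, mid)
    return np.sqrt(lo * hi)
```

The tuning step mirrors the statement that the exponential means are "varied to yield" the 15%, 30%, and 50% censoring rates.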
Table 1 summarizes the results for Case I with 5 internal knots. It is seen that the proposed estimates are quite accurate. Moreover, the coverage probabilities are close to the nominal level 0.95. Remarkably, the results reported here are superior to those summarized in Table 1 of , based on the P-spline Gehan estimating function, where no smoothing is employed. It is worth noting that the sample standard deviation is comparable to the standard error estimate obtained from the sandwich estimator based on the smoothed estimating function, even for small sample sizes, implying that using our procedure to estimate the covariance matrix for statistical inference is appropriate while requiring little computational effort. For a fixed censoring rate, both the biases and the standard error estimates decrease as the sample size increases; for a fixed sample size, the biases and standard error estimates increase with the censoring rate. The same tendency is also reflected in the IMSE. Figure 1 shows the mean and median of the estimates of the function h from 1000 simulations with sample size 500, censoring rate 30%, and 5 equally spaced internal
Table 1. Case I (sine curve) model: results for the parameters with 5 internal knots and cubic B-splines.
Figure 1. The mean (dotted) and median (dash-dotted) of the estimated function and 95% pointwise Monte Carlo intervals (dotted) for the Case I model with varying numbers of internal knots.
knots; both are close to the true curve h. Also shown are the 95% pointwise Monte Carlo intervals, constructed from the 2.5% and 97.5% sample quantiles of the estimated functions. Similar phenomena also occur in the cases with the remaining censoring rates and numbers of internal knots (not shown). Therefore, even when only a few knots are used, the regression coefficient estimates and the estimated curve perform well.
In addition, as we have stressed, for a data set with sample size 1000 the proposed L-BFGS algorithm needs only about 20 seconds to complete the full inference, whereas, in our experiments, computing the parameter estimates and their standard errors by optimizing the
Table 2. Case II (logarithmic curve) model: results for the parameters with 3, 5, and 7 internal knots and cubic B-splines.
Figure 2. The mean (dotted) and median (dash-dotted) of the estimated function and 95% pointwise Monte Carlo intervals (dotted) for the Case II model with varying numbers of internal knots.
non-smooth objective function in (2.5) with 500 resamples requires about five hours. Clearly, the procedure proposed here is computationally far more efficient for practitioners.
4.1. Multiple Myeloma Data
The multiple myeloma data set is the primary example for the PHREG procedure and is available in the online SAS/STAT User's Guide; it was analyzed by  for the AFT model and by  for the AFT-PLM in (1.2). In the study, there are 65 patients in total, with 48 deaths and 17 censored. We consider the data consisting of the possibly censored survival times (T) and two covariates: logBUN, the logarithm of Blood Urea Nitrogen, and age. As pointed out by , the age variable may be a confounding factor, and it is treated as the nonparametric component in the following model,

log T = β logBUN + h(age) + ε.
For the internal knots, we choose 3 or 5 as the number of internal knots and place them equally spaced between the smallest and largest observations of the age variable. Applying the proposed method, for each choice we obtain the estimate of β together with its estimated standard error from the sandwich estimator described in Subsection 2.3. The two slope estimates are both negative and the corresponding estimated standard errors are rather small, implying that Blood Urea Nitrogen is negatively related to the log survival time. A similar conclusion is attained by , where the estimate of β is −1.955 with an estimated standard error of 0.807. Figure 3 displays the estimate of the nonparametric part; it visually supports treating the age variable as having a nonlinear effect on the log survival time.
4.2. Nursing Home Usage Data
The data are from an experiment sponsored by the National Center for Health Services Research in 1980-1982, designed to determine the effect of financial incentives on the variation of patient care in nursing homes, involving 36 for-profit nursing homes in San Diego, California. A full description of this data set is given in , and the data are available from . The response variable T is measured in days and the total sample size is 1061. In the model considered below, several covariates are incorporated: treatment, sex, marital status, three health status indicators (HSI), and age,
Figure 3. The estimated nonparametric part in the analysis of the multiple myeloma data, where L denotes the number of internal knots.
where HSI1-HSI3 are three binary health status indicators ranging from the best health to the worst. When analyzing the data set, we discard ten observations for which the observed T is zero, and then utilize the remaining 1051 records in our analysis. The coefficient estimates are presented in Table 3, and Figure 4 reports the nonparametric part. It seems plausible that the age variable has a linear effect on the survival time; the drastic fluctuation is possibly a result of there being few observations of age available in the right tail.  analyzed these data using the AFT model and found that the age variable is not statistically significant, which agrees with our results.
The accelerated failure time partial linear model (AFT-PLM) is a natural extension of the classic AFT model: it allows some covariates to be related to the log failure time in a nonlinear manner, and thus provides a more flexible and parsimonious way of modelling. In this paper, we employ cubic B-splines to approximate the nonparametric smooth function in model (1.2), which enables us to apply the efficient rank-based inference approach for the AFT model in our setting. Specifically, based on recent advances in dealing with non-smooth estimating functions, we propose a polynomial-based smoothed Gehan estimating function and show that the resulting estimators are consistent and asymptotically normal under certain regularity conditions. Utilizing the smoothed version and the fact that it is the gradient of a smooth and convex loss function, we solve the estimating equations with the
Table 3. Analysis of nursing home usage data.
Figure 4. The estimated nonparametric part in the analysis of the nursing home usage data, where L denotes the number of internal knots.
L-BFGS method. In addition, the other primary advantage of our smoothing proposal is that one can estimate the standard errors directly and efficiently through the sandwich-form covariance matrix, without resorting to computationally intensive resampling. As seen in the simulation studies, our method performs well for estimating both the regression coefficients and the nonparametric component.
Naturally, our method can be extended straightforwardly to cases with more than one covariate having a nonlinear effect. However, the number of parameters to be estimated then increases, especially when more internal knots are chosen to approximate the nonparametric components. In that case, it is necessary to incorporate a penalty term into the objective loss function defined in (2.6) and then proceed with inference. To estimate the joint asymptotic covariance matrix, the sandwich estimator may encounter numerical reliability problems due to the high dimension; thus, other efficient approaches should be developed.
Another possible extension of the AFT-PLM in (1.2) is the more general partial linear single-index AFT model, which is specified as

log T = β^T Z + h(α^T U) + ε,

where U is a vector of nuisance covariates, α is the corresponding unknown index vector, Z is a vector of covariates of primary interest, and h is an unknown univariate smooth function that plays the role of a link function. This model is well known for achieving dimension reduction and thereby avoiding the "curse of dimensionality". How to adapt our proposed method to this model is an interesting question that will be investigated in the future.
In this paper, we have assumed that the unknown function h is a spline with a fixed number of knots in establishing the asymptotic properties. Through the simulation studies, we find that a small number of knots is enough and that the bias caused by the spline approximation is small and does not appreciably affect the estimates of the regression coefficients. Without such an assumption, the number of knots should increase with the sample size, and developing asymptotic results in that setting is interesting but beyond the scope of this paper.
This work was supported by the Youth Foundation (ZJGQN004) of the School of Zhangjiagang, JUST.
 Chung, M., Long, Q. and Johnson, B.A. (2013) A Tutorial on Rank-Based Coefficient Estimation for Censored Data in Small- and Large-Scale Problems. Statistics and Computing, 23, 601-614. https://doi.org/10.1007/s11222-012-9333-9
 Johnson, L.M. and Strawderman, R.L. (2009) Induced Smoothing for the Semiparametric Accelerated Failure Time Model: Asymptotics and Extensions to Clustered data. Biometrika, 96, 577-590. https://doi.org/10.1093/biomet/asp025
 Zou, Y., Zhang, J. and Qin, G. (2011) A Semiparametric Accelerated Failure Time Partial Linear Model and Its Application to Breast Cancer. Computational Statistics and Data Analysis, 55, 1479-1487. https://doi.org/10.1016/j.csda.2010.10.012
 Liu, X., Wang, L. and Liang, H. (2011) Estimation and Variable Selection for Semiparametric Additive Partial Linear Models. Statistica Sinica, 21, 1225.
 Ma, S. and Kosorok, M.R. (2005) Penalized Log-Likelihood Estimation for Partly Linear Transformation Models with Current Status Data. The Annals of Statistics, 33, 2256-2290.
 Huang, J.Z. and Liu, L. (2006) Polynomial Spline Estimation and Inference of Proportional Hazards Regression Models with Flexible Relative Risk Form. Biometrics, 62, 793-802.
 Shang, S., Liu, M. and Zeleniuch-Jacquotte, A. (2013) Partially Linear Single Index Cox Regression Model in Nested Case-Control Studies. Computational Statistics and Data Analysis, 67, 199-212. https://doi.org/10.1016/j.csda.2013.05.011