Polynomial-Based Smoothing Estimation for a Semiparametric Accelerated Failure Time Partial Linear Model
Abstract: The accelerated failure time partial linear model allows the functional form of the covariate effects to be possibly nonlinear and unknown. We propose to approximate the nonparametric component by cubic B-splines and to construct a Gehan estimating function similar to that under the AFT model. Because its non-smoothness leads to computational challenges in estimating standard errors, we propose a polynomial-based smoothed Gehan estimating function and compute the parameter estimates using the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm. Asymptotic properties of the resulting estimators are established. The proposed method performs well in simulation studies and is applied to two real data sets.

1. Introduction

In survival analysis, it is often of interest to explore the relationship between the failure time and a collection of covariates. For this purpose, a large number of semiparametric regression models and estimation methods have been developed. Among them, the proportional hazards (PH) model of Cox [1] may be the most popular and widely used statistical tool for analyzing survival data, partly due to the efficient inference based on the partial likelihood and the availability of implementations in almost all existing software packages. However, the PH model requires the hazard ratio between any two subjects with distinct covariates to be constant over time. In some situations, this assumption is rather restrictive and hard to meet; see [2]. Thus, alternatives to the PH model with non-proportional hazards or more flexible modelling of covariate effects are desirable.

The AFT model assumes that the logarithm of survival time is linearly correlated to a vector of covariates of interest, which can be specified as

$\mathrm{log}\left(T\right)={\beta }^{\text{T}}Z+ϵ$, (1.1)

where T denotes the failure time; $\beta$ is a p-vector of regression coefficients to be estimated; Z is a p-vector of covariates; $ϵ$ is the random error with mean zero but unknown distribution; and the superscript “T” denotes the transpose of a column vector. In the presence of right censored data, several semiparametric estimators have been proposed, such as the least squares estimator in [3] [4] and the rank-based estimator in [5] [6] [7]. Nevertheless, these estimators have not been as widely used in practice as they should be, due to the lack of efficient and reliable computational algorithms for obtaining the parameter estimates and their standard errors. In particular, for the rank-based estimator, the computational challenge arises from two aspects: the estimating function is non-smooth, i.e. a step function with respect to $\beta$; and the asymptotic slope matrix of the estimating function depends on the unknown hazard function of the error and its first derivative, which makes direct estimation from the observed data infeasible. Thus, most existing covariance matrix estimation methods rely on the bootstrap approach, which is computationally demanding. The authors in [8] provided the first reliable and accurate estimating procedure, via linear programming (LP), to obtain the Gehan estimator, a special case of the general weighted logrank estimator, which is implemented in the R package “lss”. However, as noted in [9] [10], the merit of their LP strategy is greatly discounted for modest to large sample sizes.

In view of these limitations, more computationally efficient procedures for the Gehan estimator were introduced by [9] [11] [12] [13]. Specifically, Brown and Wang [12] applied the pseudo-Bayesian approach of [11] to derive a smoothed version that is asymptotically equivalent to the original discontinuous Gehan estimating function. They called this technique the induced smoothing (IS) method. A significant advantage is that the smoothed estimating function is differentiable, so one can estimate the regression parameters and the covariance matrix simultaneously with common numerical methods, avoiding computationally intensive resampling. Later, the theoretical justification for the IS method was provided by [14], who also extended the method to clustered failure time data. On the other hand, Heller [13] directly approximated the indicator function in the Gehan estimating function with a known distribution function and showed that the resulting estimator converges in distribution to a normal random vector with mean zero and a covariance matrix that can be estimated straightforwardly. A detailed review is available in [9], which also suggested a polynomial-based smoothing method that performs better than the two counterparts mentioned above. We apply this technique to the problem considered in this paper.

Although the AFT model is useful, the assumption that each covariate has a linear effect on the log survival time is not appropriate in some situations. For example, in many clinical trials and biomedical studies, one is primarily concerned with identifying the effect of a treatment when a confounding factor of less interest exists. In such cases, it is reasonable and useful to treat the confounding factor as a nonparametric component without losing the easy interpretation of the treatment effect [15]. In the literature, one usually characterizes these covariate effects through a model referred to as the partial linear (PL) model, which can be written as

$\mathrm{log}\left(T\right)={\beta }^{\text{T}}Z+h\left(U\right)+ϵ$, (1.2)

where U is a univariate covariate such as the confounding factor, $h\left(.\right)$ is an unknown smooth function playing the role of the nonparametric component, and the other notation is as defined above. In the linear regression setting, when the response variable T is completely observed, many researchers have studied the PL model; see [16]. In contrast, to the best of our knowledge, there has been little investigation of inference for model (1.2) with right censored data, where the model is also called the semiparametric accelerated failure time partial linear model (AFT-PLM), except by a few authors. Orbe [17] adapted the method of [18] and proposed a penalized weighted least squares method with the unknown function h approximated by cubic splines, but the statistical properties of the resulting estimators are not well established (page 112 of [17]). Chen [15] developed a strategy to eliminate the function $h\left(.\right)$ by a proper stratification and thereby proposed an estimation method that is a Gehan-type extension of the Wilcoxon-Mann-Whitney estimating function. They proved the resulting rank estimator to be consistent and asymptotically normal. However, due to the stratification, the nonparametric component is not likely to be estimated appropriately. Recently, Zou [19] incorporated a penalized spline into the Gehan-type estimating function used in the rank-based inference for the AFT model and obtained estimates of the regression coefficients and the nonparametric component simultaneously. Nevertheless, all current methods either are computationally intensive, which is more severe for large sample sizes (>400) because they rely on the bootstrap technique to estimate the covariance matrix, or fail to provide an estimate of the nonparametric component when the effect of U is also of interest.
In particular, with regard to the procedure of [19], the estimating function is non-smooth and may cause numerical difficulties when more covariates are incorporated into model (1.2). In view of the advances in handling non-smooth estimating functions summarized above, it is possible to develop a smoothed estimation method for the semiparametric AFT-PLM, and in this paper we consider this issue. Once the smoothed estimating function is derived, under certain regularity conditions the common inference techniques can be applied. Therefore, one can consistently estimate the covariance matrix by the plug-in rule without resorting to time-consuming resampling methods.

The rest of this paper is organized as follows. In Section 2, we introduce some notation and assumptions and derive a smoothed Gehan estimating function through a polynomial-based smoothing method, given that the nonparametric component h in (1.2) is parametrically approximated by cubic B-splines. Under some regularity conditions, the resulting estimator is shown to be consistent and asymptotically normal. In particular, the asymptotic covariance matrix of the regression coefficient estimators can be estimated straightforwardly, avoiding the substantial computation needed by the resampling approach. In addition, by virtue of the fact that the smoothed Gehan estimating function can be written as the gradient of a smooth convex loss function, we develop the limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm to solve the numerical problem. The procedure is implemented in our simulation studies presented in Section 3. The results show that our method performs quite well, both for the estimation of the regression coefficients and for the unknown function h. Section 4 presents the results of the application to two real data sets and, finally, some discussions in Section 5 conclude the paper.

2. Methods

2.1. Notation, Model and Assumptions

Suppose that there are n independent subjects under study. For the ith subject, $i=1,\cdots ,n$, let ${T}_{i}$ denote the failure time and ${Z}_{i}$ be the p-vector of covariates. In addition, one auxiliary covariate ${U}_{i}$ is also measured. Since the failure time is subject to right censoring, we instead observe the i.i.d. copies $\left({Y}_{i},{\delta }_{i},{Z}_{i},{U}_{i}\right)$ of $\left(Y,\delta ,Z,U\right)$, where $Y=\mathrm{min}\left(T,C\right)$ and $\delta =I\left(T\le C\right)$ is the censoring indicator, taking the value 1 if the failure time is observed and 0 otherwise. Here C denotes the right-censoring time. In this paper, we assume $\left({T}_{i},{Z}_{i},{U}_{i}\right)$ $\left(i=1,\cdots ,n\right)$ satisfy the AFT-PLM defined in (1.2). As in most cases, we also assume that T and C are independent given $\left(Z,U\right)$, and that $\left(Z,U\right)$ and $ϵ$ are independent. For technical reasons, we assume the covariates Z and U have bounded supports and, without loss of generality, take the support of U to be the unit interval [0, 1]. Furthermore, we assume $\mathbb{E}\left[h\left(U\right)\right]=0$, which is also required in [20] [21].

2.2. Estimation Procedures

If the nonparametric component h were known or fully parametric, the statistical inference problem would reduce to the usual rank-based problem for the AFT model. Under suitable regularity conditions, it can be shown that the resulting estimator is consistent and asymptotically normal. However, as argued in the introduction, the effect of U on the survival time is uncertain, and a misspecified form of h will lead to biased conclusions. To attain flexible modelling and reliable results, it is more desirable not to impose an explicit form on h. However, with the unknown function h incorporated, the commonly used rank-based inference approach cannot be applied directly, because doing so may suffer from the so-called “curse of dimensionality”.

A simple but useful method is to approximate the unknown function h by a spline. In the survival literature, the use of splines is common; see [22], among others. More details on splines can be found in [23]. In this paper, we assume that the smooth function $h\left(u\right)$ can be expressed in terms of B-splines, i.e.

$h\left(u\right)=\underset{l=-\varrho }{\overset{L}{\sum }}\text{ }\text{ }{\gamma }_{l}{B}_{l}\left(u\right),$ (2.1)

where ${B}_{l}\left(u\right),l=-\varrho ,\cdots ,L$, are the B-spline basis functions of degree $\varrho \ge 1$ associated with a sequence of knots

$t_{-\varrho}=\cdots=t_{-1}=t_{0}=0<t_{1}<\cdots<t_{L}<1=t_{L+1}=\cdots=t_{L+\varrho+1}.$

Let $B\left(u\right)={\left\{{B}_{-\varrho }\left(u\right),\cdots ,{B}_{L}\left(u\right)\right\}}^{\text{T}}$ and $\gamma ={\left({\gamma }_{-\varrho },\cdots ,{\gamma }_{L}\right)}^{\text{T}}$. Then one can write the expression (2.1) as

$h\left(u\right)={\gamma }^{\text{T}}B\left(u\right).$ (2.2)

In our numerical studies and real data analysis, cubic B-splines, i.e. $\varrho =3$, are used in the basis expansion of $h\left(u\right)$. Generally, 3 - 10 internal knots are adequate in practice [22]. In our implementation, we choose the number of internal knots as $L=3,5,7$, respectively. As demonstrated in the following simulation studies, this strategy is appropriate, and the results obtained are not sensitive to the choice among these numbers of internal knots. In addition, once the number is given, we place the knots equally spaced between the smallest and largest values of the ${U}_{i}$ ’s.
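As an illustration of this construction, the clamped cubic B-spline design matrix with equally spaced interior knots can be built as follows. This is a sketch in Python with scipy, not the authors' Matlab code; the function name `bspline_basis` is ours.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(u, n_interior=5, degree=3):
    """Evaluate the B-spline basis B(u) with n_interior knots equally
    spaced between min(u) and max(u) (illustrative sketch)."""
    lo, hi = float(u.min()), float(u.max())
    interior = np.linspace(lo, hi, n_interior + 2)[1:-1]
    # Clamped knot vector: boundary knots repeated degree + 1 times,
    # matching t_{-rho} = ... = t_0 = 0 < t_1 < ... < t_L < 1 = ... above.
    t = np.r_[[lo] * (degree + 1), interior, [hi] * (degree + 1)]
    n_basis = len(t) - degree - 1          # = n_interior + degree + 1
    # Evaluate each basis function via a unit coefficient vector.
    return np.column_stack([BSpline(t, np.eye(n_basis)[l], degree)(u)
                            for l in range(n_basis)])
```

Since the clamped basis sums to one at every point, the constraint $\mathbb{E}\left[h\left(U\right)\right]=0$ (or an equivalent centering) is what keeps the spline coefficients identifiable alongside an overall location shift.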

By virtue of the expansion of h defined in (2.2), the AFT-PLM in (1.2) can be rewritten as

$\mathrm{log}\left(T\right)={\beta }^{\text{T}}Z+{\gamma }^{\text{T}}B\left(U\right)+ϵ={\theta }^{\text{T}}X+ϵ$, (2.3)

where $X={\left({Z}^{\text{T}},B{\left(U\right)}^{\text{T}}\right)}^{\text{T}},\theta ={\left({\beta }^{\text{T}},{\gamma }^{\text{T}}\right)}^{\text{T}}$. Consequently, by applying the weighted logrank estimation method in [5] [6] along with the Gehan weight function, we have the following estimating function

${\Psi }_{G}\left(\theta \right)=\frac{1}{{n}^{2}}\underset{i=1}{\overset{n}{\sum }}\underset{j=1}{\overset{n}{\sum }}{\delta }_{i}\left({X}_{i}-{X}_{j}\right)I\left({e}_{i}\le {e}_{j}\right)$, (2.4)

where ${X}_{i}={\left({Z}_{i}^{\text{T}},B{\left({U}_{i}\right)}^{\text{T}}\right)}^{\text{T}}$ and ${e}_{i}={e}_{i}\left(\theta \right)=\mathrm{log}{Y}_{i}-{\theta }^{\text{T}}{X}_{i}$; this is often referred to as the Gehan estimating function. Moreover, the Gehan estimating function ${\Psi }_{G}\left(\theta \right)$ in (2.4) is the gradient of the following convex loss function

${f}_{G}\left(\theta \right)=\frac{1}{{n}^{2}}\underset{i=1}{\overset{n}{\sum }}\underset{j=1}{\overset{n}{\sum }}{\delta }_{i}\left({e}_{j}-{e}_{i}\right)I\left({e}_{i}\le {e}_{j}\right)$, (2.5)

which is called the Gehan loss function in [9]. Naturally, one can define the Gehan estimator of $\theta$ as the minimizer of the objective function ${f}_{G}\left(\theta \right)$, denoted by $\stackrel{^}{\theta }={\left({\stackrel{^}{\beta }}^{\text{T}},{\stackrel{^}{\gamma }}^{\text{T}}\right)}^{\text{T}}$; the optimization problem can be solved by linear programming [8] for small sample sizes. To derive the large sample properties of the proposed estimators, we assume the smooth function $h\left(.\right)$ is a spline with pre-specified knots. This is mainly due to computational and theoretical considerations; the same idea is employed in [22] [24], among others. Under the regularity conditions C1 - C4 described in [6], it can be shown that the resulting estimator $\stackrel{^}{\theta }$ of $\theta$ is consistent and asymptotically normal, with a covariance matrix that cannot be estimated directly. Thus, to make inference for $\beta$, one has to resort to the resampling approach, which is computationally intensive even for modest sample sizes.
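For concreteness, the Gehan loss in (2.5) can be evaluated directly by noting that $(e_j - e_i)I(e_i \le e_j) = \max(e_j - e_i, 0)$. A minimal numpy sketch with illustrative names (direct O(n^2) computation, not the authors' implementation):

```python
import numpy as np

def gehan_loss(theta, X, logY, delta):
    """Non-smooth Gehan loss f_G(theta) of (2.5); direct O(n^2) sketch."""
    e = logY - X @ theta                      # residuals e_i(theta)
    diff = e[None, :] - e[:, None]            # diff[i, j] = e_j - e_i
    n = len(e)
    # (e_j - e_i) I(e_i <= e_j) = max(e_j - e_i, 0)
    return (delta[:, None] * np.maximum(diff, 0.0)).sum() / n**2
```

Note that censored subjects (delta_i = 0) contribute only through the inner index j, exactly as in (2.5).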

Note that the challenge encountered in the current setting also arises in the rank-based inference problem for the AFT model with censored data. As reviewed in the introduction, these difficulties stem from the non-smoothness of the Gehan estimating function. Recent advances in smoothed rank-based methods enable us to develop an easily implemented estimation method for both the regression coefficients $\beta$ and the possibly nonlinear function h.

Define the following smoothing approximation to the Gehan loss function in (2.5),

${f}_{G,\epsilon }\left(\theta \right)=\frac{1}{{n}^{2}}\underset{i=1}{\overset{n}{\sum }}\underset{j=1}{\overset{n}{\sum }}{\delta }_{i}{K}_{\epsilon }\left({e}_{i}-{e}_{j}\right)$, (2.6)

where ${K}_{\epsilon }$ is a sufficiently smooth real-valued function, having the form

$K_{\epsilon}(v)=\begin{cases}-v, & \text{if } v\le-\epsilon;\\ -\dfrac{1}{16\epsilon^{3}}(v-\epsilon)^{4}-\dfrac{1}{4\epsilon^{2}}(v-\epsilon)^{3}, & \text{if } -\epsilon<v\le\epsilon;\\ 0, & \text{if } v>\epsilon;\end{cases}$

with sufficiently small but strictly positive $\epsilon$. Clearly, ${f}_{G,\epsilon }$ coincides with ${f}_{G}$ whenever all the residual differences ${e}_{i}-{e}_{j}$ lie outside the interval $\left(-\epsilon ,\epsilon \right]$, within which the kink of ${f}_{G}$ is replaced by a polynomial. A simple calculation shows that ${f}_{G,\epsilon }$ has a continuous second-order derivative for any $\epsilon >0$ and, in particular, ${\mathrm{lim}}_{\epsilon \to 0}{f}_{G,\epsilon }\left(\theta \right)={f}_{G}\left(\theta \right)$. We now define the estimator of $\theta$ as the minimizer of the smooth objective function ${f}_{G,\epsilon }\left(\theta \right)$ in (2.6), i.e.

$\stackrel{˜}{\theta }={\left({\stackrel{˜}{\beta }}^{\text{T}},{\stackrel{˜}{\gamma }}^{\text{T}}\right)}^{\text{T}}=\mathrm{arg}{\mathrm{min}}_{\theta }{f}_{G,\epsilon }\left(\theta \right).$

Then h can be estimated by $\stackrel{˜}{h}\left(u\right)={\sum }_{l=-\varrho }^{L}{\stackrel{˜}{\gamma }}_{l}{B}_{l}\left(u\right)$. In fact, if the minimizer exists, it also solves the estimating equation ${\Psi }_{G,\epsilon }\left(\theta \right)=0$, where

${\Psi }_{G,\epsilon }\left(\theta \right)=\frac{\partial }{\partial \theta }{f}_{G,\epsilon }\left(\theta \right)=\frac{1}{{n}^{2}}\underset{i=1}{\overset{n}{\sum }}\underset{j=1}{\overset{n}{\sum }}{\delta }_{i}\left({X}_{i}-{X}_{j}\right){k}_{\epsilon }\left({e}_{i}-{e}_{j}\right)$, (2.7)

$k_{\epsilon}(v)=-\dfrac{\partial}{\partial v}K_{\epsilon}(v)=\begin{cases}1, & \text{if } v\le-\epsilon;\\ \dfrac{1}{4\epsilon^{3}}(v-\epsilon)^{3}+\dfrac{3}{4\epsilon^{2}}(v-\epsilon)^{2}, & \text{if } -\epsilon<v\le\epsilon;\\ 0, & \text{if } v>\epsilon;\end{cases}$

and ${\Psi }_{G,\epsilon }\left(\theta \right)$ is actually the smoothed version of ${\Psi }_{G}\left(\theta \right)$ in (2.4).
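The two kernels can be coded verbatim from the displays above. A small sketch (the function names are ours; a much larger $\epsilon$ than $10^{-4}$ is used in the sanity check so the polynomial window is visible):

```python
import numpy as np

def K_eps(v, eps=1e-4):
    """Polynomial smoothing of the kink max(-v, 0): -v for v <= -eps,
    a quartic on (-eps, eps], and 0 for v > eps."""
    v = np.atleast_1d(np.asarray(v, dtype=float))
    out = np.where(v <= -eps, -v, 0.0)
    mid = (v > -eps) & (v <= eps)
    w = v[mid] - eps
    out[mid] = -w**4 / (16 * eps**3) - w**3 / (4 * eps**2)
    return out

def k_eps(v, eps=1e-4):
    """k_eps = -dK_eps/dv, a smooth surrogate for the indicator I(v <= 0)."""
    v = np.atleast_1d(np.asarray(v, dtype=float))
    out = np.where(v <= -eps, 1.0, 0.0)
    mid = (v > -eps) & (v <= eps)
    w = v[mid] - eps
    out[mid] = w**3 / (4 * eps**3) + 3 * w**2 / (4 * eps**2)
    return out
```

A quick sanity check confirms that $K_\epsilon$ is continuous at $\pm\epsilon$ and that $k_\epsilon(0)=1/2$, the natural smoothed value of the indicator at zero.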

Remark 1. In the asymptotic analysis, the tuning parameter $\epsilon$ should decrease as the sample size n increases. In this paper, we set $\epsilon$ to be $10^{-4}$ wherever it is used. This idea is common in statistical computing and is adopted by [9], among others.

Remark 2. As an alternative, we propose to carry out the computation with quasi-Newton methods, which avoid calculating the Hessian matrix. This advantage is more apparent when the dimension of the parameter to be estimated is high or the problem is ill-posed. Specifically, we recommend the limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method [25]. In our implementation, we terminate the iteration when the relative tolerance falls below $10^{-4}$. As we will see, this procedure works well.
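Putting the pieces together, the fitting step can be sketched as follows. The kernel formulas are repeated inline so the snippet runs on its own; scipy's L-BFGS-B routine is used here as a stand-in for the L-BFGS method of [25], and all names are illustrative rather than the authors' Matlab implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_smoothed_gehan(X, logY, delta, eps=1e-4):
    """Minimize the smoothed Gehan loss f_{G,eps} of (2.6) by L-BFGS-B
    (illustrative O(n^2)-memory sketch)."""
    def loss_grad(theta):
        e = logY - X @ theta
        V = e[:, None] - e[None, :]                  # V[i, j] = e_i - e_j
        mid = (V > -eps) & (V <= eps)
        W = np.where(mid, V - eps, 0.0)              # zero outside the window
        K = np.where(V <= -eps, -V, 0.0) \
            - W**4 / (16 * eps**3) - W**3 / (4 * eps**2)
        k = np.where(V <= -eps, 1.0, 0.0) \
            + W**3 / (4 * eps**3) + 3 * W**2 / (4 * eps**2)
        n = len(e)
        w = delta[:, None] * k                       # pairwise weights
        loss = (delta[:, None] * K).sum() / n**2
        # gradient Psi_{G,eps}: sum_{ij} delta_i k_eps(e_i - e_j)(X_i - X_j) / n^2
        grad = X.T @ (w.sum(axis=1) - w.sum(axis=0)) / n**2
        return loss, grad
    return minimize(loss_grad, np.zeros(X.shape[1]), jac=True,
                    method="L-BFGS-B")
```

Because the objective is smooth and convex, the quasi-Newton iteration converges from the zero starting value without any Hessian computation.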

2.3. Inference

Under the conditions mentioned above and assumptions A1 - A3 in [14], following the arguments in the Appendices of [9] [13], it can be shown that the smoothed estimating function ${\Psi }_{G,\epsilon }\left(\theta \right)$ defined in (2.7) is asymptotically equivalent to the non-smooth one ${\Psi }_{G}\left(\theta \right)$ in (2.4). Furthermore, applying the arguments in the Appendix of [14], we can show that $\stackrel{˜}{\theta }$ is consistent and that ${n}^{1/2}\left(\stackrel{˜}{\theta }-{\theta }_{0}\right)$ converges in distribution to a normal vector with mean zero and covariance matrix ${A}_{\epsilon }^{-1}\left({\theta }_{0}\right){D}_{\epsilon }\left({\theta }_{0}\right){A}_{\epsilon }^{-1}\left({\theta }_{0}\right)$, where ${\theta }_{0}$ is the true value of $\theta$,

$\begin{array}{l}{A}_{\epsilon }\left({\theta }_{0}\right)={\mathrm{lim}}_{n\to \infty }{\mathbb{E}\left\{\frac{\partial }{\partial \theta }{\Psi }_{G,\epsilon }\left(\theta \right)\right\}|}_{\theta ={\theta }_{0}},\\ {D}_{\epsilon }\left({\theta }_{0}\right)={\mathrm{lim}}_{n\to \infty }{\text{Var}\left\{{n}^{1/2}{\Psi }_{G,\epsilon }\left(\theta \right)\right\}|}_{\theta ={\theta }_{0}},\end{array}$

which can be consistently estimated by

${\stackrel{˜}{A}}_{\epsilon }\left(\stackrel{˜}{\theta }\right)={\frac{1}{{n}^{2}}\underset{i=1}{\overset{n}{\sum }}\underset{j=1}{\overset{n}{\sum }}{\delta }_{i}{\left({X}_{i}-{X}_{j}\right)}^{\otimes 2}{k}_{1,\epsilon }\left({e}_{i}-{e}_{j}\right)|}_{\theta =\stackrel{˜}{\theta }},$

${\stackrel{˜}{D}}_{\epsilon }\left(\stackrel{˜}{\theta }\right)={\frac{1}{{n}^{3}}\underset{i=1}{\overset{n}{\sum }}{\delta }_{i}{\left[\underset{j=1}{\overset{n}{\sum }}\left({X}_{i}-{X}_{j}\right)I\left({e}_{i}\le {e}_{j}\right)\right]}^{\otimes 2}|}_{\theta =\stackrel{˜}{\theta }},$

respectively, where ${k}_{1,\epsilon }\left(v\right)=\frac{\partial }{\partial v}{k}_{\epsilon }\left(v\right)$ and ${a}^{\otimes 2}$ denotes $a{a}^{\text{T}}$ for a vector a. The second matrix is obtained by virtue of the asymptotic equivalence between ${n}^{1/2}{\Psi }_{G,\epsilon }\left(\theta \right)$ and ${n}^{1/2}{\Psi }_{G}\left(\theta \right)$; of course, one could also replace it with a version derived in terms of the smoothed estimating function ${\Psi }_{G,\epsilon }\left(\theta \right)$. Compared to the non-smoothed one, ${\stackrel{˜}{D}}_{\epsilon }\left(\stackrel{˜}{\theta }\right)$ is computationally convenient. Therefore, one can make inference for $\theta$ via the estimated covariance matrix ${\stackrel{˜}{A}}_{\epsilon }^{-1}\left(\stackrel{˜}{\theta }\right){\stackrel{˜}{D}}_{\epsilon }\left(\stackrel{˜}{\theta }\right){\stackrel{˜}{A}}_{\epsilon }^{-1}\left(\stackrel{˜}{\theta }\right)$. This estimator is used in our simulation studies and real data analysis.
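The plug-in covariance can be assembled directly from these displays. A sketch with illustrative names (not the authors' code); note that any overall sign convention for ${\stackrel{˜}{A}}_{\epsilon }$ cancels in the sandwich, so we simply accumulate the terms as printed:

```python
import numpy as np

def sandwich_cov(theta, X, logY, delta, eps=1e-4):
    """Plug-in sandwich estimate A~^{-1} D~ A~^{-1} / n of Var(theta-tilde)
    from Section 2.3 (the overall sign of A~ cancels in the sandwich)."""
    n, p = X.shape
    e = logY - X @ theta
    V = e[:, None] - e[None, :]                   # V[i, j] = e_i - e_j
    mid = (V > -eps) & (V <= eps)
    W = np.where(mid, V - eps, 0.0)
    # k_{1,eps} = d k_eps / dv, nonzero only on (-eps, eps]
    k1 = np.where(mid, 3 * W**2 / (4 * eps**3) + 3 * W / (2 * eps**2), 0.0)
    A = np.zeros((p, p))
    D = np.zeros((p, p))
    for i in np.flatnonzero(delta):
        dX = X[i] - X                             # rows are X_i - X_j
        A += dX.T @ (k1[i][:, None] * dX)         # A~ term (times n^2)
        ind = (V[i] <= 0.0).astype(float)         # I(e_i <= e_j)
        s = dX.T @ ind                            # sum_j (X_i - X_j) I(...)
        D += np.outer(s, s)                       # D~ term (times n^3)
    A /= n**2
    D /= n**3
    Ainv = np.linalg.inv(A)
    return Ainv @ D @ Ainv / n
```

The returned matrix estimates the covariance of $\stackrel{˜}{\theta }$ itself (hence the extra division by n relative to the limiting covariance of ${n}^{1/2}\left(\stackrel{˜}{\theta }-{\theta }_{0}\right)$), and it is symmetric positive semidefinite by construction.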

3. Numerical Studies

To assess the performance of our estimation method, we conduct an extensive simulation study under various scenarios. We independently generate the survival times ${T}_{i}$ from the following model

$\mathrm{log}\left({T}_{i}\right)={\beta }_{10}{Z}_{1i}+{\beta }_{20}{Z}_{2i}+{h}_{0}\left({U}_{i}\right)+{ϵ}_{i},\text{\hspace{0.17em}}\text{\hspace{0.17em}}i=1,\cdots ,n,$

where $\left({\beta }_{10},{\beta }_{20}\right)=\left(-0.3,0.3\right)$, ${Z}_{1i}\sim \text{Uniform}\left(-3,3\right)$, ${Z}_{2i}\sim \text{Binomial}\left(1,0.5\right)$, ${U}_{i}\sim \text{Uniform}\left(0,1\right)$, and ${ϵ}_{i}\sim \text{Normal}\left(0,1\right)$. In all simulations, the covariates $\left({Z}_{1i},{Z}_{2i},{U}_{i}\right)$ and errors ${ϵ}_{i}$ are independently generated. The censoring times ${C}_{i}$ are generated independently from exponential distributions with means varying to yield censoring rates of about 15%, 30%, and 50%, respectively. Two different functions for the nonparametric component ${h}_{0}\left(.\right)$ are considered. For the first one, called Case I, ${h}_{0}\left({U}_{i}\right)=\mathrm{sin}\left(2\pi {U}_{i}\right)$, which has one peak and one valley in the domain [0, 1]. For the second one, denoted as Case II, ${h}_{0}\left({U}_{i}\right)=\mathrm{log}\left(1+{U}_{i}^{2}\right)-0.2639$. For the spline approximation to the nonparametric component, we select the number of internal knots as $L=3,5,7$, respectively, and locate them equally spaced between the smallest and largest observations of the ${U}_{i}$ ’s. We set $\epsilon$ in ${K}_{\epsilon }$ to be $10^{-4}$. The sample sizes $n=200,500,1000$, corresponding to the small, moderate, and large levels used in [9], are considered in each scenario with 1000 simulation runs. All simulations are implemented in Matlab, and the initial values are generated from a standard multivariate normal distribution. For $\stackrel{˜}{\beta }$, we record the empirical bias (Bias), sample standard deviation (SD), standard error estimate (SEE), and empirical coverage probability of the 95% confidence interval. For the nonparametric part, we use the mean integrated squared error (IMSE), where

$\text{IMSE}=\left(1/ngrid\right)\underset{i=1}{\overset{ngrid}{\sum }}{\left(\stackrel{˜}{h}\left({u}_{i}\right)-{h}_{0}\left({u}_{i}\right)\right)}^{2},$

evaluated at the fixed grid points $\left\{{u}_{i}\right\}$ between zero and one with step 0.01, and $ngrid$ is the number of these grid points.
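One replication of the Case I design can be generated as follows (a sketch in Python rather than the authors' Matlab script; `cens_mean` is a hypothetical constant that would be tuned to reach the target censoring rate):

```python
import numpy as np

def simulate_case1(n, cens_mean, rng):
    """Generate one data set from the Case I model of Section 3;
    cens_mean controls the censoring rate (illustrative sketch)."""
    Z1 = rng.uniform(-3.0, 3.0, n)
    Z2 = rng.binomial(1, 0.5, n).astype(float)
    U = rng.uniform(0.0, 1.0, n)
    eps_err = rng.normal(0.0, 1.0, n)
    T = np.exp(-0.3 * Z1 + 0.3 * Z2 + np.sin(2 * np.pi * U) + eps_err)
    C = rng.exponential(cens_mean, n)              # censoring times
    Y = np.minimum(T, C)
    delta = (T <= C).astype(float)
    return np.log(Y), delta, np.column_stack([Z1, Z2]), U
```

In practice one would search over `cens_mean` (e.g. by Monte Carlo) until the empirical censoring rate matches 15%, 30%, or 50%.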

Table 1 summarizes the results for Case I with 5 internal knots. It is seen that the proposed estimates $\stackrel{˜}{\beta }$ are quite accurate. Moreover, the coverage probabilities are close to the nominal level 0.95. Remarkably, the results reported here are superior to those summarized in Table 1 of [19], which are based on the P-spline Gehan estimating function without smoothing. It is worth noting that the sample standard deviation is comparable to the standard error estimate obtained by the sandwich estimator based on the smoothed estimating function, even for the small sample size ( $n=200$ ), implying that our procedure is appropriate for estimating the covariance matrix for statistical inference while requiring little computational effort. For a fixed censoring rate, as the sample size increases, both the biases and the standard error estimates decrease; for a fixed sample size, as the censoring rate increases, the biases and standard error estimates increase. The same tendency is also reflected in the IMSE. Figure 1 shows the mean and median of the estimates of the function ${h}_{0}$ from 1000 simulations with sample size 500, censoring rate 30%, and 5 equally-spaced internal

Table 1. Case I Sin curve model: results of parameters with 5 internal knots and cubic B-splines.

Figure 1. The mean (dotted), median (dash-dotted) of estimated function $\stackrel{˜}{h}\left(u\right)$ and 95% pointwise Monte Carlo intervals (dotted) for case I model with number of internal knots $L=3,5,7$, respectively.

knots, which are both close to the true curve $\mathrm{sin}\left(2\pi u\right)$, together with the 95% pointwise Monte Carlo intervals constructed from the 2.5% and 97.5% sample quantiles of the estimated functions. Similar phenomena also occur for the remaining censoring rates and numbers of internal knots (not shown). Therefore, even with a small number of knots, the regression coefficient estimates and the estimated curve perform well.

Table 2 and Figure 2 present the results for the Case II model under the same settings as in Table 1, and show similar patterns.

In addition, as we have stressed, for a single data set with sample size 1000, the proposed L-BFGS algorithm needs only about 20 seconds to complete the full inference; in contrast, in our experiments, computing the parameter estimates and their standard errors by optimizing the

Table 2. Case II log curve model: results of parameters with 3, 5, 7 internal knots and cubic B-splines.

Figure 2. The mean (dotted), median (dash-dotted) of estimated function $\stackrel{˜}{h}\left(u\right)$ and 95% pointwise Monte Carlo intervals (dotted) for case II model with number of internal knots $L=3,5,7$, respectively.

non-smooth objective function in (2.5) with 500 resamples requires about five hours. Clearly, the procedure proposed here is computationally more efficient for practitioners.

4. Application

4.1. Multiple Myeloma Data

The multiple myeloma data set is the primary example in the PHREG procedure and is available in the online SAS/STAT User’s Guide; it was analyzed by [8] under the AFT model and by [15] under the AFT-PLM in (1.2). In the study, there are 65 patients in total, with 48 deaths and 17 censored survival times. We consider the data consisting of possibly censored survival times (T) and two covariates: logBUN, the logarithm of Blood Urea Nitrogen, and age. As pointed out in [15], the age variable might be a confounding factor, and it is treated as the nonparametric component in the following model,

$\mathrm{log}\left(T\right)=\beta ×\mathrm{log}\text{BUN}+h\left(\text{age}\right)+ϵ.$

For the internal knots, we choose $L=3$ or 5 and locate them equally spaced between the smallest and largest observations of the age variable. Applying our proposed method, for $L=3$, $\stackrel{˜}{\beta }=-1.4965$ with estimated standard error $\stackrel{˜}{\sigma }=0.0073$ obtained from the sandwich estimator described in Subsection 2.3; for $L=5$, $\stackrel{˜}{\beta }=-1.5436$ with $\stackrel{˜}{\sigma }=0.0109$. Note that the two slope estimates are both negative and the corresponding estimated standard errors are rather small, implying that Blood Urea Nitrogen is negatively related to the log survival time. A similar conclusion is attained in [15], where the estimate of $\beta$ is −1.955 with standard error estimate 0.807. Figure 3 displays the estimate of the nonparametric part, which visually supports treating the age variable as having a nonlinear effect on the log survival time.

4.2. Nursing Home Usage Data

The data are from an experiment sponsored by the National Center for Health Services Research in 1980-1982, designed to determine the effect of financial incentives on the variation of patient care in nursing homes, involving 36 for-profit nursing homes in San Diego, California. A full description of this data set is given in [26]. The response variable T is measured in days, and the total sample size is $n=1601$. In the model considered below, several covariates are incorporated, namely treatment, sex, marital status, three health status indicators (HSI), and age:

$\begin{array}{l}\mathrm{log}T={\beta }_{1}×\text{treatment}+{\beta }_{2}×\text{sex}+{\beta }_{3}×\text{marital status}\\ \text{ }\text{ }\text{ }+{\beta }_{4}×{\text{HSI}}_{1}+{\beta }_{5}×{\text{HSI}}_{2}+{\beta }_{6}×{\text{HSI}}_{3}+h\left(\text{age}\right)+ϵ,\end{array}$

Figure 3. Estimated nonparametric component for the multiple myeloma data; L is the number of internal knots used in the analysis.

where HSI1 - HSI3 are three binary health status indicators ranging from the best health to the worst. In analyzing the data set, we discard the ten observations whose observed T is zero and utilize the remaining 1591 records. The coefficient estimates are presented in Table 3. Figure 4 displays the estimated nonparametric part; the claim that the age variable has a linear effect on the log survival time seems plausible. The drastic fluctuation in the right tail possibly results from the fact that few observations of age are available there. [9] analyzed these data using the AFT model and found that the age variable is not statistically significant, which agrees with our results.

5. Discussions

The accelerated failure time partial linear model (AFT-PLM) is a natural extension of the classic AFT model; it allows some covariates to relate to the log failure time in a nonlinear manner and thus provides a more flexible and parsimonious way of modelling. In this paper, we employ cubic B-splines to approximate the nonparametric smooth function in model (1.2), which enables us to apply the efficient rank-based inference approach for the AFT model to our current situation. Specifically, building on recent achievements in dealing with non-smooth estimating functions, we propose a polynomial-based smoothed Gehan estimating function and show that the resulting estimators are consistent and asymptotically normal under certain regularity conditions. Utilizing the smoothed version and the fact that it is the gradient of a smooth convex loss function, to solve the resulting equations we adopt the

Table 3. Analysis of nursing home usage data.

Figure 4. L, the number of internal knots used in the analysis of nursing home usage data.

L-BFGS method. In addition, the other primary advantage of our smoothing proposal is that one can estimate the standard error directly and efficiently through the sandwich-formed covariance matrix without resorting to computationally intensive resampling. As is seen in the simulation studies, our method performs well for estimation of both the regression coefficients and the nonparametric component.
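To illustrate how a smoothed Gehan loss can be minimized by L-BFGS, here is a hedged sketch. The paper's specific polynomial smoother is not reproduced; a simple quadratic smoothing of the hinge within a bandwidth h stands in for it, the data are simulated, and SciPy's L-BFGS-B routine with numerical gradients replaces a dedicated L-BFGS implementation. Only linear covariates are shown; in the proposed method the B-spline coefficients would enter the same loss as additional columns of Z.

```python
import numpy as np
from scipy.optimize import minimize

def smooth_pos(u, h):
    """Quadratic smoothing of the hinge max(u, 0) within bandwidth h.
    (A stand-in for the paper's polynomial smoother.)"""
    return np.where(u >= h, u,
           np.where(u <= -h, 0.0, (u + h) ** 2 / (4.0 * h)))

def smoothed_gehan_loss(beta, logY, delta, Z, h=0.05):
    """Smoothed Gehan loss: n^{-1} sum_{i,j} delta_i * (e_i - e_j)^-,
    with the hinge replaced by its smooth surrogate."""
    e = logY - Z @ beta              # residuals e_i(beta)
    diff = e[None, :] - e[:, None]   # diff[i, j] = e_j - e_i
    return (delta[:, None] * smooth_pos(diff, h)).sum() / len(e)

# Simulated data: log T = Z1 - 0.5*Z2 + error, with random right censoring.
rng = np.random.default_rng(0)
n, beta_true = 200, np.array([1.0, -0.5])
Z = rng.normal(size=(n, 2))
logT = Z @ beta_true + 0.3 * rng.normal(size=n)
logC = rng.normal(1.0, 1.0, size=n)      # censoring times
delta = (logT <= logC).astype(float)     # event indicator
logY = np.minimum(logT, logC)            # observed log time

# The loss is smooth and convex in beta, so a quasi-Newton method applies.
fit = minimize(smoothed_gehan_loss, x0=np.zeros(2),
               args=(logY, delta, Z), method="L-BFGS-B")
beta_hat = fit.x
```

In this simulated setting, `beta_hat` lands close to the true coefficients (1, -0.5); the smoothness of the surrogate loss is what makes quasi-Newton methods such as L-BFGS applicable, in contrast to the original step-function estimating equation.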

Naturally, our method can be straightforwardly extended to the case where more than one covariate has a nonlinear effect. However, the number of parameters to be estimated then increases, especially when more internal knots are chosen to approximate the nonparametric components. In that case, it is necessary to incorporate a penalty term into the objective loss function defined in (2.6) before proceeding with inference. For estimating the joint asymptotic covariance matrix, the sandwich estimator may become numerically unreliable in high dimensions, so other efficient approaches should be developed.

Another possible extension to the AFT-PLM in (1.2) is to consider the more general partial linear single index AFT model, which is specified as

$\mathrm{log}\left(T\right)={\beta }^{\text{T}}Z+h\left({\alpha }^{\text{T}}U\right)+ϵ$,

where U is a vector of nuisance covariates, Z is a vector of covariates of primary interest, and $h\left(\cdot \right)$ is an unknown univariate smooth function that plays the role of a link function. This model is popular because it achieves dimension reduction and thereby avoids the “curse of dimensionality”. How to adapt our proposed method to this model is an interesting question that will be investigated in future work.

In this paper, we have assumed that the unknown function is a spline function with a fixed number of knots when establishing the asymptotic properties. Our simulation studies indicate that a small number of knots is sufficient and that the bias caused by the spline approximation is small and does not appreciably affect the estimates of the regression coefficients. Without this assumption, the number of knots should grow with the sample size; developing asymptotic results in that setting is interesting but beyond the scope of this paper.

Foundation Item

The Youth Foundation (ZJGQN004) of school of Zhangjiagang, JUST.

Cite this paper: Chen, W. and Ren, F.L. (2020) Polynomial-Based Smoothing Estimation for a Semiparametric Accelerated Failure Time Partial Linear Model. Open Access Library Journal, 7, 1-15. doi: 10.4236/oalib.1106824.
References

[1]   Cox, D.R. (1972) Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B (Methodological), 34, 187-220. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x

[2]   Struthers, C.A. and Kalbfleisch, J.D. (1986) Misspecified Proportional Hazard Models. Biometrika, 73, 363-369. https://doi.org/10.1093/biomet/73.2.363

[3]   Buckley, J. and James, I. (1979) Linear Regression with Censored Data. Biometrika, 66, 429-436. https://doi.org/10.1093/biomet/66.3.429

[4]   Ritov, Y. (1990) Estimation in a Linear Regression Model with Censored Data. The Annals of Statistics, 18, 303-328. https://doi.org/10.1214/aos/1176347502

[5]   Tsiatis, A.A. (1990) Estimating Regression Parameters Using Linear Rank Tests for Censored Data. The Annals of Statistics, 18, 354-372. https://doi.org/10.1214/aos/1176347504

[6]   Ying, Z. (1993) A Large Sample Study of Rank Estimation for Censored Regression Data. The Annals of Statistics, 21, 76-99. https://doi.org/10.1214/aos/1176349016

[7]   Fygenson, M. and Ritov, Y. (1994) Monotone Estimating Equations for Censored Data. The Annals of Statistics, 22, 732-746. https://doi.org/10.1214/aos/1176325493

[8]   Jin, Z., Lin, D.Y. and Wei, L.J. (2003) Rank-Based Inference for the Accelerated Failure Time Model. Biometrika, 90, 341-353. https://doi.org/10.1093/biomet/90.2.341

[9]   Chung, M., Long, Q. and Johnson, B.A. (2013) A Tutorial on Rank-Based Coefficient Estimation for Censored Data in Small- and Large-Scale Problems. Statistics and Computing, 23, 601-614. https://doi.org/10.1007/s11222-012-9333-9

[10]   Chiou, S.H., Kang, S. and Yan, J. (2012) Fast Accelerated Failure Time Modeling for Case-Cohort Data. Statistics and Computing, 24, 559-568.

[11]   Brown, B.M. and Wang, Y.G. (2005) Standard Errors and Covariance Matrices for Smoothed Rank Estimators. Biometrika, 92, 149-158. https://doi.org/10.1093/biomet/92.1.149

[12]   Brown, B.M. and Wang, Y.G. (2007) Induced Smoothing for Rank Regression with Censored Survival Times. Statistics in Medicine, 26, 828-836. https://doi.org/10.1002/sim.2576

[13]   Heller, G. (2007) Smoothed Rank Regression with Censored Data. Journal of the American Statistical Association, 102, 552-559. https://doi.org/10.1198/016214506000001257

[14]   Johnson, L.M. and Strawderman, R.L. (2009) Induced Smoothing for the Semiparametric Accelerated Failure Time Model: Asymptotics and Extensions to Clustered data. Biometrika, 96, 577-590. https://doi.org/10.1093/biomet/asp025

[15]   Chen, K., Shen, J. and Ying, Z. (2005) Rank Estimation in Partial Linear Model with Censored Data. Statistica Sinica, 15, 767-779.

[16]   Härdle, W. and Liang, H. (2007) Partially Linear Models. Springer, Berlin Heidelberg.

[17]   Orbe, J., Ferreira, E. and Núñez-Antón, V. (2003) Censored Partial Regression. Biostatistics, 4, 109-121. https://doi.org/10.1093/biostatistics/4.1.109

[18]   Stute, W. (1993) Consistent Estimation under Random Censorship When Covariables Are Present. Journal of Multivariate Analysis, 45, 89-103. https://doi.org/10.1006/jmva.1993.1028

[19]   Zou, Y., Zhang, J. and Qin, G. (2011) A Semiparametric Accelerated Failure Time Partial Linear Model and Its Application to Breast Cancer. Computational Statistics and Data Analysis, 55, 1479-1487. https://doi.org/10.1016/j.csda.2010.10.012

[20]   Liu, X., Wang, L. and Liang, H. (2011) Estimation and Variable Selection for Semiparametric Additive Partial Linear Models. Statistica Sinica, 21, 1225. https://doi.org/10.5705/ss.2009.140

[21]   Ma, S. and Kosorok, M.R. (2005) Penalized Log-Likelihood Estimation for Partly Linear Transformation Models with Current Status Data. The Annals of Statistics, 33, 2256-2290. https://doi.org/10.1214/009053605000000444

[22]   Huang, J.Z. and Liu, L. (2006) Polynomial Spline Estimation and Inference of Proportional Hazards Regression Models with Flexible Relative Risk Form. Biometrics, 62, 793-802. https://doi.org/10.1111/j.1541-0420.2005.00519.x

[23]   De Boor, C. (1978) A Practical Guide to Splines. Springer-Verlag, New York. https://doi.org/10.1007/978-1-4612-6333-3

[24]   Shang, S., Liu, M. and Zeleniuch-Jacquotte, A. (2013) Partially Linear Single Index Cox Regression Model in Nested Case-Control Studies. Computational Statistics and Data Analysis, 67, 199-212. https://doi.org/10.1016/j.csda.2013.05.011

[25]   Nocedal, J. and Wright, S.J. (2006) Numerical Optimization. Springer, New York.

[26]   Morris, C.N., Norton, E.C. and Zhou, X.H. (1994) Parametric Duration Analysis of Nursing Home Usage. In: Case Studies in Biometry, John Wiley and Sons, New York, 231-248.
