Sequential Shrinkage Estimate for Cox Regression Models with Uncertain Number of Effective Variables


1. Introduction

The Cox proportional hazards model is a popular choice for analyzing censored survival data with covariates; see [1] [2] [3]. It has been widely used in many areas, such as biomedical research and engineering, for assessing covariate effects on the time to an event in the presence of censoring. However, in fields such as biology, engineering, and epidemiology, data sets often contain a large number of explanatory variables of which only a few contribute to the model; these are called effective variables in [4]. Many methods, such as the LASSO [5] and LARS [6], focus on identifying the effective variables. However, one also wants to know how many samples are needed to identify the effective variables and simultaneously make the parameter estimates achieve a pre-specified accuracy. This question is important whenever sampling is costly, as it often is in biology and epidemiology. For the linear regression model, Wang and Chang [4] proposed a sequential shrinkage estimation method that identifies the effective variables and attains the required estimation accuracy. For Cox regression models, no similar method has been proposed, so this problem remains open.

To handle the problem mentioned above, we propose a sequential procedure for constructing a fixed-size confidence set for the effective parameters based on an adaptive shrinkage estimate (ASE), so that the effective coefficients can be identified efficiently with the minimum sample size. Suppose the conditional hazard rate of a survival time *T*, given the regressor vector *X*, is written as

$h\left(t|x\right)={h}_{0}\left(t\right)\mathrm{exp}\left({\beta}^{\prime}X\right),t\ge 0$ (1)

This paper studies the model under a fixed design, and the consistency and asymptotic properties of the proposed estimator are obtained under this design. The rest of this paper is organized as follows. In Section 2, we give the adaptive shrinkage estimate (ASE) based on the maximum partial likelihood estimate (MPLE) of the Cox regression model, together with its asymptotic properties. In Section 3, the sequential sampling strategy based on the ASE, the stopping rule, and the resulting random-size confidence set are presented. In Section 4, a numerical example illustrates the performance of the proposed method on synthesized data sets.

2. Sequential Adaptive Shrinkage Estimate

2.1. Asymptotic Properties of MPLE

Let ${T}_{i}$ and ${C}_{i}$ be the potential failure time and censoring time of the *i*-th $\left(i\in {N}_{+}\right)$ subject from a random sample of *n* individuals, respectively, and let ${X}_{i}={\left({X}_{i1},{X}_{i2},\cdots ,{X}_{ip}\right)}^{\text{T}}$ be a *p*-dimensional vector of covariates for the *i*-th individual, assumed time-independent throughout this paper. Assume that ${T}_{i}$ and ${C}_{i}$ are conditionally independent given ${X}_{i}$. In practice, the failure time ${T}_{i}$ might not always be observed, due to censoring caused by the termination of the study or early withdrawal from it. What we actually observe are ${Y}_{i}=\mathrm{min}\left\{{T}_{i},{C}_{i}\right\}$, the smaller of the failure time and the censoring time, and ${\delta}_{i}=I\left\{{T}_{i}\le {C}_{i}\right\}$, the indicator that failure has been observed. The data then consist of the triplets $\left({Y}_{i},{\delta}_{i},{X}_{i}\right),i=1,2,\cdots ,n$. Suppose there are no ties among the failure times. Let ${t}_{1}<{t}_{2}<\cdots <{t}_{N}$ denote the *N* ordered times of observed failures and let (*j*) be the label of the individual that fails at ${t}_{j}$. Let ${R}_{j}=\left\{i:{Y}_{i}\ge {t}_{j}\right\}$ be the risk set at time ${t}_{j}$. The partial likelihood of model (1) is defined as

$\underset{j=1}{\overset{N}{\prod}}\frac{\mathrm{exp}\left({\beta}^{\text{T}}{X}_{\left(j\right)}\right)}{{\displaystyle {\sum}_{i\in {R}_{j}}\mathrm{exp}\left({\beta}^{\text{T}}{X}_{i}\right)}}$ (2)

and the log partial likelihood is then,

$L\left(\beta \right)={\displaystyle \underset{j=1}{\overset{N}{\sum}}\left\{{\beta}^{\text{T}}{X}_{\left(j\right)}-\mathrm{log}\left[{\displaystyle \underset{i\in {R}_{j}}{\sum}\mathrm{exp}\left({\beta}^{\text{T}}{X}_{i}\right)}\right]\right\}}$ (3)

The maximum partial likelihood estimate of $\beta $, ${\tilde{\beta}}$, is found by solving the score equation $U\left(\beta \right)=0$, where

$U\left(\beta \right)=\frac{\partial L\left(\beta \right)}{\partial \beta}={\displaystyle \underset{j=1}{\overset{N}{\sum}}\left\{{X}_{\left(j\right)}-\frac{{\displaystyle {\sum}_{i\in {R}_{j}}{X}_{i}\mathrm{exp}\left({\beta}^{\text{T}}{X}_{i}\right)}}{{\displaystyle {\sum}_{i\in {R}_{j}}\mathrm{exp}\left({\beta}^{\text{T}}{X}_{i}\right)}}\right\}}$ (4)
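As an illustration, here is a minimal NumPy sketch of the log partial likelihood (3) and the score (4) under the no-ties assumption (the function names and interfaces are ours, not part of the original method):

```python
import numpy as np

def cox_log_partial_likelihood(beta, y, delta, X):
    """Log partial likelihood (3); y: observed times, delta: failure
    indicators, X: n x p covariate matrix (no tied failure times)."""
    order = np.argsort(y)                    # visit subjects in time order
    y, delta, X = y[order], delta[order], X[order]
    eta = X @ beta                           # linear predictors beta^T x_i
    logL = 0.0
    for j in range(len(y)):
        if delta[j]:                         # only observed failures contribute
            risk = y >= y[j]                 # risk set R_j = {i : Y_i >= t_j}
            logL += eta[j] - np.log(np.sum(np.exp(eta[risk])))
    return logL

def cox_score(beta, y, delta, X):
    """Score vector U(beta) in (4): the gradient of the log partial
    likelihood with respect to beta."""
    order = np.argsort(y)
    y, delta, X = y[order], delta[order], X[order]
    eta = X @ beta
    U = np.zeros_like(beta, dtype=float)
    for j in range(len(y)):
        if delta[j]:
            risk = y >= y[j]
            w = np.exp(eta[risk])            # exp(beta^T x_i) for i in R_j
            U += X[j] - (w @ X[risk]) / w.sum()
    return U
```

The score returned by `cox_score` agrees with the numerical gradient of `cox_log_partial_likelihood`, which is a convenient sanity check when implementing (3) and (4).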

2.2. Adaptive Shrinkage Estimate

Let $\kappa =\kappa \left(n\right)$ be a non-random function of *n* such that for some $0<\delta <1/2$ and $\gamma >0$, ${n}^{1/2}\kappa \to 0$ and ${n}^{1/2+\gamma \delta}\kappa \to \infty $ as $n\to \infty $. In this paper, we need the following assumptions:

(A1) ${x}_{i}$ satisfies ${\mathrm{sup}}_{i}\Vert {x}_{i}\Vert <\infty $, and the residual term ${\epsilon}_{i}=\stackrel{^}{\Lambda}\left({Y}_{i}\right)\mathrm{exp}\left({\stackrel{^}{\beta}}^{\text{T}}{X}_{i}\right)$ has $E{\left|{\epsilon}_{i}\right|}^{\zeta}<\infty $ for some $\zeta >2$, where $\stackrel{^}{\Lambda}$ is some cumulative baseline hazard function.

(A2) $\underset{n\to \infty}{\mathrm{lim}}{I}_{n}\left(\beta \right)/n=\Sigma $, where ${I}_{n}\left(\beta \right)$ is the information matrix of $\beta $ and $\Sigma $ is a positive definite matrix.

Then, Theorem 3.1 in [7] implies that ${n}^{1/2-\eta}\left({\tilde{\beta}}_{n}-{\beta}_{0}\right)=O\left(1\right)$ almost surely as *n* tends to $\infty $ for some $\eta >0$. Define

${\kappa}_{nj}=\kappa {\left|{\tilde{\beta}}_{nj}\right|}^{-\gamma}$

with ${\tilde{\beta}}_{nj}$ being the *j*-th component of ${\tilde{\beta}}_{n}$. From the definition of $\kappa $ and the asymptotic property of ${\tilde{\beta}}_{n}$ we have ${n}^{1/2}{\kappa}_{nj}\to 0\times I\left({\beta}_{0j}\ne 0\right)+\infty \times I\left({\beta}_{0j}=0\right)$ almost surely as $n\to \infty $, where $I(\cdot )$ denotes the indicator function and we adopt the convention $0\times \infty =0$. Similar to Wang and Chang [4], define ${\stackrel{^}{\beta}}_{n}={I}_{n}\left(\epsilon \right){\tilde{\beta}}_{n}$ as an adaptive shrinkage estimate (ASE) of ${\beta}_{0}$, where ${I}_{n}\left(\epsilon \right)=\mathrm{diag}\left\{{I}_{n1}\left(\epsilon \right),{I}_{n2}\left(\epsilon \right),\cdots ,{I}_{np}\left(\epsilon \right)\right\}$ is a $p\times p$ diagonal matrix. So far, we have good statistical properties of the proposed ASE under a non-random sample size, but our goal is to determine a sample size under which the ASE attains the required accuracy. To this end, we introduce a sequential sampling scheme based on the ASE below. Constructing a confidence set for ${\beta}_{0}$ requires the asymptotic distribution of ${\stackrel{^}{\beta}}_{n}$, and under sequential analysis the sample size is a random variable, so we need the asymptotic properties of the ASE under a random sample size. Fortunately, uniform continuity in probability, see [8] and [9], is a sufficient condition for a randomly stopped sequence to have the same asymptotic distribution as the fixed-sample-size estimate. That is, $\sqrt{n}\left({\stackrel{^}{\beta}}_{n}-{\beta}_{0}\right),n=1,2,\cdots $, has the property of uniform continuity in probability, which implies the following theorem.
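As a concrete sketch of the shrinkage step: the component indicator ${I}_{nj}\left(\epsilon \right)$ is not fully specified in this excerpt, so the code below assumes a hard-thresholding rule that keeps component *j* when $\left|{\tilde{\beta}}_{nj}\right|\ge {\kappa}_{nj}$, and takes $\kappa \left(n\right)={n}^{-\left(1/2+\gamma \delta /2\right)}$, one choice satisfying the rate conditions on $\kappa $; both are our assumptions:

```python
import numpy as np

def adaptive_shrinkage(beta_tilde, n, gamma=1.0, delta=0.45):
    """Sketch of the ASE beta_hat = I_n(eps) * beta_tilde.

    Assumptions (not pinned down in the text): kappa(n) = n^(-(1/2 +
    gamma*delta/2)), which satisfies n^(1/2)*kappa -> 0 and
    n^(1/2 + gamma*delta)*kappa -> inf, and the hard-thresholding rule
    I_nj = I(|beta_tilde_j| >= kappa_nj) with kappa_nj = kappa*|beta_tilde_j|^(-gamma).
    """
    beta_tilde = np.asarray(beta_tilde, dtype=float)
    kappa = n ** -(0.5 + gamma * delta / 2.0)
    with np.errstate(divide="ignore"):                 # zero components give kappa_nj = inf
        kappa_nj = kappa * np.abs(beta_tilde) ** -gamma
    keep = np.abs(beta_tilde) >= kappa_nj              # diagonal of I_n(eps)
    return np.where(keep, beta_tilde, 0.0), keep
```

For example, with $n=400$ and MPLE components $\left(-1.2,2.0,0.01,0\right)$, the two large components survive and the two near-zero components are shrunk exactly to zero.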

Theorem 1. Suppose that (A1) and (A2) are satisfied, and let $N\left(t\right)$ be a positive integer-valued random variable such that $N\left(t\right)/t$ converges to 1 in probability as $t\to \infty $. Then

$\sqrt{N\left(t\right)}\left({\stackrel{^}{\beta}}_{N\left(t\right)}-{\beta}_{0}\right)\to N\left(0,{I}_{0}{\Sigma}^{-1}{I}_{0}\right)$

in distribution as $t\to \infty $.

From Theorem 1, we can construct a confidence set for ${\beta}_{0}$ and a stopping rule for the sequential sampling procedure that determines the final sample size. Let ${C}_{k}$ denote the first *k* observations $\left\{\left({y}_{i},{x}_{i}\right):i=1,2,\cdots ,k\right\}$. Define a stopping rule ${N}_{d}$ as

$N={N}_{d}\equiv \mathrm{inf}\left\{k\ge {n}_{0}:\frac{{d}^{2}}{{a}_{k}^{2}}\ge {\nu}_{k}\right\}$ (5)

In the sequential estimation procedure, one new observation is collected at a time until the stopping criterion is satisfied. When the stopping rule holds, a confidence set for ${\beta}_{0}$ based on the *N* samples is constructed as follows,

${R}_{N}=\left\{Z\in {R}^{p}:\frac{{S}_{N}}{N}\le \frac{{d}^{2}}{{\nu}_{N}};{I}_{Nj}\left(\epsilon \right)=0\Rightarrow {Z}_{j}=0,1\le j\le p\right\}$ (6)

where ${S}_{N}={\left({Z}_{{N}_{1}}-{\stackrel{^}{\beta}}_{{N}_{1}}\right)}^{\text{T}}{\tilde{\Sigma}}_{11}\left({Z}_{{N}_{1}}-{\stackrel{^}{\beta}}_{{N}_{1}}\right)$. Properties of the sequential procedure and the confidence set ${R}_{N}$ are summarized below.
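The one-observation-at-a-time scheme can be sketched generically. The interfaces below, `nu_hat` for the plug-in estimate ${\nu}_{k}$ and `a_of` for the quantile constant ${a}_{k}$, are our own assumptions, since their exact construction is not given in this excerpt:

```python
def sequential_stop(stream, d, n0, nu_hat, a_of):
    """Sketch of stopping rule (5): take one new observation at a time and
    stop at the first k >= n0 with d**2 / a_k**2 >= nu_k.

    stream yields triplets (y_i, delta_i, x_i); nu_hat(data) and a_of(data)
    are caller-supplied estimators (hypothetical interfaces)."""
    data = []
    for obs in stream:
        data.append(obs)
        k = len(data)
        if k < n0:
            continue                       # enforce a minimum pilot sample n0
        if d ** 2 / a_of(data) ** 2 >= nu_hat(data):
            return k, data                 # N = N_d and the collected sample
    raise RuntimeError("stream exhausted before the stopping rule was met")
```

With a toy estimator ${\nu}_{k}=1/k$ and ${a}_{k}=1$, the rule with $d=0.1$ stops exactly at $k=100$, illustrating how smaller *d* forces a larger final sample size.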

Theorem 2. Assume that (A1) and (A2) are satisfied, and let *N* be the stopping time defined in Equation (5). Then 1) $\underset{d\to 0}{\mathrm{lim}}{d}^{2}N/{a}^{2}\nu =1$ almost surely; 2) $\underset{d\to 0}{\mathrm{lim}}{d}^{2}E\left(N\right)/{a}^{2}\nu =1$ ; 3) $\underset{d\to 0}{\mathrm{lim}}{\stackrel{^}{p}}_{0}\left(N\right)={p}_{0}$ almost surely; 4) $\underset{d\to 0}{\mathrm{lim}}E\left({\stackrel{^}{p}}_{0}\left(N\right)\right)={p}_{0}$, where $\nu $ is the maximum eigenvalue of the matrix ${I}_{0}{\Sigma}^{-1}{I}_{0}$.

3. Example and Simulation

We evaluate the performance of the proposed method via sequential fixed-size confidence estimation using synthesized data sets. As mentioned previously, by the definition of the stopping rule, the confidence ellipsoid constructed when sampling stops has the prescribed precision and coverage probability. Thus, we can compare the average stopping times of the procedures based on the MPLE and the ASE. Since the proposed method ignores the non-effective variables, we expect its average stopping time to be significantly smaller than that of the procedure based on the MPLE, which has no variable-identification mechanism. If the ${p}_{0}$ effective variables are known in advance, then the most efficient procedure is, of course, to use only these ${p}_{0}$ variables. Therefore, we also construct a sequential procedure for this situation, and the results of the cases with known ${p}_{0}$ serve as the baseline, in which the smallest sample size is achieved asymptotically.

The synthesized data sets for the model with fixed designs are generated as follows: the regressors ${x}_{i}$ are generated independently beforehand from a standard multivariate normal distribution with mean 0 and identity covariance matrix, and the error term ${e}_{i}$ is drawn independently from the standard normal distribution for each $i\ge 1$. The responses are generated from model (1) with baseline hazard ${h}_{0}\left(t\right)={t}^{2}$, chosen arbitrarily and without loss of generality, and the true parameter is ${\beta}_{0}=\left(-1.2,2.0,0,0,0,0,0,0,0,0\right)$ with 8 non-effective variables. Different precisions of the confidence ellipsoid, $d\in \left\{0.3,0.4,0.5,0.6\right\}$, are chosen, with coverage probability equal to 95% ($\alpha =0.05$) in the simulation. We choose $\gamma =1$, $\delta =0.45$ and $\theta =0.75$ in analyzing the simulated data. When applying the ASE method, the regularization parameter $\epsilon $ needs to be determined by some model selection criterion, such as AIC, BIC, or GCV. For convenience, we only use BIC to illustrate our method,

$\text{BIC}=-2\left({\displaystyle \underset{j=1}{\overset{N}{\sum}}\left({\beta}^{\text{T}}{X}_{\left(j\right)}-\mathrm{log}\left({\displaystyle \underset{i\in {R}_{j}}{\sum}\mathrm{exp}\left({\beta}^{\text{T}}{X}_{i}\right)}\right)\right)}\right)+\mathrm{log}\left(n\right)\times df/n$,

where *df* is the number of the non-zero components in
$\beta $.
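For reproducibility, the data-generating step described above can be sketched by inverting the cumulative baseline hazard: with ${h}_{0}\left(t\right)={t}^{2}$ we have ${\Lambda}_{0}\left(t\right)={t}^{3}/3$, and solving ${\Lambda}_{0}\left(T\right)\mathrm{exp}\left({\beta}^{\text{T}}x\right)=E$ with $E\sim \mathrm{Exp}\left(1\right)$ gives $T={\left(3E\mathrm{exp}\left(-{\beta}^{\text{T}}x\right)\right)}^{1/3}$. The exponential censoring distribution below is our own assumption, since the paper does not state one:

```python
import numpy as np

def simulate_cox(n, beta, censor_scale=2.0, seed=0):
    """Generate (y_i, delta_i, x_i) from model (1) with h0(t) = t^2.

    With cumulative baseline hazard Lambda0(t) = t^3 / 3, inverting
    Lambda0(T) * exp(beta^T x) = E with E ~ Exp(1) gives
    T = (3 * E * exp(-beta^T x))^(1/3).  The exponential censoring
    time with scale censor_scale is an assumption for illustration."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    X = rng.standard_normal((n, p))            # standard normal regressors
    E = rng.exponential(1.0, size=n)           # unit-exponential draws
    T = (3.0 * E * np.exp(-X @ beta)) ** (1.0 / 3.0)
    C = rng.exponential(censor_scale, size=n)  # censoring times
    y = np.minimum(T, C)
    delta = (T <= C).astype(int)
    return y, delta, X
```

Calling `simulate_cox(n, beta0)` with ${\beta}_{0}=\left(-1.2,2.0,0,\cdots ,0\right)$ reproduces one synthetic data set of the kind used in the simulation.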

Table 1 states the results of the sequential sampling method for Cox regression. In the table, we list the final sample size *N* (stopping time), $\kappa ={d}^{2}N/{a}^{2}\nu $, and the empirical coverage probability (CP) of the 95% confidence set ${R}_{N}$. For all three cases, MPLE, ${\text{MPLE}}_{{p}_{0}}$ and ASE, the value of $\kappa $ is very close to 1, and the empirical coverage probability CP approaches the nominal 95% as *d* decreases, as stated in Theorem 2. However, the sample size *N* of MPLE is much larger than those of the other two cases, and the sample size of ASE is very close to that of ${\text{MPLE}}_{{p}_{0}}$. In conclusion, the proposed ASE is more efficient than MPLE.

Table 2 reports the power of identifying the effective variables and the estimates of the regression coefficients for Cox regression. We can see that the number of nonzero variables incorrectly identified as zero ( ${N}_{ic}^{\ast}$ ) using ASE is almost 0, and the number of correctly identified zero variables ( ${N}_{c}^{\ast}$ ) is very close to the true number of non-effective variables (8). These results suggest that ${\stackrel{^}{p}}_{0}$ is a good estimator of ${p}_{0}$ under the sequential sampling method based on ASE. The MPLE procedure does not identify the effective variables, so ${N}_{c}^{\ast}$ and ${N}_{ic}^{\ast}$ are not available for it. In addition, all parameter estimates of the effective variables are very close to the true values.

Table 1. Results of the sequential sampling method based on ASE, MPLE with all variables, and ${\text{MPLE}}_{{p}_{0}}$ with only the ${p}_{0}$ non-zero variables, for the Cox regression model.

${\kappa}^{*}={d}^{2}N/\left({a}^{2}\nu \right)$ ; $C{P}^{+}$ is the empirical coverage probability of 95% confidence ellipsoid region ${R}_{N}$ ; **Empirical standard deviations are in parentheses.

Table 2. Power of variable identification and estimation of nonzero components under the sequential sampling method based on ASE and MPLE, for the Cox regression model.

${N}_{c}^{\ast}$ and ${N}_{ic}^{\ast}$ are the average numbers of zero components in $\beta $ correctly identified and of nonzero components incorrectly estimated as zero, respectively.

4. Conclusion

Based on an ASE of the parameters in the Cox regression model, a sequential sampling procedure is constructed to estimate the minimum sample size needed to identify the effective variables and simultaneously estimate the parameters with the required accuracy. We prove that the proposed sequential procedure is asymptotically optimal in the sense of Chow and Robbins [10]. Simulation studies show that the proposed method requires a much smaller sample size than the traditional sequential sampling method. However, this paper assumes the dimension of the covariates is fixed rather than growing with the sample size. Our future work is to investigate the properties of sequential sampling methods in which the number of variables grows with the sample size.

Funding

This research was supported by Research projects of universities in Xinjiang Uygur Autonomous Region under Grant No. XJEDU2016I033 and Xinjiang Normal University postdoctoral research foundation under Grant No. XJNUBS1539.

References

[1] Cox, D.R. (1972) Regression Models and Life-Tables. Journal of the Royal Statistical Society: Series B, 34, 187-220.

https://doi.org/10.1111/j.2517-6161.1972.tb00899.x

[2] Cox, D.R. (1975) Partial Likelihood. Biometrika, 62, 269-276.

https://doi.org/10.1093/biomet/62.2.269

[3] Andersen, P.K. and Gill, R.D. (1982) Cox's Regression Model for Counting Processes: A Large Sample Study. Annals of Statistics, 10, 1100-1120.

https://doi.org/10.1214/aos/1176345976

[4] Wang, Z.F. and Chang, Y.I. (2013) Sequential Estimate for Linear Regression Models with Uncertain Number of Effective Variables. Metrika, 76, 949-978.

https://doi.org/10.1007/s00184-012-0426-4

[5] Tibshirani, R. (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58, 267-288.

https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

[6] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least Angle Regression. Annals of Statistics, 32, 407-499.

https://doi.org/10.1214/009053604000000067

[7] Tsiatis, A.A. (1981) A Large Sample Study of Cox's Regression Model. Annals of Statistics, 9, 93-108.

https://doi.org/10.1214/aos/1176345335

[8] Anscombe, F.J. (1952) Large Sample Theory of Sequential Estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 48, 600-607.

https://doi.org/10.1017/S0305004100076386

[9] Woodroofe, M. (1982) Nonlinear Renewal Theory in Sequential Analysis. Society for Industrial and Applied Mathematics, Philadelphia.

https://doi.org/10.1137/1.9781611970302

[10] Chow, Y.S. and Robbins, H. (1965) On the Asymptotic Theory of Fixed-Width Sequential Confidence Intervals for the Mean. Annals of Mathematical Statistics, 36, 457-462.

https://doi.org/10.1214/aoms/1177700156