Modeling Individual Patient Count/Rate Data over Time with Applications to Cancer Pain Flares and Cancer Pain Medication Usage

Show more

1. Introduction

An ongoing study (NIH/NINR 1R01NR017853) of patients with cancer is collecting daily longitudinal count/rate data including numbers of pain flares per day and numbers of as needed pain medications taken per day. Data are being collected for each study participant over periods of up to five months long. A completed study (NIH/NINR RC1NR011591) collected numbers for cancer patients over 3 months of around the clock pain medications taken per day per dose, that is, the number of times a medication is taken in a day relative to the number of doses that are supposed to be taken in a day. Standard assumptions of means linear in time and dispersions constant over time are not always appropriate for such data. Also, a Poisson process assumption of independence over time needs not always hold. A model selection score needs to be defined for evaluating models for the data and for use in searches over alternative models. Times to conduct these searches need to be as short as possible, especially as the number of time measurements increases.

Approaches are presented for modeling mean counts/rates over time separately for each individual patient controlling for temporal correlation as well as for time-varying dispersions. These approaches use Poisson regression methods, because count/rate data are being modeled. Generalized estimating equations (GEE) methods [1] [2] provide a natural choice for modeling correlations for such count variables. However, standard GEE methods have limited value, because they assume constant dispersions. Furthermore, GEE methods avoid specification of likelihood functions, which are useful for generating model selection criteria. In what follows, two extensions of GEE methods are formulated and evaluated that address temporal correlation and time-varying means and dispersions for repeated count/rate measurements. A likelihood-like function, that is, a function used like a likelihood but which needs not integrate to 1, is defined and used in computing parameter estimates for these extensions along with a model selection criterion for comparing alternative models. Example analyses of selected individual cancer patient count/rate data are presented using adaptive regression methods [3] for identifying possibly nonlinear trajectories for means and dispersions of counts/rates over time while controlling for temporal correlation.

2. Modeling Individual Count/Rate Data

2.1. Generalized Linear Modeling of Means

Let
${y}_{t\left(i\right)}$ denote count values for an individual patient observed at *N* distinct times within a general set *T* of times, that is,
$t\left(i\right)\in T=\left\{t\left(i\right):1\le i\le N\right\}$. Combine these into the
$N\times 1$ vector
$y$. Let
${\mu}_{t\left(i\right)}=E{y}_{t\left(i\right)}$ denote associated mean or expected counts and combine these into the
$N\times 1$ vector
$\mu $. Denote the residuals as
${e}_{t\left(i\right)}={y}_{t\left(i\right)}-{\mu}_{t\left(i\right)}$ for
$t\left(i\right)\in T$ and combine these into the
$N\times 1$ vector
$e=y-\mu $. Let
${x}_{t\left(i\right),j}$ denote predictor values over times
$t\left(i\right)\in T$ and over predictors indexed by
$1\le j\le J$ and combine these into the
$J\times 1$ vector
${x}_{t\left(i\right)}$ with transpose denoted by
${x}_{t\left(i\right)}^{\text{T}}$ for
$t\left(i\right)\in T$. Let
$X$ be the
$N\times J$ matrix with rows
${x}_{t\left(i\right)}^{\text{T}}$ for
$1\le i\le N$. Let
$\beta $ denote the associated
$J\times 1$ vector of coefficient parameters. Use generalized linear models [4] [5] of
${\mu}_{t\left(i\right)}$ for
$t\left(i\right)\in T$ with natural log link function
$h\left(\mu \right)={\mathrm{log}}_{e}\left(\mu \right)$ so that
$h\left({\mu}_{t\left(i\right)}\right)={x}_{t\left(i\right)}^{\text{T}}\cdot \beta $ for
$t\left(i\right)\in T$. When
${x}_{t\left(i\right),1}=1$ for
$t\left(i\right)\in T$, the first entry of
$\beta $ is an intercept parameter. Treat each
${y}_{t\left(i\right)}$ as Poisson distributed so that its variance is
$V\left({\mu}_{t\left(i\right)}\right)={\mu}_{t\left(i\right)}$.

The counts ${y}_{t\left(i\right)}$ sometimes have associated totals ${Y}_{t\left(i\right)}>0$, and then the model for the mean counts ${\mu}_{t\left(i\right)}$ is converted to a model for the means ${{\mu}^{\prime}}_{t\left(i\right)}$ of the rates ${{y}^{\prime}}_{t\left(i\right)}={y}_{t\left(i\right)}/{Y}_{t\left(i\right)}$ using offsets ${o}_{t\left(i\right)}=\mathrm{log}\left({Y}_{t\left(i\right)}\right)$. Formally, replace ${x}_{t\left(i\right)}^{\text{T}}\cdot \beta $ by ${x}_{t\left(i\right)}^{\text{T}}\cdot \beta +{o}_{t\left(i\right)}$ so that mean counts are ${\mu}_{t\left(i\right)}=\mathrm{exp}\left({x}_{t\left(i\right)}^{\text{T}}\cdot \beta +{o}_{t\left(i\right)}\right)$ and then

${{\mu}^{\prime}}_{t\left(i\right)}=E{{y}^{\prime}}_{t\left(i\right)}={\mu}_{t\left(i\right)}/{Y}_{t\left(i\right)}=\mathrm{exp}\left({x}_{t\left(i\right)}^{\text{T}}\cdot \beta \right)$

are the mean rates.

2.2. Time-Varying Dispersions

Let ${{x}^{\prime}}_{t\left(i\right),j}$ denote predictor values over times $t\left(i\right)\in T$ and over predictors indexed by $1\le j\le {J}^{\prime}$ and combine these into the ${J}^{\prime}\times 1$ vectors ${{x}^{\prime}}_{t\left(i\right)}$ for $t\left(i\right)\in T$. Let ${X}^{\prime}$ be the $N\times {J}^{\prime}$ matrix with rows ${{x}^{\prime}}_{t\left(i\right)}^{\text{T}}$ for $1\le i\le N$. Let ${\beta}^{\prime}$ denote the associated ${J}^{\prime}\times 1$ vector of coefficient parameters. Let ${\phi}_{t\left(i\right)}$ denote dispersion values over times $t\left(i\right)\in T$ satisfying $\mathrm{log}\left({\phi}_{t\left(i\right)}\right)={{x}^{\prime}}_{t\left(i\right)}^{\text{T}}\cdot {\beta}^{\prime}$ and define the extended variances as

${\sigma}_{t\left(i\right)}^{2}={\phi}_{t\left(i\right)}\cdot V\left({\mu}_{t\left(i\right)}\right)={\phi}_{t\left(i\right)}\cdot {\mu}_{t(\; i\; )}$

and the extended standard deviations as ${\sigma}_{t\left(i\right)}={\phi}_{t\left(i\right)}^{1/2}\cdot {\mu}_{t\left(i\right)}^{1/2}$ for $t\left(i\right)\in T$. Informally, these quantities extend the usual Poisson variances and standard deviations through multiplication by dispersions. These are used to compute the standardized residuals $std{e}_{t\left(i\right)}={e}_{t\left(i\right)}/{\sigma}_{t\left(i\right)}$. Combine the extended standard deviations into the $N\times 1$ vector $\sigma $. When ${{x}^{\prime}}_{t\left(i\right),1}=1$ for $t\left(i\right)\in T$, the first entry ${{\beta}^{\prime}}_{1}$ of ${\beta}^{\prime}$ is an intercept parameter. The constant dispersion model corresponds to ${{x}^{\prime}}_{t\left(i\right),1}=1$ for $t\left(i\right)\in T$ with ${J}^{\prime}=1$. This is the dispersion model used in standard GEE modeling.

When offsets ${o}_{t\left(i\right)}$ are used to convert the model for the counts ${y}_{t\left(i\right)}$ to a model for the rates ${{y}^{\prime}}_{t\left(i\right)}$, they can also be added to the dispersions. The dispersions then satisfy $\mathrm{log}\left({\phi}_{t\left(i\right)}\right)={{x}^{\prime}}_{t\left(i\right)}^{\text{T}}\cdot {\beta}^{\prime}+{o}_{t\left(i\right)}$ so that the extended variances for the counts ${y}_{t\left(i\right)}$ are

${\sigma}_{t\left(i\right)}^{2}={\phi}_{t\left(i\right)}\cdot {\mu}_{t\left(i\right)}=\mathrm{exp}\left({{x}^{\prime}}_{t\left(i\right)}^{\text{T}}\cdot {\beta}^{\prime}\right)\cdot \mathrm{exp}\left({x}_{t\left(i\right)}^{\text{T}}\cdot \beta \right)\cdot {\mathrm{exp}}^{2}\left({o}_{t(\; i\; )}\right)$

and then the variances for the rates ${{y}^{\prime}}_{t\left(i\right)}$ are

${{\sigma}^{\prime}}_{t\left(i\right)}^{2}={\sigma}_{t\left(i\right)}^{2}/{T}_{t\left(i\right)}^{2}={{\phi}^{\prime}}_{t\left(i\right)}\cdot {{\mu}^{\prime}}_{t(\; i\; )}$

where

${{\phi}^{\prime}}_{t\left(i\right)}=\mathrm{exp}\left({{x}^{\prime}}_{t\left(i\right)}^{\text{T}}\cdot {\beta}^{\prime}\right).$

2.3. Modeling Correlations

Denote the covariance matrix for the count vector $y$ as $\Sigma $. Use the GEE approach [1] [2] to model the covariance matrix for $y$ as $\Sigma =\text{Diag}\left(\sigma \right)\cdot R\left(\rho \right)\cdot \text{Diag}\left(\sigma \right)$ where $\text{Diag}\left(\sigma \right)$ is the $N\times N$ diagonal matrix with diagonal entries ${\sigma}_{t\left(i\right)}$ for $t\left(i\right)\in T$ and $R\left(\rho \right)$ is a $N\times N$ correlation matrix with diagonal entries 1 and off diagonal entries ${R}_{t\left(i\right),t\left({i}^{\prime}\right)}$ for $1\le i\ne {i}^{\prime}\le N$ determined by a correlation parameter $\rho $ varying with the assumed correlation structure. Under independent (IND) correlations, ${R}_{t\left(i\right),t\left({i}^{\prime}\right)}={\rho}_{\text{IND}}=0$ for $1\le i\ne {i}^{\prime}\le N$. A Poisson process generates such correlations. Under exchangeable (EXCH) correlations, ${R}_{t\left(i\right),t\left({i}^{\prime}\right)}={\rho}_{\text{EXCH}}$ for $1\le i\ne {i}^{\prime}\le N$, that is, the correlations are constant. Under autoregressive of order 1 (AR1) correlations, ${R}_{t\left(i\right),t\left({i}^{\prime}\right)}={\rho}_{\text{AR1}}^{\left|t\left(i\right)-t\left({i}^{\prime}\right)\right|}$ for $1\le i\ne {i}^{\prime}\le N$ where $\left|t\left(i\right)-t\left({i}^{\prime}\right)\right|$ is the absolute value of $t\left(i\right)-t\left({i}^{\prime}\right)$. These differences are all assumed to be integers so that the correlations ${R}_{t\left(i\right),t\left({i}^{\prime}\right)}$ are all well-defined. In general, ${R}_{t\left(i\right),t\left({i}^{\prime}\right)}$ are spatial AR1 correlations. The special case of non-spatial AR1 correlations with $t\left(i\right)=i$ treats times as equally spaced. The parameter ${\rho}_{\text{AR1}}$ is called the autocorrelation.

2.4. Possible Extensions

The above formulation can be extended to address repeated measurements of types other than counts/rates and for multiple patients. More complex correlation structures based on multiple correlation parameters can also be considered. One such example is unstructured correlations with different correlations for different pairs of measurements, but this requires data from multiple patients to be reasonably estimated. These extensions are not addressed further.

3. Standard Generalized Estimating Equations Modeling

3.1. Notation and Parameter Estimation

Under standard GEE modeling, dispersions are treated as a constant ${\phi}_{0}$ so that the covariance matrix satisfies

$\Sigma ={\phi}_{0}\cdot \text{Diag}\left({V}^{1/2}\left(\mu \right)\right)\cdot R\left(\rho \right)\cdot \text{Diag}({V}^{1/2}(\; \mu \; ))$

where $V\left(\mu \right)$ is the $N\times 1$ vector with entries $V\left({\mu}_{t\left(i\right)}\right)={\mu}_{t\left(i\right)}$ for $t\left(i\right)\in T$. The generalized estimating equations are given by $g\left(\beta \right)=0$ where 0 is the $J\times 1$ vector with all zero entries, $g\left(\beta \right)={D}^{\text{T}}\cdot {\Sigma}^{-1}\cdot e$, and the $N\times J$ matrix $D=\partial \mu /\partial \beta $ with entries ${D}_{t\left(i\right),j}=\partial {\mu}_{t\left(i\right)}/\partial {\beta}_{j}={x}_{t\left(i\right),j}\cdot {\mu}_{t\left(i\right)}$ for $t\left(i\right)\in T$ and $1\le j\le J$. Let $H\left(\beta \right)=-{D}^{\text{T}}\cdot {\Sigma}^{-1}\cdot D$. Note that in the general GEE context with correlated outcomes for multiple subjects, the formulation for $g\left(\beta \right)$ would equal a sum of terms like ${D}^{\text{T}}\cdot {\Sigma}^{-1}\cdot e$ for each subject and $H\left(\beta \right)$ would equal a sum of terms like $-{D}^{\text{T}}\cdot {\Sigma}^{-1}\cdot D$ for each subject. Only one such term is needed here since data for only one subject/patient are being modeled. The GEE process for estimating $\beta $ iteratively solves $g\left(\beta \right)=0$ as follows. Given the current value ${\beta}_{u}$ for $\beta $, the next value is given by ${\beta}_{u+1}={\beta}_{u}-{H}^{-1}\left({\beta}_{u}\right)\cdot g\left({\beta}_{u}\right)$, thereby adapting Newton’s method with $g\left(\beta \right)$ in the role of the gradient vector and $H\left(\beta \right)$ in the role of the Hessian matrix.

The constant dispersion parameter ${\phi}_{0}$ is estimated using the Pearson residuals $P{e}_{t\left(i\right)}\left(\beta \right)={e}_{t\left(i\right)}/{V}^{1/2}\left({\mu}_{t\left(i\right)}\right)$ evaluated at a given value for the mean coefficient parameter vector $\beta $. The bias-adjusted estimate ${\phi}_{0}\left(\beta \right)$ of the dispersion parameter ${\phi}_{0}$ satisfies ${\phi}_{0}\left(\beta \right)={\displaystyle {\sum}_{i=1}^{N}P{e}_{t\left(i\right)}^{2}\left(\beta \right)/\left(N-J\right)}$ assuming $N-J>0$. Next, the correlation parameter $\rho \left(\beta \right)$ is estimated using standardized errors

$std{e}_{t\left(i\right)}\left(\beta \right)={\phi}_{0}^{-1/2}\left(\beta \right)\cdot P{e}_{t\left(i\right)}(\; \beta \; )$

for $t\left(i\right)\in T$ as follows. The IND correlation structure has no need for an estimate. For the EXCH correlation structure and a given value $\beta $ for the mean parameter vector, ${\rho}_{\text{EXCH}}$ can be estimated by

${\rho}_{\text{EXCH}}\left(\beta \right)={\displaystyle {\sum}_{i=1}^{N-1}{\displaystyle {\sum}_{{i}^{\prime}=i+1}^{N}std{e}_{t\left(i\right)}\left(\beta \right)\cdot std{e}_{t\left({i}^{\prime}\right)}\left(\beta \right)/\left(N\cdot \left(N-1\right)/2-J\right)}}$

assuming $N\cdot \left(N-1\right)/2-J>0$. For the AR1 correlation structure and a given value $\beta $ for the mean parameter vector, the autocorrelation ${\rho}_{\text{AR1}}$ can be estimated by

${\rho}_{\text{AR1}}\left(\beta \right)={\displaystyle {\sum}_{i=1}^{N-1}{\left(std{e}_{t\left(i\right)}\left(\beta \right)\cdot std{e}_{t\left(i+1\right)}\left(\beta \right)\right)}^{1/\left(\left|t\left(i\right)-t\left(i+1\right)\right|\right)}/\left(N-1-J\right)}$

assuming $N-1-J>0$. In the non-spatial AR1 special case,

${\rho}_{\text{AR1}}\left(\beta \right)={\displaystyle {\sum}_{i=1}^{N-1}\left(std{e}_{t\left(i\right)}\left(\beta \right)\cdot std{e}_{t\left(i+1\right)}\left(\beta \right)\right)/\left(N-1-J\right)}$

because $\left|t\left(i\right)-t\left(i+1\right)\right|=1$ for $1\le i<N$.

For any correlation structure, once the GEE estimate $\beta \left(T\right)$ of the coefficient parameter vector $\beta $ is computed using the observations indexed by $t\left(i\right)\in T$, the GEE estimate of the dispersion parameter ${\phi}_{0}$ is ${\phi}_{0}\left(T\right)={\phi}_{0}\left(\beta \left(T\right)\right)$. The GEE estimate of the correlation parameter $\rho $ is $\rho \left(T\right)=\rho \left(\beta \left(T\right)\right)$ computed using $\beta \left(T\right)$ and ${\phi}_{0}\left(T\right)$.

3.2. The Likelihood-Like Function

Let $\theta ={\left({\beta}^{\text{T}}{\phi}_{0}\right)}^{\text{T}}$ be the $\left(J+1\right)\times 1$ vector of the GEE mean and dispersion parameters. The correlation parameter $\rho $ is a function of $\beta $ and ${\phi}_{0}$ and so has not been included in $\theta $. Use the multivariate normal likelihood to define the likelihood-like function $L\left(T;\theta \right)$ satisfying

$\mathcal{l}\left(T;\theta \right)=\mathrm{log}\left(L\left(T;\theta \right)\right)=-{e}^{\text{T}}\cdot {\Sigma}^{-1}\cdot e/2-\mathrm{log}\left(\left|\Sigma \right|\right)/2-N\cdot \mathrm{log}\left(2\cdot \pi \right)/2$

where $\left|\Sigma \right|$ is the determinant of the covariance matrix $\Sigma $. The vector $\partial \mathcal{l}\left(T;\theta \right)/\partial \beta $ of partial derivatives of $\mathcal{l}\left(T;\theta \right)$ can be expressed as the sum of two terms. The first term corresponds to differentiating the residual vector part $e$ of $\mathcal{l}\left(T;\theta \right)$ with respect to $\beta $ holding the covariance part $\Sigma $ fixed in $\beta $ and equals $g\left(\beta \right)$, the gradient-like quantity used in standard GEE modeling. This fact seems to have been first recognized by Chaganty [6] . One advantage for having a likelihood-like function for GEE models is that it can be used to compute parameter estimates. Another is that it can be used to compute model selection criteria not otherwise available for GEE modeling.

3.3. Likelihood-Like Cross-Validation

Burman [7] defined *k*-fold cross-validation with observations partitioned into *k* disjoint subsets called folds. Fold observations are predicted using parameter estimates computed using the data from the other folds. In *k*-fold likelihood-like cross-validation (LCV), these deleted fold predictions are scored using the associated likelihood-like function *L*. Randomly partition the times
$t\left(i\right)\in T$ into *k* disjoint folds
$T\left(f\right)$ for
$1\le f\le k$. Use the same initial seed for randomization with all models under consideration so that their LCV scores are comparable. Let
$\theta \left(T\backslash T\left(f\right)\right)$ denote the estimate of
$\theta $ using the data with times in the complement
$T\backslash T\left(f\right)$ of the fold
$T\left(f\right)$. For
$1\le f\le k$, let
${T}^{+}\left(f\right)$ denote the union of all folds
$T\left({f}^{\prime}\right)$ for
$1\le {f}^{\prime}\le f$ with
${T}^{+}\left(0\right)$ the empty fold and set
$L\left({T}^{+}\left(0\right);\theta \right)=1$. Define the LCV score to satisfy

$\text{LCV}={\displaystyle {\prod}_{f=1}^{k}{\text{LCV}}_{f}^{1/N}}$

where ${\text{LCV}}_{f}$ is defined as the conditional likelihood-like term for the data in fold $T\left(f\right)$ conditioned on the data in the union ${T}^{+}\left(f-1\right)$ of the prior folds using the deleted estimate $\theta \left(T\backslash T\left(f\right)\right)$ of the parameter vector $\theta $. Formally,

$\begin{array}{c}{\text{LCV}}_{f}=L\left(T\left(f\right)|{T}^{+}\left(f-1\right);\theta \left(T\backslash T\left(f\right)\right)\right)\\ =L\left({T}^{+}\left(f\right);\theta \left(T\backslash T\left(f\right)\right)\right)/L\left({T}^{+}\left(f-1\right);\theta \left(T\backslash T\left(f\right)\right)\right)\end{array}$

Because fold assignment is random, folds can be empty when the number *k* of folds is large relative to the number *N* of measurements, and then those folds are dropped from the computation of the LCV score. Larger LCV scores indicate better models. Note that even if the full data are non-spatial with observations at consecutive integer times
$t\left(i\right)=i$ for
$1\le i\le N$, the folds
$T\left(f\right)$ and the fold unions
${T}^{+}\left(f\right)$ are not consecutive integer times except in rare cases and so require more general handling.

4. Incorporating Nonconstant Dispersions

4.1. Formulation

GEE modeling can be extended to handle nonconstant dispersions. Let
$\theta ={\left({\beta}^{\text{T}}{{\beta}^{\prime}}^{\text{T}}\right)}^{\text{T}}$ be the
$\left(J+{J}^{\prime}\right)\times 1$ vector of the mean and dispersion parameters. The definition of the likelihood-like function
$L\left(T;\theta \right)$ given for standard GEE holds using the more general parameter vector
$\theta $. Differentiate
$\mathcal{l}\left(T;\theta \right)=\mathrm{log}\left(L\left(T;\theta \right)\right)$ with respect to the vector
${\beta}^{\prime}$ of dispersion coefficient parameters while holding the correlation parameter
$\rho $ fixed in the current parameter vector
${\beta}^{\prime}$ to provide the
${J}^{\prime}$ estimating equations
$g\left({\beta}^{\prime}\right)={\partial}^{\prime}\mathcal{l}\left(T;\theta \right)/{\partial}^{\prime}{\beta}^{\prime}=0$ where the notation
${\partial}^{\prime}\mathcal{l}\left(T;\theta \right)/{\partial}^{\prime}{\beta}^{\prime}$ is used to indicate that this is not the full partial derivative vector for
$\mathcal{l}\left(T;\theta \right)$ in
${\beta}^{\prime}$ due to not accounting for the effect of
${\beta}^{\prime}$ on
$\rho $. Now, combine these with the *J* standard GEE equations
$g\left(\beta \right)=0$ to solve for joint estimates of
$\beta $ and
${\beta}^{\prime}$. Then, iteratively solve for

$g\left(\theta \right)={\left(g{\left(\beta \right)}^{\text{T}}g{\left({\beta}^{\prime}\right)}^{\text{T}}\right)}^{\text{T}}=0$

with $g\left(\theta \right)$ in the role of the gradient vector and the $\left(J+{J}^{\prime}\right)\times \left(J+{J}^{\prime}\right)$ matrix $H\left(\theta \right)$ in the role of the Hessian matrix. $H\left(\theta \right)$ has four component submatrices: the $J\times J$ matrix $H\left(\beta \right)$ for the mean coefficients as defined for standard GEE, the ${J}^{\prime}\times {J}^{\prime}$ matrix $H\left({\beta}^{\prime}\right)={\partial}^{\prime}g\left({\beta}^{\prime}\right)/{\partial}^{\prime}{\beta}^{\prime}$ for the ${J}^{\prime}$ dispersion coefficients, the $J\times {J}^{\prime}$ matrix $H\left(\beta ,{\beta}^{\prime}\right)={\partial}^{\prime}g\left(\beta \right)/{\partial}^{\prime}{\beta}^{\prime}$, and its transpose $H\left({\beta}^{\prime},\beta \right)=H{\left(\beta ,{\beta}^{\prime}\right)}^{\text{T}}$.

Note that

${\mathrm{log}}_{e}\left(\left|\Sigma \right|\right)={\mathrm{log}}_{e}\left(\left|R\left(\rho \right)\right|\right)+{{{\displaystyle \sum}}^{\text{}}}_{i=1}^{N}{\mathrm{log}}_{e}\left({\phi}_{t\left(i\right)}\right)+{{{\displaystyle \sum}}^{\text{}}}_{i=1}^{N}{\mathrm{log}}_{e}\left(V\left({\mu}_{t\left(i\right)}\right)\right),$

${\phi}_{t\left(i\right)}=\mathrm{exp}\left({{x}^{\prime}}_{t\left(i\right)}^{\text{T}}\cdot {\beta}^{\prime}\right)$

and

${e}^{\text{T}}\cdot {\Sigma}^{-1}\cdot e=std{e}^{\text{T}}\cdot {R}^{-1}\left(\rho \right)\cdot stde$

where $stde$ is the $N\times 1$ vector with entries $std{e}_{t\left(i\right)}={e}_{t\left(i\right)}/{\sigma}_{t\left(i\right)}$ for $t\left(i\right)\in T$. Consequently, $g\left({\beta}^{\prime}\right)$ has entries

${g}_{j}\left({\beta}^{\prime}\right)=stde{{x}^{\prime}}_{j}^{\text{T}}\cdot {R}^{-1}\left(\rho \right)\cdot stde-{\displaystyle {\sum}_{i=1}^{N}{{x}^{\prime}}_{t\left(i\right),j}/2}$

for $1\le j\le {J}^{\prime}$ where $stde{{x}^{\prime}}_{j}$ is the $N\times 1$ vector with entries

$stde{{x}^{\prime}}_{t\left(i\right),j}={{x}^{\prime}}_{t\left(i\right),j}\cdot std{e}_{t\left(i\right)}/2$

for $t\left(i\right)\in T$. $H\left({\beta}^{\prime}\right)$ has entries

${H}_{j.{j}^{\prime}}\left({\beta}^{\prime}\right)=-stdex{{x}^{\u2033}}_{j.{j}^{\prime}}^{\text{T}}\cdot {R}^{-1}\left(\rho \right)\cdot stde-stde{{x}^{\prime}}_{j}^{\text{T}}\cdot {R}^{-1}\left(\rho \right)\cdot stde{{x}^{\prime}}_{{j}^{\prime}}$

for $1\le j,{j}^{\prime}\le {J}^{\prime}$ where $stdex{{x}^{\u2033}}_{j.{j}^{\prime}}$ is the $N\times 1$ vector with entries

$stdex{{x}^{\u2033}}_{t\left(i\right)j,{j}^{\prime}}={{x}^{\prime}}_{t\left(i\right),j}\cdot {{x}^{\prime}}_{t\left(i\right),{j}^{\prime}}\cdot std{e}_{t\left(i\right)}/4$

for $t\left(i\right)\in T$. $H\left(\beta ,{\beta}^{\prime}\right)$ has columns

$\begin{array}{l}{H}_{j}\left(\beta ,{\beta}^{\prime}\right)=-{D}^{\text{T}}\cdot \text{Diag}\left(\sigma inv{{x}^{\prime}}_{j}\right)\cdot {R}^{-1}\left(\rho \right)\cdot stde\\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}-{D}^{\text{T}}\cdot \text{Diag}\left(1/\sigma \right)\cdot {R}^{-1}\left(\rho \right)\cdot stde{{x}^{\prime}}_{j}\end{array}$

where $\sigma inv{{x}^{\prime}}_{j}$ is the $N\times 1$ vector with entries $\sigma inv{{x}^{\prime}}_{t\left(i\right),j}={{x}^{\prime}}_{t\left(i\right),j}/\left(2\cdot {\sigma}_{t\left(i\right)}\right)$ for $t\left(i\right)\in T$ and $1\le j\le {J}^{\prime}$. If offsets are included, they are carried along in equations without any effect on derivatives.

4.2. Parameter Estimation

Given a value for the vector
$\theta $ of all coefficient parameters, an estimate of the correlation parameter
$\rho $ can be based on the associated standardized residuals
$std{e}_{t\left(i\right)}$. Calculate correlation estimates for the IND, EXCH, and AR1 correlation structures using the same formulas as before but computed with these more general standardized residuals. Iteratively solve
$g\left(\theta \right)=0$ as follows. Given the current value
${\theta}_{u}$ for
$\theta $, the next value is given by
${\theta}_{u+1}={\theta}_{u}-{H}^{-1}\left({\theta}_{u}\right)\cdot g\left({\theta}_{u}\right)$, thereby adapting Newton’s method with
$g\left(\theta \right)$ in the role of the gradient vector and
$H\left(\theta \right)$ in the role of the Hessian matrix. The solution to the estimating equations for observations indexed by *T* is denoted as
$\theta \left(T\right)={\left(\beta {\left(T\right)}^{\text{T}}{\beta}^{\prime}{\left(T\right)}^{\text{T}}\right)}^{\text{T}}$ with associated correlation estimate
$\rho \left(T\right)=\rho \left(\theta \left(T\right)\right)$.

5. Extended Linear Mixed Modeling

5.1. Formulation

GEE modeling can be further extended to handle full parameter estimation through maximizing the likelihood-like function. Let $\theta ={\left({\beta}^{\text{T}}{{\beta}^{\prime}}^{\text{T}}\rho \right)}^{\text{T}}$ be the $\left(J+{J}^{\prime}+1\right)\times 1$ vector of the mean, dispersion, and correlation parameters. The definition of the likelihood-like function $L\left(T;\theta \right)$ given for standard GEE holds using this more general parameter vector $\theta $. The likelihood-like function $L\left(T;\theta \right)$ is maximized in the coefficient parameter vector $\theta $ by solving the estimating equations

$g\left(\theta \right)=\partial \mathcal{l}\left(T;\theta \right)/\partial \theta =0$

where $\partial \mathcal{l}\left(T;\theta \right)/\partial \theta $ is the vector of standard partial derivatives of $\mathcal{l}\left(T;\theta \right)$. The associated matrix $H\left(\theta \right)=\partial g\left(\theta \right)/\partial \theta $. In this case, $g\left(\theta \right)$ is a true gradient vector and $H\left(\theta \right)$ a true Hessian matrix. This approach is extended linear mixed modeling in the sense that if the entries of $y$ were continuous variables treated as normally distributed with $V\left(\mu \right)=1$, then it would be exactly linear mixed modeling. Formulations given in what follows are adapted from those of [8] .

The gradient vector $g\left(\theta \right)={\left(g{\left(\beta \right)}^{\text{T}}g{\left({\beta}^{\prime}\right)}^{\text{T}}g\left(\rho \right)\right)}^{\text{T}}$. The gradient sub-vector $g\left({\beta}^{\prime}\right)=\partial \mathcal{l}\left(T;\theta \right)/\partial {\beta}^{\prime}$ has the same formulation as for extended GEE modeling, only now its entries are standard partial derivatives. The gradient subvector $g\left(\beta \right)=\partial \mathcal{l}\left(T;\theta \right)/\partial \beta $ has entries

${g}_{j}\left(\beta \right)=stde{x}_{j}^{\text{T}}\cdot {R}^{-1}\left(\rho \right)\cdot stde-{\displaystyle {\sum}_{i=1}^{N}{x}_{t\left(i\right),j}/2}$

where $stde{x}_{j}$ is the $N\times 1$ vector with entries

$stde{x}_{t\left(i\right),j}={x}_{t\left(i\right),j}\left({y}_{t\left(i\right)}+{\mu}_{t\left(i\right)}\right)/(2\cdot {\sigma}_{t(\; i\; )})$

for $t\left(i\right)\in T$ and $1\le j\le J$. The partial derivative $g\left(\rho \right)=\partial \mathcal{l}\left(T;\theta \right)/\partial \rho $ satisfies

$g\left(\rho \right)=-std{e}^{\text{T}}\cdot \partial {R}^{-1}\left(\rho \right)/\partial \rho \cdot stde/2-\partial \left(\mathrm{log}\left(\left|R\left(\rho \right)\right|\right)\right)/\partial \rho /2$

where

$\partial \left(\mathrm{log}\left(\left|R\left(\rho \right)\right|\right)\right)/\partial \rho =tr\left({R}^{-1}\left(\rho \right)\cdot \partial R\left(\rho \right)/\partial \rho \right)$,

*tr* denotes the trace function, and

$\partial {R}^{-1}\left(\rho \right)/\partial \rho =-{R}^{-1}\left(\rho \right)\cdot \partial R\left(\rho \right)/\partial \rho \cdot {R}^{-1}\left(\rho \right).$

For IND correlations, $\partial R\left(\rho \right)/\partial \rho =0$. For EXCH correlations, $\partial R\left(\rho \right)/\partial \rho $ is the $N\times N$ matrix with diagonal entries all equal to 0 and off-diagonal entries all equal to 1. For AR1 correlations, $\partial R\left(\rho \right)/\partial \rho $ is the $N\times N$ matrix with diagonal entries all equal to 0 and off-diagonal entries equaling

$\left|t\left(i\right)-t\left({i}^{\prime}\right)\right|\cdot {\rho}_{\text{AR1}}^{\left|t\left(i\right)-t\left({i}^{\prime}\right)\right|-1}$

in the ${i}^{th}$ row and ${{i}^{\prime}}^{th}$ column for $1\le i\ne {i}^{\prime}\le N$.

$H\left(\beta \right)$ has nine component submatrices: the $J\times J$ matrix $H\left(\beta \right)=\partial g\left(\beta \right)/\partial \beta $ for the mean parameters, the ${J}^{\prime}\times {J}^{\prime}$ matrix $H\left({\beta}^{\prime}\right)=\partial g\left({\beta}^{\prime}\right)/\partial {\beta}^{\prime}$ for the dispersion parameters computed as for extended GEE modeling, the second partial derivative $H\left(\rho \right)=\partial g\left(\rho \right)/\partial \rho $ for the correlation parameter, the $J\times {J}^{\prime}$ matrix $H\left(\beta ,{\beta}^{\prime}\right)=\partial g\left(\beta \right)/\partial {\beta}^{\prime}$, and its transpose $H\left({\beta}^{\prime},\beta \right)=H{\left(\beta ,{\beta}^{\prime}\right)}^{\text{T}}$, the $J\times 1$ vector $H\left(\beta ,\rho \right)=\partial g\left(\beta \right)/\partial \rho $ and its transpose $H\left(\rho ,\beta \right)=H{\left(\beta ,\rho \right)}^{\text{T}}$, and the ${J}^{\prime}\times 1$ vector $H\left({\beta}^{\prime},\rho \right)=\partial g\left({\beta}^{\prime}\right)/\partial \rho $ and its transpose $H\left(\rho ,{\beta}^{\prime}\right)=H{\left({\beta}^{\prime},\rho \right)}^{\text{T}}$. $H\left(\beta \right)$ has entries

${H}_{j,{j}^{\prime}}\left(\beta \right)=-stdex{x}_{j,{j}^{\prime}}^{\text{T}}\cdot {R}^{-1}\left(\rho \right)\cdot stde-stde{x}_{j}^{\text{T}}\cdot {R}^{-1}\left(\rho \right)\cdot stde{x}_{{j}^{\prime}}$

for $1\le j,{j}^{\prime}\le {J}^{\prime}$ where $stdex{x}_{j,{j}^{\prime}}$ is the $N\times 1$ vector with entries

$stdex{x}_{t\left(i\right)j,{j}^{\prime}}={x}_{t\left(i\right),j}\cdot {x}_{t\left(i\right),{j}^{\prime}}\cdot std{e}_{t\left(i\right)}/4$

for $t\left(i\right)\in T$. The second partial derivative $H\left(\rho \right)$ satisfies

$H\left(\rho \right)=-std{e}^{\text{T}}\cdot {\partial}^{2}{R}^{-1}\left(\rho \right)/\partial {\rho}^{2}\cdot stde/2-{\partial}^{2}\left(\mathrm{log}\left(\left|R\left(\rho \right)\right|\right)\right)/\partial {\rho}^{2}/2$

where

$\begin{array}{l}{\partial}^{2}\left(\mathrm{log}\left(\left|R\left(\rho \right)\right|\right)\right)/\partial {\rho}^{2}=-tr\left({R}^{-1}\left(\rho \right)\cdot \partial R\left(\rho \right)/\partial \rho \cdot {R}^{-1}\left(\rho \right)\cdot \partial R\left(\rho \right)/\partial \rho \right)\\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}+tr\left(R{\left(\rho \right)}^{-1}\cdot {\partial}^{2}R\left(\rho \right)/\partial {\rho}^{2}\right).\end{array}$

For IND and EXCH correlations, ${\partial}^{2}R\left(\rho \right)/\partial {\rho}^{2}=0$. For AR1 correlations, ${\partial}^{2}R\left(\rho \right)/\partial {\rho}^{2}$ is the $N\times N$ matrix with diagonal entries all equal to 0 and off-diagonal entries equaling

$\left|t\left(i\right)-t\left({i}^{\prime}\right)\right|\cdot \left(\left|t\left(i\right)-t\left({i}^{\prime}\right)\right|-1\right)\cdot {\rho}_{\text{AR1}}^{\left|t\left(i\right)-t\left({i}^{\prime}\right)\right|-2}$

in the ${i}^{th}$ row and ${{i}^{\prime}}^{th}$ column for $1\le i\ne {i}^{\prime}\le N$. $H\left({\beta}^{\prime},\beta \right)$ has entries

${H}_{j,{j}^{\prime}}\left({\beta}^{\prime},\beta \right)=-stdex{{x}^{\prime}}_{j,{j}^{\prime}}^{\text{T}}\cdot {R}^{-1}\left(\rho \right)\cdot stde-stde{x}_{j}^{\text{T}}\cdot {R}^{-1}\left(\rho \right)\cdot stde{{x}^{\prime}}_{{j}^{\prime}}$

for $1\le j,{j}^{\prime}\le {J}^{\prime}$ where $stdex{{x}^{\prime}}_{j,{j}^{\prime}}$ is the $N\times 1$ vector with entries

$stdex{{x}^{\prime}}_{t\left(i\right)j,{j}^{\prime}}={{x}^{\prime}}_{t\left(i\right),{j}^{\prime}}\cdot stde{x}_{t\left(i\right),j}/2$

for $t\left(i\right)\in T$, $1\le j\le J$, and $1\le {j}^{\prime}\le {J}^{\prime}$. $H\left(\beta ,\rho \right)$ has entries

${H}_{j}\left(\beta ,\rho \right)=stde{x}_{j}^{\text{T}}\cdot \partial R{\left(\rho \right)}^{-1}/\partial \rho \cdot stde$

for $1\le j\le J$. $H\left({\beta}^{\prime},\rho \right)$ has entries

${H}_{j}\left({\beta}^{\prime},\rho \right)=stde{{x}^{\prime}}_{j}^{\text{T}}\cdot \partial {R}^{-1}\left(\rho \right)/\partial \rho \cdot stde$

for $1\le j\le {J}^{\prime}$.

5.2. Parameter Estimation

The parameter vector $\theta $ is estimated by iteratively solving $g\left(\theta \right)=0$ as follows. Given the current value ${\theta}_{u}$ for $\theta $, the next value is given by

${\theta}_{u+1}={\theta}_{u}-{H}^{-1}\left({\theta}_{u}\right)\cdot g\left({\theta}_{u}\right)$,

thereby using Newton’s method with gradient vector
$g\left(\theta \right)$ and Hessian matrix
$H\left(\theta \right)$. The estimation process can be stopped early if
$\mathcal{l}\left(T;{\theta}_{u+1}\right)$ does not increase by much compared to
$\mathcal{l}\left(T;{\theta}_{u}\right)$. The solution to the estimating equations for observations indexed by *T* is denoted as
$\theta \left(T\right)={\left(\beta {\left(T\right)}^{\text{T}}{\beta}^{\prime}{\left(T\right)}^{\text{T}}\rho \left(T\right)\right)}^{\text{T}}$.

The covariance matrix for the parameter estimate vector
$\theta \left(T\right)$ can be computed as
$-{H}^{-1}\left(\theta \left(T\right)\right)$ and the variances corresponding to its diagonal entries can be used to compute *z* tests of zero individual model parameters. These are useful for fixed models of theoretical importance. On the other hand, tests for parameters of adaptively generated models (as described in Section 6) are usually significant as a consequence of the model selection process, and so results for these tests are not reported for models generated in the example analyses.

6. Modeling Possibly Nonlinear Means and Dispersion over Time

Knafl and Ding [3] provide a detailed formulation for adaptively searching through alternative regression models for means and dispersions in a variety of contexts using adaptive fractional polynomial models [9] . A brief overview is provided here. These methods are used in the example analyses of individual cancer patient count/rate data presented later. Model selection proceeds through two phases. The expansion phase first grows the model adding in alternative power transforms of predictors for means and dispersions. The contraction phase then reduces the model to a parsimonious set of power transforms by removing transforms from the current model one at a time and adjusting the powers of the remaining transforms. Alternative models are evaluated using LCV scores. The modeling process is controlled by tolerance parameters indicating how much of a reduction in the LCV score can be tolerated at given stages of the process. Knafl and Ding [3] also provide a wide variety of example analyses demonstrating the usefulness of these adaptive regression methods. A description of these methods in the standard Poisson regression context is provided in [10] .

A SAS^{®} (SAS Institute, Inc., Cary, NC) macro has been developed for generating adaptive analyses including the reported example analyses. This macro as well as data and code used to generate the results of the example analyses are available from the first author.

7. Example Analyses

7.1. Pain Flare Counts per Day

Figure 1 displays pain flare counts for Cancer Patient 1 over a period of 34 days.

Figure 1. Pain flare counts over time for cancer patient 1.

Pain flares range from 0 to 4 per day and tend to increase over time. Data are available for *N* = 33 days with a missing value for one day (day 33). These data were collected using Ecological Momentary Assessment (EMA) [11] as implemented in the mEMA app [12] .

Table 1 contains results for adaptive models for means and dispersions of pain flare counts over time using the two modeling approaches extended GEE modeling and extended linear mixed modeling and the three correlation structures IND, AR1, and EXCH. Power transforms reported in Table 1 were generated by adaptively searching through alternative power transforms using the methods described in Section 6. LCV scores are based on *k* = 5 folds with fold sizes ranging from 2 to 8 measurements and no empty folds. For extended GEE modeling, IND correlations generate the best LCV score 0.38018 over the three correlation structures. For extended linear mixed modeling, IND correlations also generate the best LCV score 0.40622. These results suggest that a Poisson process assumption is reasonable for these pain flare counts.

Extended linear mixed modeling generates better LCV scores than extended GEE for all three correlation structures. Moreover, computation times are much shorter ranging from 0.4 to 1.2 minutes compared to 13.9 to 35.5 minutes. These results suggest that extended linear mixed modeling is preferable for modeling these pain flare counts because it generates better LCV scores in less time. Consequently, only extended linear mixed modeling using IND correlations is considered further for these data, generating the model with means based on
$t{\left(i\right)}^{0.49}$ without an intercept and dispersions based on
$t{\left(i\right)}^{8.37}$ and
$t{\left(i\right)}^{0.5}$ without an intercept. Figure 2 displays estimates of mean pain flare counts over time along with unit error bands over time (*i.e.*, the mean ±1 extended standard deviation at each time) to account for variability about the means. Mean pain flare counts increase over time, somewhat close to linearly in time. Variability in pain flare counts is smaller in the middle of the period, somewhat larger at the start of the period and even larger at the end of the period. Figure 3 displays

Figure 2. Mean pain flare counts (middle curve) with unit error bounds for cancer patient 1.

Figure 3. Standardized residuals for the model of Figure 2.

Table 1. Adaptive models for means and dispersions of pain flare counts over time for alternative modeling approaches and correlation structures.

AR1—autoregressive of order 1; EXCH—exchangeable; GEE—generalized estimating equations; IND—independent; LCV—likelihood-like cross-validation; LMM—linear mixed modeling. a. The *i*^{th} time value is denoted as *t*(*i*). A 1 corresponds to an intercept parameter; otherwise, the model has a zero intercept. b. Difference in minutes of clock times between the start and end of computations.

standardized residuals for this model, which range between ±2 without any extreme outliers, suggesting the model is a reasonable fit for these data.

The associated model generated using *k* = 10 folds has similar means based on
$t{\left(i\right)}^{0.5}$ without an intercept and simpler dispersions based on
$t{\left(i\right)}^{0.2}$ without an intercept. However, the 10-fold LCV score 0.38107 is smaller, suggesting that *k* = 5 is a better choice for these data. Moreover, there is one empty fold, suggesting that the choice of *k* = 10 folds is too large for these data with only *N* = 33 measurements. The associated model generated with *k* = 5 folds and assuming constant dispersions has a similar model for the means based on
$t{\left(i\right)}^{0.53}$ without an intercept but a smaller LCV score 0.37031, suggesting that the dispersions for these data are reasonably treated as nonconstant over time.

7.2. As Needed Pain Medications Taken Counts per Day

Figure 4 displays as needed pain medications taken counts for Cancer Patient 2 over a period of 100 days. As needed pain medications taken counts range from 0 to 4 per day and tend to decrease over time. Data are available for *N* = 92 days with a missing value for eight other days (days 4, 14, 52, 56, 74, 81, 85, and 89). These data were also collected using the mEMA app.

Table 2 contains results for adaptive models for means and dispersions of as needed pain medications taken counts over time using the two modeling approaches extended GEE modeling and extended linear mixed modeling and the three correlation structures IND, AR1, and EXCH. LCV scores are based on *k* = 5 folds with fold sizes ranging from 13 to 21 measurements with no empty folds. For extended GEE modeling, AR1 correlations generate the best LCV score 0.41497 over the three correlation structures. For extended linear mixed modeling, AR1 correlations generate the best LCV score 0.40509. These results indicate that a Poisson process assumption may not be appropriate for these as needed pain medications taken counts.

Figure 4. As needed pain medications taken counts over time for cancer patient 2.

Table 2. Adaptive models for means and dispersions of as needed pain medications taken counts over time for alternative modeling approaches and correlation structures.

AR1—autoregressive of order 1; EXCH—exchangeable; GEE—generalized estimating equations; IND—independent; LCV—likelihood-like cross-validation; LMM—linear mixed modeling. a. The *i*^{th} time value is denoted as *t*(*i*). A 1 corresponds to an intercept parameter; otherwise, the model has a zero intercept. b. Difference in minutes of clock times between the start and end of computations.

Extended GEE modeling generates a better LCV score than extended linear mixed modeling for the IND correlation structure, but the scores for these two approaches are not too different. Extended linear mixed modeling generates a better LCV score than extended GEE modeling for the EXCH correlation structure. Extended GEE modeling generates a better LCV score than extended linear mixed modeling for the AR1 correlation structure. Although this is the best overall LCV score, the associated model for extended linear mixed modeling is more parsimonious with an intercept and one time transform for the means compared to two time transforms and constant dispersions compared to dispersions based on and intercept and one time transform. Moreover, computation times are substantially shorter for extended linear mixed modeling ranging from 0.5 to 1.7 minutes compared to 84.5 to 222.7 minutes or 1.4 to 3.7 hours. These results suggest that extended linear mixed modeling is preferable for modeling these as needed pain medications taken counts because it generates competitive or better scores or more parsimonious models in substantially less time. Consequently, only extended linear mixed modeling using AR1 correlations are considered further for these data, generating the model with means based on
$t{\left(i\right)}^{0.4}$ with an intercept, constant dispersions based on an intercept, and estimated autocorrelation
${\rho}_{\text{AR1}}=0.45$. Figure 5 displays estimates of mean as needed pain medications taken counts over time along with unit error bands over time (*i.e.*, the mean ±1 extended standard deviation at each time) to account for variability about the means. Mean as needed pain medications taken counts decrease nonlinearly over time. Variability in as needed pain medications taken counts is close to constant over time. Figure 6 displays standardized residuals for this model, which range well within ±3 without any extreme outliers, suggesting the model provides a reasonable fit to these data.

Figure 5. Mean as needed pain medications taken counts (middle curve) with unit error bounds for cancer patient 2.

Figure 6. Standardized residuals for the model of Figure 5.

The associated model generated using *k* = 10 folds is about the same with means based on
$t{\left(i\right)}^{0.5}$ with an intercept, constant dispersions, and estimated correlation
${\rho}_{\text{AR1}}=0.45$. The 10-fold LCV score 0.40958 is larger, suggesting that *k* = 10 is a better choice for these data. There are no empty folds. The associated model generated using *k* = 15 folds is similar with means based on
$t{\left(i\right)}^{0.5}$ with an intercept, dispersions based on
$t{\left(i\right)}^{0.07}$ without an intercept, and estimated correlation
${\rho}_{\text{AR1}}=0.46$. The 15-fold LCV score 0.40353 is smaller, suggesting that *k* = 10 is a better choice for these data. There are no empty folds. The associated model generated with *k* = 15 folds assuming constant dispersions has means based on
$t{\left(i\right)}^{0.5}$ with an intercept and close 15-fold LCV score 0.40318. Consequently, models generated by 5, 10, and 15 folds using extended linear mixed modeling are not too different, suggesting that the results are reasonably robust to the choice of the number of folds.

7.3. Around the Clock Pain Medications Taken Rates per Day per Dose

Adherence data for around the clock pain medications were collected using pill bottles equipped with Medication Event Monitoring System (MEMS) devices (AARDEX North America, Boulder, CO) that recorded the date and time of each pill bottle opening and presumably of the taking of the pain medication [13] [14] . Cancer Patient 3 was monitored for a period of 91 days. Counts of around the clock pain medications taken were computed for 30 equal-sized subperiods of 3.03 days each, ranging from 0 to 18. Around the clock pain medications were to be taken five times a day by this patient. Methods for modeling such data assuming the special case of a Poisson process with constant dispersions are provided in [10] . Figure 7 displays around the clock pain medications taken rates per day per dose for Cancer Patient 3. The ideal rate of 1 means that the patient took around the clock pain medications at the appropriate rate over the associated time subperiod. Around the clock pain medications taken rates range from 0 to 1.19 per day per dose and tend to decrease over time. Data are available for *N* = 30 subperiods with none missing.

Table 3 contains results for adaptive models for means and dispersions of around the clock pain medications taken rates over time using the two modeling approaches extended GEE modeling and extended linear mixed modeling and the three correlation structures IND, AR1, and EXCH. LCV scores are based on *k* = 5 folds with fold sizes ranging from 2 to 8 measurements with no empty folds. For extended GEE modeling, EXCH correlations generate the best LCV score 0.051583 over the three correlation structures. For extended linear mixed modeling, AR1 correlations generate the best LCV score 0.053856, which is also the best overall LCV score for Table 3 models. For both modeling approaches, the LCV score for IND correlations is quite a bit smaller than the best LCV score over the three correlation structures. These results indicate that a Poisson process assumption may not be appropriate for these around the clock pain medications taken rates.

Figure 7. Around the clock pain medications taken rates per day per dose over time for cancer patient 3.

Table 3. Adaptive models for means and dispersions of around the clock pain medications taken rates per day per dose over time for alternative modeling approaches and correlation structures.

AR1—autoregressive of order 1; EXCH—exchangeable; GEE—generalized estimating equations; IND—independent; LCV—likelihood-like cross-validation; LMM—linear mixed modeling. a. The *i*^{th} time value is denoted as *t*(*i*). A 1 corresponds to an intercept parameter; otherwise, the model has a zero intercept. b. Difference in minutes of clock times between the start and end of computations.

Extended linear mixed modeling generates better LCV scores than extended GEE for the IND and AR1 correlation structures. Its LCV score is smaller for the EXCH correlation structure, but its model is more parsimonious based on one time transform for the means with constant dispersions compared to one time transform plus an intercept for the means with constant dispersions. Furthermore, computation times are much shorter for extended linear mixed modeling ranging from 0.2 to 0.7 minutes compared to 5.1 to 11.9 minutes. These results suggest that extended linear mixed modeling is preferable for modeling these around the clock pain medications taken rates because it generates the best LCV score in less time. Consequently, only extended linear mixed modeling using AR1 correlations are considered further for these data, generating the model with means based on
$t{\left(i\right)}^{1.1}$ without an intercept, dispersions based on
$t{\left(i\right)}^{6.1}$ with an intercept, and estimated autocorrelation
${\rho}_{\text{AR1}}=0.75$. Figure 8 displays estimates of mean around the clock pain medications taken rates over time along with unit error bands over time (*i.e.*, the mean ±1 extended standard deviation at each time) to account for variability about the means. Mean around the clock pain medications taken counts decrease close to linearly over time. Variability in around the clock pain medications taken rates is larger at the end of the period. Figure 9 displays standardized residuals for this model, which range well within ±3 without any extreme outliers, suggesting the model provides a reasonable fit to these data.

The associated model generated using *k* = 10 folds is somewhat similar with means based on
$t{\left(i\right)}^{0.4}$ without an intercept, constant dispersions based on an intercept, and estimated autocorrelation
${\rho}_{\text{AR1}}=0.76$. However, the 10-fold LCV score 0.052023 is smaller, suggesting that *k* = 5 is a better choice for these data. Moreover, there is one empty fold, suggesting that the choice of *k* = 10

Figure 8. Mean around the clock pain medications taken rates per day per dose (middle curve) with unit error bounds for cancer patient 3.

Figure 9. Standardized residuals for the model of Figure 8.

folds is too large for these data with only *N* = 30 measurements. The associated model generated with *k* = 5 folds and assuming constant dispersions has a model for the means based on based on
$t{\left(i\right)}^{1.01}$ without an intercept, an autocorrelation estimate of
${\rho}_{\text{AR1}}=0.75$, and a smaller LCV score 0.0.050386, suggesting that the dispersions for these data are reasonably treated as nonconstant over time.

8. Discussion

8.1. Summary

Methods are formulated for modeling individual patient count/rate data over time allowing for nonlinear trajectories for means, time-varying dispersions, and temporal correlation. Three correlation structures are considered including IND, EXCH, and spatial AR1 correlations. Two extensions of standard GEE modeling are considered. Extended GEE modeling augments standard GEE mean parameter estimating equations with dispersion parameter estimating equations while using the GEE approach for correlation parameter estimation. Extended linear mixed modeling estimates all model parameters using estimating equations for mean, dispersion, and correlation parameters. These new estimating equations are determined by partial derivatives of a likelihood-like function based on the multivariate normal density. This likelihood-like function is also used to define a likelihood-like cross-validation (LCV) score for evaluating models. LCV scores are used to control adaptive regression modeling of possibly nonlinear means and dispersions over time. It is also possible to generate penalized likelihood-like criteria for model selection generalizing standard penalized likelihood criteria [15] such as the commonly used Akaike information criterion (AIC) and Bayesian information criterion (BIC). Pan [16] has formulated a penalized model selection criterion related to the AIC called the quasi-likelihood information criterion (QIC) for GEE model selection, but the QIC score does not fully account for the correlation structure. Model selection criteria based on the likelihood-like function fully account for the correlation structure.

Example analyses using these methods are provided using three types of count/rate data for individual cancer patients including cancer pain flares per day, as needed cancer pain medications taken per day, and around the clock cancer pain medications taken per day per dose. Extended linear mixed modeling generates models with either better LCV scores or more parsimonious models than extended GEE modeling. Moreover, times to compute models are substantially smaller for extended linear mixed modeling than for extended GEE modeling. Time differences can be extreme for even moderate samples sizes, for example, analyses for the second example data set with 92 observations required at most 1.7 minutes for extended linear mixed modeling compared to up to 3.7 hours for extended GEE. These results indicate that extended linear mixed modeling is preferable for modeling individual patient count/rate data over time. This is likely to hold in more general modeling situations with other types of data and for combined data for multiple patients.

8.2. Alternative Approaches

The formulation provided here assumes that separate modeling of each patient’s longitudinal data is preferable to modeling the combined data for all patients. Separate modeling is a person-centered approach to modeling longitudinal data as opposed to a variable-centered approach using the combined data [17] [18] . This is only feasible when there are substantial numbers of time measurements for each patient. Modeling the combined data for all patients typically involves the assumption that means and dispersions for all patients are reasonably treated as having the same functional form. Knafl *et al.* [10] provide an example where this is not an appropriate assumption for a specific set of data on medication taken rates per day for HIV patients on antiretroviral medications. In any case, the methods considered here generalize to handle combined data for multiple patients, not only count/rate longitudinal data but also continuous and dichotomous longitudinal data, and not just data for cancer patients.

Multilevel (or hierarchical linear) modeling [19] [20] could alternatively be used to provide for individual patient differences, but that usually accounts for nonlinearity using polynomial models, often simple quadratic models. Polynomial models can be too simplistic for addressing general nonlinearity. Knafl and Ding [3] provide an example for independent data where the polynomial model generating the best LCV score for degrees 0-3 is the degree 0 constant model, but a nonlinear adaptive regression model generates a much better LCV score. Future research is needed to investigate general nonlinearity using adaptive regression methods applied to multilevel models as well as to random effects models [21] and to generalized linear mixed models [22] .

Spatial AR1 correlations generate better models than independent and exchangeable correlations for two of the three example data sets. This suggests consideration of autoregressive and/or moving average correlations [23] of orders more than 1. As the number of time points increases, even relatively large autocorrelations can generate small correlations for larger distances apart. For example, the third example data set had an estimated autocorrelation of ${\rho}_{\text{AR1}}=0.75$ with integer time measurements ranging from 1 to 30 so that the smallest correlation is ${0.75}^{29}=0.0002$. The second example data set had an even smaller estimated autocorrelation of ${\rho}_{\text{AR1}}=0.45$ with an even larger range of 1 to 100 integer time measurements so that the smallest correlation is ${0.45}^{99}=4.7\times {10}^{-35}$. These results suggest consideration of banded correlation autoregressive structures with zero correlations for measurements further apart than some fixed amount.

Acknowledgements

This work was supported in part by the National Institutes of Health/National Institute of Nursing Research Awards 1R01NR017853 and RC1NR011591. S. H. Meghani was the Principal Investigator for these research projects. G. J. Knafl was a consultant on both projects.

References

[1] Liang, K. and Zeger, S.L. (1986) Longitudinal Data Analysis Using Generalized Linear Models. Biometrika, 73, 13-22.

https://doi.org/10.1093/biomet/73.1.13

[2] Hardin, J. and Hilbe, J. (2013) Generalized Estimating Equations. 2nd Edition, Chapman and Hall/CRC, New York.

https://doi.org/10.1201/b13880

[3] Knafl, G.J. and Ding, K. (2016) Adaptive Regression for Modeling Nonlinear Relationships. Springer International Publishing, Switzerland.

[4] McCullagh, P. and Nelder, J.A. (1999) Generalized Linear Models. 2nd Edition, Chapman & Hall/CRC, Boca Raton.

[5] Wedderburn, R.W.M. (1974) Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss—Newton Method. Biometrika, 61, 439-447.

https://doi.org/10.1093/biomet/61.3.439

[6] Chaganty, N.R. (1997) An Alternative Approach to the Analysis of Longitudinal Data via Generalized Estimating Equations. Journal of Statistical Planning and Inference, 63, 39-54.

https://doi.org/10.1016/S0378-3758(96)00203-0

[7] Burman, P. (1989) A Comparative Study of Ordinary Cross-Validation, ν-Fold Cross-Validation and the Repeated Learning-Testing Methods. Biometrika, 76, 503-514.

https://doi.org/10.1093/biomet/76.3.503

[8] Wolfinger, R., Tobias, R. and Sall, J. (1994) Computing Gaussian Likelihoods and Their Derivatives for General Linear Mixed Models. Siam Journal on Scientific Computing, 15, 1294-1310.

https://doi.org/10.1137/0915079

[9] Knafl, G.J. (2018) Adaptive Fractional Polynomial Modeling. Open Journal of Statistics, 8, 159-186.

https://doi.org/10.4236/ojs.2018.81011

[10] Knafl, G.J., Fennie, K.P., Bova, C. and Williams, A.B. (2004) Electronic Monitoring Device Event Modeling on an Individual-Subject Basis Using Adaptive Poisson Regression. Statistics in Medicine, 23, 783-801.

https://doi.org/10.1002/sim.1624

[11] Shiffman, S., Stone, A.A. and Hufford, M.R. (2008) Ecological Momentary Assessment. Annual Review of Clinical Psychology, 4, 1-32.

https://doi.org/10.1146/annurev.clinpsy.3.022806.091415

[12] Ilumivi (n.d.) Software for Humanity.

https://ilumivu.com/solutions/ecological-momentary-assessment-app/

[13] Meghani, S.H. and Knafl, G.J. (2016) Patterns of Analgesic Adherence Predict Health Care Utilization among Outpatients with Cancer Pain. Patient Preference Adherence, 10, 81-98.

https://doi.org/10.2147/PPA.S93726

[14] Meghani, S.H., Persico, A., Fudin, J. and Knafl, G.J. (2021) Gaps in the Use of Long-Acting Opioids within Intervals of Consecutive Days among Cancer Outpatients Using Electronic Pill Caps, Pain Medicine, 22, 687-693.

https://doi.org/10.1093/pm/pnaa273

[15] Sclove, S.L. (1987) Application of Model-Selection Criteria to Some Problems in Multivariate Analysis. Psychometrika, 52, 333-343.

https://doi.org/10.1007/BF02294360

[16] Pan, W. (2001) Akaike’s Information Criterion in Generalized Estimating Equations. Biometrics, 57, 120-125.

https://doi.org/10.1111/j.0006-341X.2001.00120.x

[17] Bergmann, L.R. and Trost, K. (2006) The Person-Centered Versus the Variable-Centered Approach: Are They Complementary, Opposites, or Exploring Different Worlds? Merrill-Palmer Quarterly (Wayne State University Press), 52, 601-632.

https://doi.org/10.1353/mpq.2006.0023

[18] Laursen, B. and Hoff, E. (2006) Person-Centered and Variable-Centered Approaches to Longitudinal Data. Merrill-Palmer Quarterly (Wayne State University Press), 52, 377-389.

https://doi.org/10.1353/mpq.2006.0029

[19] Bryk, S.W. and Raudenbush. A.S. (2002) Hierarchical Linear Models: Applications and Data Analysis Methods. 2nd Edition, Sage Publications, Thousand Oaks.

[20] Luke, D.A. (2004) Multilevel Modeling. Sage Publications, Thousand Oaks.

https://doi.org/10.4135/9781412985147

[21] Laird, N.M. and Ware, J.H. (1982) Random-Effects Models for Longitudinal Data. Biometrics, 38, 963-974.

https://doi.org/10.2307/2529876

[22] Stroup, W.W. (2012) Generalized Linear Mixed Models: Modern Concepts, Methods and Applications. Chapman & Hall/CRC, Boca Raton.

[23] Box, G.E.P., Jenkins, G.M. and Reinsel, G.C. (2008) Time Series Analysis: Forecasting and Control. 4th Edition, John Wiley & Sons, Hoboken.