Many research studies seek to predict related outcomes given a set of independent variables or to quantify the relationship between them. In certain instances, the outcome of interest is ordinal. Ordinal variables are defined as having distinct ordered levels; however, the distance between the levels cannot be ascertained. An example of an ordinal variable is cancer stage. Take, for instance, testicular seminoma, a germ cell tumor in the sperm of the testes  . This cancer can be classified according to stage. The stages are:
1) Tumor stage 1, cancer has not spread beyond the testicle.
2) Tumor stage 2, cancer has spread to the blood or lymphatic vessels.
3) Tumor stage 3, cancer has spread beyond the lymphatic and blood vessels nodes to the spermatic cord.
4) Tumor stage 4, cancer has spread beyond previously mentioned areas to other parts of the body.
The ordering of categories is evident. The aim of statistical and machine learning models is to quantify the relationship between covariates and associated outcome so that one can predict the outcome variable and assess the relationship between the two with statistical significance. The range of ordinal outcome models includes cumulative logit, proportional odds model, adjacent-category logit  , and stereotype logit  . These procedures assume there are more observations than independent variables, or covariates. Another assumption is the resulting parameter estimates follow a normal distribution.
In addition, we now live in an era of high dimensional data, and massive amounts of information are being collected  . These data are used to better understand and analyze related issues. However, this comes at a cost, and traditional methods are ill-equipped to utilize these datasets. These data may have more variables than observations. The distributions of the parameter estimates may not follow a normal distribution. We may collect genetic data, demographic data, and clinical data, resulting in an analysis data set containing thousands of variables with a few hundred observations when evaluating health conditions with associated ordinal outcomes  ; the distribution of the parameter estimates may not follow a normal, or other known, distribution.
Penalized ordinal outcome models were developed to analyze high dimensional data with ordinal outcomes. Some of these modeling schemes are glmnetcr  , ordinalgmifs  , and penalized stereotype logit models  . Apart from the stereotype logit, these models are linear in that the objective cost functions can be represented as a linear combination of parameters estimates; these models also assume the cost function is convex. For the stereotype logit, which includes nonlinear combinations of parameter estimates, optimization algorithms that assumed linearity, and function convexity, were applied   . A nonlinear and nonconvex approach to optimize the cost function of the penalized stereotype logit should be explored.
This study investigates the extension of a previously developed elastic net penalized stereotype logit  . We add an elastic net penalty  to the stereotype logit model  . To optimize the penalized function, we use the Adam optimizer  which is suited to nonlinear functions. This, in turn, allows us to evaluate the prediction accuracy of the model, which we were not able to do previously. The updated modeling procedure is presented with first order derivatives, optimization procedure, and a bootstrap resampling scheme to assess variable importance. Said modeling procedure is applied to simulated and real-world datasets with reported results. The proposed method is compared with the ordinalgmifs implemented L1 penalized stereotype logit  , an existing method for analyzing high dimensional data with an ordinal outcome.
For a given observation i (there are a total of n observations), denote the outcome vector as where if for that observation, the outcome is in the jth category, and all other entries are set to 0. There are J possible outcomes. Denote the vector as the covariate vector consisting of p values. The log of the information entropy, based on a multinomial distribution, is represented as
The log of the odds ratio, with level J being the reference level, , is represented as . Therefore, are now modeled as
This representation is known as the stereotype logit  . For each ordered level, the effect of the independent variables is equal to an overall effect, , multiplied by a value , which is referred to as the intensity parameter. Primarily, as we are concerned with modeling the log of the odds ratios, , the log of the information entropy can now be written as
2.1. Elastic Net Penalized Stereotype Logit
We take the log of the information entropy, with a stereotype logit parameterization, and add an elastic net penalty  . A penalty on the sum of the squared and absolute values of the parameters is enforced   . For a set of parameters, represented in a p length vector , the elastic net penalty is defined as
where can vary and n is the sample size of the dataset. For this study, was set to 0.5. The goal of the elastic net is to penalize large values of the parameter estimates, forcing their magnitude to decrease in proportion to their size. During optimization, the system is forced shrink the parameter estimates’ size when finding an optimal solution.
Based on the log of the information entropy, we are concerned with finding estimates for parameters, , , and such that
where denotes the vector of length J − 1 containing the intercepts for the J − 1 logits and denotes the vector on length J − 1 containing the intensity parameters. In addition, minimizing the negative log entropy is equivalent to maximizing the log entropy, and we will work with the negative representation. Therefore, after imposing the elastic net penalty, we are concerned with finding parameter estimates such that:
In machine learning, for any given model there is usually a hyperparameter set, a set of parameters that is not optimized over but whose choice of values affects the final solution. In machine learning, if there are multiple hyperparameters for any optimization procedure, there is still no established method to select these to optimize the function with respect to the parameters  . For the purposes of this manuscript, the hyperparameters were set to given values (a small range of values was considered with the optimal set being selected); λ was set to 0.001. The partial first derivatives with respect to αj, βk, and ϕj are presented.
Denote the full parameter set β, α, and ϕ as ψ. The partial derivatives are vectorized (placed into one vector) and are represented by a derivative vector, denoted .
2.2. Adam Optimization
The implemented Adam algorithm  attempts to find a parameter set that will minimize model Equation (8). The approach uses non-linear programs to find optimal solutions given a hyperparameter set. The Adam optimizer combines the idea of momentum optimization  and RMSProp  . Adam keeps track of an exponentially decaying average of past gradients from previous iterations (momentum optimization). Adam also tracks the exponentially decaying average of past squared gradients from previous iterations (RMSProp). The applied algorithm is listed below.
1) Initialize m and s to have all zero entries; these vectors are of length .
2) Initialize ψ using He Initialization  .
3) Compute .
7) Repeat steps 3 through 6 until , where i references the iteration number, or until a prespecified number of iterations are reached.
The vectors and contain the exponentially decaying averages of and . For the He initialization  , the parameter estimate vector ψ is initialized using the random normal function of the form
where are randomly generated values from a normal distribution with a mean 0 and standard deviation 1 and p is the number of covariates in the dataset. The hyperparameter set consists of ν1, τ2, η, ε, ζ, and ς. For this study, after considering a small range of candidate values, ν1 was set to 0.5, τ2 was set to 0.8, η was set to 0.008, ε was set to 1E-7, ζ was set to 1E-5, λ was set to 0.001 and, ς was set to 0.5. Steps three through six are repeated until a specified number of iterations is reached (800 for this study) or until . In applying this algorithm, we need to include an adjustment for the elastic net penalty. When taking derivatives with respect to β, we adjust these functions by subtracting the derivatives of the elastic net penalty. This is not done for α or ϕ. As a result, when computing the derivatives for the βk, where k references the iteration, we subtract from that derivative term which leads to the derivatives for each β subject to the elastic net penalty. For each iteration, in addition to modifying β by subtracting a function of its derivative we also shrink the parameters by a factor of . This method was implemented in the R programming environment  . Functions from the MASS  and matrixcalc  R packages were used to implement the proposed model.
2.3. Applied Bootstrap Resampling Procedure
For the proposed model, the standard errors of our parameter estimates are computed using a bootstrapping pairs design  . Denote B as the number of resamples without replacement. For this study, B is set to 200  . For each bootstrap resample, we resample n tuples with replacement, which gives us the dataset and , . The proposed model is then fit to each resampled data set. Once the B models are fit, the corresponding parameter estimates are obtained. Denote the bth bootstrap parameter estimates as . Having these B parameter estimates allows us to estimate their standard errors and construct confidence intervals.
The bootstrap-t confidence interval method is used to construct confidence intervals. The bootstrap-t confidence intervals are of the form
with being defined as
where denotes the estimate of from the bth bootstrapped resamples dataset. In addition, is chosen from the standard normal distribution such that
where is defined as
The R programming environment  was used to implement this procedure.
3. Application to Simulated Data
The simulation procedure used is the same as previously presented  with a few noted changes. In that study, one dataset was simulated using a compound symmetric correlation structure for the covariates with ρ = 0.01. In addition, we simulated three additional dataset types. The second dataset type was simulated using a first order autoregressive, AR(1), correlation structure with ρ set to 0.1. The third dataset type has a Toeplitz correlation structure with each ρ generated randomly using a uniform distribution U (0, 0.4). The fourth dataset type has an unstructured correlation structure with each ~U (0, 0.4). For all simulated datatypes, 20 covariates (10 are significant) were generated for 1000 observations. Among the 10 significant parameter estimates, 5 were randomly set at 0.5 and 5 at −0.5 for all datasets. Each dataset was centered and scaled before the proposed model was fit. For each datatype, 100 datasets were simulated. The described bootstrap resampling technique was used to provide 95% confidence intervals; B = 200 resamples were used. In the model fitting process, the data were split into training: test data in the ratio 8:2. The model was developed using the training data. The final model was applied to the test dataset (independent data not used to build the model). The main criteria examined were the number of significant covariates that have non-zero parameter estimates, the number of non-significant coefficients that have estimates close to zero within a threshold, the accuracy of predictions when the model is applied to the test data set, and execution times. Functions from the R packages MASS  , mvtnorm   , futility  , MBESS  , Matrix  , and corpcor  were employed to implement this procedure.
The proposed methodology, and ordinalgmifs method with the option probability, model = “Stereotype”, was applied to the simulated data. The goal was to compare two implementations of penalized stereotype logit models. Table 1 presents the mean, and standard deviations, for prediction accuracy (determined by the test data) and execution times for both methods. Two-sided, two sample Welch’s t tests, with significance level of 0.05, were used to compare mean accuracy and execution times for both methods. For the proposed method, the average prediction accuracy for the compound symmetric simulated data is 96.1%; for AR(1) correlated data, 96.52% of the observations were correctly classified. This rate was 96.24% for the Toeplitz correlated data and 96.49% for the unstructured correlated data. Regarding classification, the proposed method outperformed the ordinalgmifs method on all datasets as determined by the t tests (all p values < 0.001). Regarding execution times, the proposed method executed faster for all datasets considered (all p values < 0.001). The average execution times for the proposed method range from 17.21 to 23.63 seconds, for the ordinalgmifs method the range is 166.98 to 200.06 seconds. The proposed method executed ~10-fold faster on average.
Tables 2-5 present the parameter estimates for the 10 significant parameters based on the proposed method. For all simulated datasets, the 10 non-significant parameters (not shown in the tables) had a maximum absolute value of 0.04; these values were close to 0. For the ordinalgmifs method, all non-significant parameters had estimates of 0. The proposed method selects the significant parameters that are truly related to the outcome while setting estimates of the non-significant parameters close to 0. In comparison, the ordinalgmifs methods set these values to 0. The confidence intervals are somewhat narrow (~0.5) for parameter estimates of significant covariates.
Figure 1 presents the percent of observations correctly classified per iteration for the training data of the simulated datasets. The method performance never decreases for all simulated datasets. Once the method maximizes the proportion that it can correctly estimate, it oscillates around that value. This could be due in part to the use of the Adam optimization algorithm  . The results indicate that the proposed model framework is adept at variable selection and classification capabilities when applied to independent datasets. The proposed model outperforms the ordinalgmifs implementation with regards to prediction accuracy and execution times. Computationally, the proposed model executes faster than the ordinalgmifs implementation by ~10-fold. The analysis was performed in the R programming environment  .
Figure 1. Plots presenting the percent correctly classified per iteration for the four simulated datasets. “The black line demonstrates the progress (% correctly classified) for the training data”.
Table 1. Average accuracy (% correctly classified) and execution times for proposed and ordinalgmifs methods, along with standard deviations.
For the two methods, accuracy and executions times were compared using a two-sided Welch’s two sample t test, with significance level of 0.05. “*” indicates a statistically significant difference.
Table 2. Parameter estimates and 95% confidence intervals for truly important variables included in the final model of the compound symmetric correlated data.
Table 3. Parameter estimates and 95% confidence intervals for truly important variables included in the final model of the AR(1) correlated data.
Table 4. Parameter estimates and 95% confidence intervals for truly important variables included in the final model of the Toeplitz correlated data.
Table 5. Parameter estimates and 95% confidence intervals for truly important variables included in the final model of the unstructured correlated data.
4. Application to NHL Data
The data came from a study titled “Subclass Mapping: Identifying Common Subtypes in Independent Disease Data Sets”  . The primary aim of this manuscript was to find gene expression profiles that could predict, with some degree of error, molecular subtypes of diseases. The cancer types evaluated by this manuscript were lymphoma and breast cancer. Lymphoma is defined as a cancer of the lymphatic system. The following review is taken from Cancer Stat Facts  . NHL make up approximately 90% of all malignant lymphomas, with the Hodgkin lymphomas accounting for the remaining 10%. NHL “is a heterogeneous disease resulting from the malignant transformation of lymphocytes and includes multiple subtypes each with specific molecular and clinical characteristics”  . NHL can either start in the B-lymphocytes or the T-lymphocytes. Among B-cell lymphomas, diffuse large B-cell lymphomas are the most common. T-cell lymphomas account for 15% of NHL in the United States. NHL account for 4.3% of all new cancer cases. There were 72,240 estimated cases for 2017 and 20,140 estimated deaths. The median age of diagnosis was 67, with the highest proportion of new cases occurring in the 65 - 74 age group. The estimated 5-year survival rate was 71.0%. The issue of stage prediction with NHL, using a set of covariates, provides an opportunity to evaluate the performance of the proposed model framework.
The raw data, DLBCL-A: data set and DLBCL-A: class labels, were downloaded from http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi  . The data were generated using one-channel oligonucleotide microarrays. The data have three subtypes, designated as oxidative phosphorylation (OxPhos), B-cell response (BCR), and host response (HR)  . The independent variables are gene expression values. The R package CePa  was used to read in the datasets. All variables were input into the model. The gene expression values were standardized (centered and scaled) prior to model fitting. There were 661 genes in the dataset. Among the 141 samples, 49 were OxPhos, 50 were BCR, and 42 were HR. The proposed model, along with the ordinalgmifs implementation of the stereotype logit, was applied to the gene expression dataset, with associated outcome vectors, to select genes associated with NHL subtypes. The described bootstrap resampling procedure was applied, yielding estimates of standard errors that were used to compute 95% confidence intervals. B = 200 resamples were used. Due to the small sample size, leave-one-out cross validation was used to estimate the predictive capabilities of the model. The analysis was performed in the R programming environment  .
Table 6 shows selected genes along with parameter estimates and confidence intervals. The displayed genes are those with the largest absolute value of the coefficient of variation, with standard deviations provided by the bootstrap resampling scheme. Only the top 20 were displayed. The corresponding gene names are also presented. The names of the genes are provided by the HUGO Gene Nomenclature Committee (https://www.genenames.org/)  . As with the results in the simulation section, the 95% confidence intervals are narrow, and prediction accuracy was 73%. When applying the stereotype logit-based ordinalgmifs function to the NHL data, parameter estimates could not be obtained due to optimization issues with that implementation. The following error was reported by the R programming environment “Error in optim(c(alpha, phi), fn.stereo, w = w, x = x, beta = beta, y = y: L-BFGS-B needs finite values of ‘fn’”. The above error relates to the optimization function being passed infinite values during the optimization process. All covariate values were centered and scaled prior to analysis. To correct the error, multiple values of the hyperparameters were passed to the ordinalgmifs function in R with no success. The data was checked for missing values; there were none. In addition, all the data were numeric. As a result, no comparison could be made with the ordinalgmifs implementation of the L1 penalized stereotype logit.
Table 6. Variable importance based on the application of the proposed model to the NHL dataset. The topmost 20 genes, in terms of variable importance, are presented. The model achieved a prediction accuracy of 73%.
There are multiple hyperparameters in the elastic net constrained stereotype logit, optimal values for these must be explored as this has the potential to increase variable selection and classification capabilities. This is usually accomplished with a grid search and is an open problem in machine learning for multiple hyperparameters  . For this study, a small selection of values was considered for each hyperparameter. The hyperparameter of interest is λ but the choice for the remaining hyperparameters is also very important in determining the optimal solution.
A bootstrap resampling procedure was used to estimate the 95% confidence intervals. The main drawback is the computational time required to produce the confidence intervals with 200 additional models being fit. It may be advisable to perform a closed form estimate of the parameter variance matrix  .
Although the stereotype logit is considered by many a generalized linear model, it is not. As such, an optimal solution may not exist, or there may be inflexion points. As a result, different starting values may yield different solutions. In this study, applying the method to a given dataset does not exhibit a great deal of variation in results, and the results of the applied bootstrap procedure confirm this. To address this, we applied a variable initialization scheme proposed by He  . In addition, the Adam optimization function is well suited to dealing with non-convex functions  . The combination of these two factors addresses this issue.
A proposed model for the elastic net penalized stereotype logit model, with optimization provided by the Adam optimizer, to analyze ordinal outcome data was presented. The proposed method was applied to simulated and NHL data with reported results. For the simulated data, variable selection was perfect, and only significant variables had parameter estimates not close to 0. The classifications ranged from 96.1% to 96.52% on the test datasets. For the NHL data, 73% were correctly classified. The 20 topmost genes in terms of absolute value of the coefficient of variation were presented. Our evaluation study shows that the proposed method outperforms the ordinalgmifs penalized stereotype logit model; no comparison could be made with the NHL data analysis as the ordinalgmifs implemented stereotype logit was not able to produce parameter estimates. This manuscript is an extension of previous work  . In the previous study, variable selection was adequate, but the classification capabilities were lacking. This work improves the prediction accuracy when applied to simulated and NHL data (ranging from 73% to 96.52%). In addition, the variable importance also improved with only the significant parameters having non-zero estimates. This study demonstrates, with success, the application of the Adam optimizer to the elastic net penalized stereotype logit model to analyze ordinal outcome data with promising results, as demonstrated on the simulated and NHL datasets.
First and foremost, I would like to thank God from whom all blessings flow. I would also like to express special thanks to Timothy Wysocki, the Co-director of the Center for Healthcare Delivery Science, at Nemours Children’s Specialty Care for allowing me to work on this project.
 Williams, M. and Schnellhammer, P. (2016) Testicular Seminoma.
 Anderson, J.A. (1984) Regression and Ordered Categorical Variables. Journal of the Royal Statistical Society: Series B (Methodological), 46, 1-30.
 Archer, K.J. and Williams, A.A.A. (2012) L1 Penalized Continuation Ratio Models for Ordinal Response Prediction Using High-Dimensional Datasets. Statistics in Medicine, 31, 1464-1474. https://doi.org/10.1002/sim.4484
 Archer, K.J., Hou, J., Zhou, Q., Ferber, K., Layne, J.G. and Gentry, A.E. (2014) Ordinalgmifs: An R Package for Ordinal Regression in High-Dimensional Data Settings. Cancer Informatics, 13, CIN.S20806. https://doi.org/10.4137/CIN.S20806
 Williams, A.A. and Archer, K.J. (2015) Elastic Net Constrained Stereotype Logit Model for Ordered Categorical Data. Biometrics & Biostatistics International Journal, 2, Article ID: 00049. https://doi.org/10.15406/bbij.2015.02.00049
 Hastie, T., Taylor, J., Tibshirani, R., Walther, G., Boyd, S., Friedman, J., et al. (2007) Forward Stagewise Regression and the Monotone Lasso. Electronic Journal of Statistics, 1, 1-29. https://doi.org/10.1214/07-EJS004
 Zou, H. and Hastie, T. (2005) Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society: Series B (Methodological), 67, 301-320.
 Tibshirani, R. (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, 267-288.
 Polyak, B.T. (1964). Some Methods of Speeding up the Convergence of Iteration Methods. USSR Computational Mathematics and Mathematical Physics, 4, 1-17.
 Hinton, G., Srivastava, N. and Swersky, K. (n.d.) Neural Networks for Machine Learning. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
 R Core Team (2017) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. https://www.r-project.org/
 Efron, B. and Tibshirani, R. (1986) Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Methods of Statistical Accuracy. Statistical Science, 1, 54-75. https://doi.org/10.1214/ss/1177013815
 Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F. and Hothorn, T. (2017) Mvtnorm: Multivariate Normal and t Distributions.
 Zhuang, Y., Juraska, M., Grove, D., Gilbert, P. and Luedtke, A. (2017) Futility: Interim Analysis of Operational Futility in Randomized Trials with Time-to-Event Endpoints and Fixed Follow-Up. https://cran.r-project.org/package=futility
 Schafer, J., Opgen-Rhein, R., Zuber, V., Ahdesmaki, M., Silva, A.P.D. and Strimmer, K. (2017) Corpcor: Efficient Estimation of Covariance and (Partial) Correlation.
 Hoshida, Y., Brunet, J.P., Tamayo, P., Golub, T.R. and Mesirov, J.P. (2007) Subclass Mapping: Identifying Common Subtypes in Independent Disease Data Sets. PLoS ONE, 2, e1195. https://doi.org/10.1371/journal.pone.0001195
 HUGO Gene Nomenclature Committee (n.d.) HGNC Database of Human Genes.