Introduction Popularized in the work by David Holpert and William Macready, the No Free Lunch (NFL) Theorem states that no single machine learning algorithm is better than all the others on all problems . Other researchers have tried multiple models to find one that works best for a problem. In fact, studies by Carp   illustrate effects on research findings in functional MRI (fMRI) studies due to variations in analytic strategy, with increased model flexibility leading to higher rates of false positive results. Wagenmakers, et al.  point out that many studies in psychology do not commit to an analysis method before seeing the data, with some researchers fine-tuning their analysis to the data, proposing that researchers “preregister their studies and indicate in advance the analyses they intend to conduct” in order to be considered as “confirmatory” research, rather than as “exploratory”. A study published in May 2020 expands on Carp’s findings, noting that fMRI analyses conducted on the same data by seventy different laboratories produced a wide range of results . This particular study highlighted the fact that fMRI analysis requires several stages of pre-processing and analysis to determine which areas of the brain show activity. They found that the choice of pre-processing pipeline led to widely varied results. Among the seventy study teams, no two teams selected the same pipeline. Figure 1 illustrates the potential implications of varying pipeline choices in neuroimaging.
Perhaps the most illustrative lack of research consistency is the study by Silberzahn, et al.  which recruited 29 independent research teams with 61 analysts to address the question, “Are soccer referees more likely to give red cards to dark-skin-toned players than to light-skin-toned players?” The research teams represented 13 countries, a variety of disciplines, and a range of expertise and academic degrees. Using the same dataset and research question, the 29 teams utilized 29 unique analytical modeling approaches resulting in 21 unique combinations of covariates, 20 teams with significant positive results, and odds ratios ranging from 0.89 to 2.93, as in Table 1. To say the least, analytic choices, even if justifiable and statistically valid, have a downstream effect on model results.
The statistical model development framework can be generally divided into
Figure 1. Researchers process neuroimaging data using a wide variety of pipelines, which can produce varying results. Making different choices for each step leads to a different end point—the red dots represent how activation moves throughout the brain depending on which pipeline is used .
Table 1. From “Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results” .
three phases: data discovery, variable preparation, and modeling. Within each of these phases there are steps in the model development that encompass a wide range of data management, data mining, and data analysis techniques, including data ingestion, sample selection, data cleaning and imputation, feature reduction, feature engineering, normalization, model development, and model validation. Analyzing the downstream effects of modeling approaches within each of these steps will allow for statistically motivated modeling choices in the future. Quantifying the analysis effects of these strategies provides a diagnostic illustration of where researchers can expect to find improvements in their model results.
This study quantifies the downstream analysis effects of data pre-processing choices by utilizing a decomposition of the loss functions, measuring effects on model bias, variance, and irreducible error/random noise. In this way, measurements of bias and variance can be efficiently applied as diagnostic procedures for model pre-processing and development. Applying bias-variance decomposition to a variety of data distributions and model types can lead towards an improved understanding of quantitative variations within model development methods as well as comparing results consistently between methods. Understanding of statistical bias and variance can be used to diagnose problems with machine learning bias and develop methods for reducing bias and variance in algorithms. For example, this bias-variance trade-off does not always behave as expected under distributional assumptions. Even with the availability of more advanced models, such as neural networks, simple models still often perform well, or even better than more complex models, in experiments . Generally, while more complex models result in decreased bias, they tend to increase variance and, therefore, do not generalize well to new data . However, it has been found that ensemble models, although complex, often outperform single models and this seems contradictory to the trade-off between simplicity and accuracy. In this case, decomposition of bias-variance for ensembles led to the understanding that while increased complexity for a single model often increases variance, averaging multiple models will often (but not always) lead to decreased variance . The goal, therefore, of understanding the effects on bias-variance decomposition is to quantify the downstream analysis effects of a selection of model development choices. Using these results as diagnostic procedures can lead to improved model development performance and consistent, reproducible results across data types and domains.
2.1. Quantifying Bias-Variance Trade-Off
We measure the effects of various normalization methods on the bias-variance decomposition of the risk function by directly simulating the definition of the decomposition under varying conditions. We use information found from these simulations to quantify the downstream analysis implications for the predictive models of interest. Since “an important goal in algorithm design is to minimize statistical bias and variance and thereby minimize error ,” we use our findings to propose pre-processing and algorithm design choices that best minimize common design effects on bias and variance. For example, “any change that increases the representational power of an algorithm can reduce its statistical bias. Any change that expands the set of available alternatives for an algorithm or makes them depend on a smaller fraction of the training data can increase the variance of the algorithm .” The result of such a study is to formulate a theory of bias and variance reduction and predict when either or both will succeed in practice.
2.2. Data Normalization
In this context we consider normalization to include data scaling techniques such as normalization and standardization. Data can be scaled so that features measured on different scales can be compared. Probability distributions of features can also be adjusted to be in alignment with each other. Another method is to shift and scale the features (standardization), which removes the units of measure. The primary goal of normalization is to scale each data point in a way that gives equal weight to the features to be used in developing a model. We consider four within-feature normalization methods including standardization, min-max, max absolute value (maxAbs), and quantile transformation, and one between-feature normalization, quantile normalization.
2.2.1. Z-Score, “Standardization”
Standardization is a method that shifts and scales the data to be centered around 0 with a standard deviation of 1:
Characteristics of this method include:
• Assumes data is normally distributed within each feature.
• Centers distribution around 0, with standard deviation 1.
• If data has outliers, scales most of the data to a small interval.
• Does not produce normalized features with the exact same scale.
Even if the data has outliers, Z-score normalization will scale most of the non-outlier data to be in a similar range between all features, assuming the data is normally distributed, as in Figure 2.
2.2.2. Min-Max Normalization
For each feature, the minimum value of that feature is transformed to a 0, the maximum value is transformed to a 1, and every other value lies between 0 and 1:
Advantages of this method include:
• Scales data between 0 and 1; guarantees all features have exact same scale.
• Preserves shape of original distribution.
• Preserves 0 entries in sparse data.
• Least disruptive to information in original data.
However, this method does not reduce the importance of outliers, so skewed results can still exist after normalizing if outliers exist, as in Figure 3.
Figure 2. The data is squished due to outlier but most of the data lies within similar range for both features (  ).
Figure 3. Min-Max Normalization fixes the distribution on the Y-axis but is still problematic on the X-axis due to the outlier (  ).
2.2.3. Max Absolute Value Normalization
This method scales feature by its maximum absolute value so that the maximum absolute value of each feature will be 1. This sets the distribution of each feature between −1 and 1.
Characteristics of this method include:
• Good for data with positive and negative values.
• Preserves 0 entries in sparse data.
• Similar sensitivity to outliers as in min-max normalization.
2.2.4. Quantile Transformation
Quantile transformation transforms each feature independently to follow a uniform or normal distribution. This is a non-linear transformation that uses the estimated value of the cumulative distribution function (CDF) to map a feature original values to a uniform or normal distribution:
1) Calculate empirical ranks, using percentile function.
2) Modify the ranking through interpolation.
3) Map to a Normal distribution by inverting the CDF, and clipping bounds at the extreme values so they don’t go to infinity.
Characteristics of this method include:
• Tends to spread out the most frequent values of a given feature.
• Smooths out unusual distributions.
• Less sensitive to outliers as other scaling methods.
• Distorts the linear correlations between variables measured at the same scale, but variables measured at different scales are more directly comparable.
• For a Normal transformation, the median of the feature becomes the mean, centered at 0.
2.2.5. Quantile Normalization
Quantile normalization is a method most notably used in genetics to normalize within samples, rather than within features as in the previously described methods. In genetic sequencing, data is often normalized based on the assumption of consistent within and between sample distributions, with observed variation around these distributions assumed to be the result of technical noise. Samples are normalized to the same distribution as each other or to a reference gene sample ().
1) Given n arrays of length p, form X of dimension where each array is a column;
2) Sort each column of X to give ;
3) Take the means across rows of and assign this mean to each element in the row to get ;
4) Get by rearranging each column of to have the same ordering as original X.
Characteristics of this method include:
• Makes 2 or more distributions identical in statistical properties.
• Does not preserve original data distributions.
2.3. Model Characteristics
In order to test the effects of various pre-processing methods on select data structures, the data structures and normalizations are considered under a selection of commonly used modeling techniques. The considered models include generalized linear models (GLM), decision tree, random forest, support vector machine (SVM), gradient boosting, and neural network. The selected models represent a range of simple to complex, parametric and non-parametric, global and local, stochastic methods. Each method has characteristics that may require normalization for optimal results or lead to unintended effects if incorrect normalization is used. For example, while GLM are fit using maximum likelihood estimation (MLE) which provides statistically optimal properties of the estimators, scaling data by normalization and standardization is still important because variables with a large difference in ranges can result in an ill-conditioned design matrix and difficulty reaching model convergence, resulting in slower processing times and unstable parameter estimates.
2.3.1. Generalized Linear Models
Historically, Generalized Linear Models (GLM) are an extension of simple linear regression models with continuous targets and continuous and/or categorical features. The form of such a model is expressed as
where is the data in feature i, and are the coefficent parameters to be estimated as part of the linear function. In simple linear regression the assumption is that y is normally distributed, and the errors are normally distributed as and independent, the data is fixed, and there is constant variance . The GLM extends this simple linear model concept by assuming the target variable, , follows a distribution within the exponential family (i.e. normal, binomial, poisson, etc.) with mean . The target then follows some linear or nonlinear function of , the linear combination of data and estimated coefficient parameters . A summary of common GLM is found in Table 2.
Generalized Linear Models are comprised of three main components: Random,
Table 2. Summary of common generalized linear models from Agresti.
Systematic, and Link Function. The random component refers to the distribution of the target variable (Y), e.g. normal distribution in linear regression, or binomial distribution in logistic regression. The systematic component specifies the explanatory features ( ) and their linear combination. The Link Function specifies the link between the random distribution of the target variable and the systematic features. Assumptions of GLMs include:
• Data are independently distributed.
• Errors are independent, but do not need to be normally distributed (i.e. Logistic Regression).
• Dependent variable does not need to be normally distributed (expect in linear regression) but are distributed within the exponential family.
• Assumes a linear relationship between the link function transformed target and the explanatory features.
• Uses Maximum Likelihood Estimation (MLE) to estimate the parameters, so it relies on large sample properties and regularity conditions (1st and 2nd derivatives must exist).
GLMs use Maximum Likelihood Estimation (MLE) to estimate the model parameters. In each of the distributions considered above (i.e. Linear, Logistic, Poisson, etc.), the distribution depends on one or more unknown parameters, . The value of these parameters, , is estimated using observed data x. The function of that results from plugging in observed data x is known as the Likelihood Function:
This function is the product of the values of the parameters, given each sample of data, and is denoted simply as . The log-likelihood is often used for computational convenience. The goal in GLMs is to maximize the likelihood of a parameter estimate given the observed data. The value of that maximizes this function is known as , the maximum-likelihood estimate (MLE). The maximum of the function is found by taking the derivatives with respects to the parameter(s) .
In this work, three generalized linear models are considered for three distinct target data types: linear regression, logistic regression, and Poisson regression.
2.3.2. Linear Regression
Linear regression is used for data with a continuous target which is a linear combination of the explanatory features, as in
where index i represents each data point. This models the mean expected value of Y. The random component of linear regression, Y, has a normal distribution and normally distributed errors, . The systematic component, the explanatory features X, can be continuous, categorical, or a combination of both, and is linear in the parameters . In multiple linear regression with multiple explanatory features, there is still a linear combination of the features in terms of their coefficient parameters ’s but the features themselves can have transformations, i.e. or . The link function is the identity link, since linear regression is modeling the mean response directly.
2.3.3. Logistic Regression
When there is a binary target (i.e. 0 and 1) binary logistic regression models the log odds of probability of “success” (target = 1). The random component, Y, has a binomial distribution, , where is the probability of success. The systematic component, X, can be continuous, categorical, or a combination of both, and is also linear in the parameters as in linear regression. However, in
this case, the link function is the Logit link, . Specifically, the logit link models the log odds of the mean response, .
2.3.4. Poisson Regression
When the target of interest is an expected count (i.e. counts of disease, number of homes sold in a day, etc.), we extend the generalized linear model to use a log-linear or Poisson regression model. This models the expected count as a function of the explanatory predictors, , where the predictors can be continuous, categorical, or a combination of both. When all the predictors are categorical this is known as a log-linear model. The random component of the Poisson model is the response Y with Poisson distribution, for where expected count of is . The systematic component is, as in the other GLM models, the linear combination of explanatory features X. Finally, the link function for the Poisson regression model is the natural log link, .
Advantages of GLM
• Do not need to transform target variable to have normal distribution.
• Models fit using MLE which provides statistically optimal properties of the estimators.
• Model can be easily explained and parameters can be interpreted in the context of the prediction problem.
• Easily implemented in most software.
Disadvantages of GLM
• Still has to be a linear function of the parameters; the link function serves only to connect the nonlinear target distribution to a linear function.
• Target responses must be independent.
2.3.5. Decision Tree
A decision tree is a non-parametric classification technique that learns decision rules from features, using locally optimized, recursive partitioning. The algorithm assigns each sample in a dataset into a predicted class based on each samples’ feature attributes. The algorithm uses information gain (7) to find the best features for classifying the data, where p and n are the proportion of 0 and 1 values of a binary outcomes for the i-th target class. Then, for each value defined for the decision values of the best feature (the feature and splitting value that best splits the predicted 0 and 1 outcomes), the algorithm repeats the process with additional, next-best predictive features. This process continues until the leaves of the tree are pure (samples at each node belong to the same class) or a pre-defined stopping criteria is reached . In this way, decision tree is also a feature importance algorithm, where the data will be split on the most important, predictive features first.
• Since the decision tree algorithm is based on ordering and splitting the values within each feature, rather than a scale-dependent maximum likelihood optimization, scaling and normalizing features is not required.
• Robust to missing data.
• This model provides visual splits of the data and ordered feature importance that is easy to understand and interpret.
• Implicit variable screening and selection, the top nodes of the tree are the most important variables in the dataset.
• Non-parametric model does not assume linearity or any other distribution of the data. Model is built only based on observed data.
• Since this is a locally optimized, greedy algorithm, it is not guaranteed that a global optimum will be reached.
• Decision tree is very sensitive to changes in data. Small changes in data (i.e. adding samples) can lead to large structural changes in the tree, i.e. high variance.
• This is a more complex model and often requires more training time.
• Without regularization (early stopping, pruning, max nodes, etc.), there is high risk of overfitting.
2.3.6. Random Forest
Random forest is a method that uses ensemble learning to address some of the disadvantages of the decision tree model. Ensemble learning combines results from multiple models to make more accurate predictions than any one single model, by reducing variance. Random forest uses an ensemble learning technique known as bootstrap aggregation, aka bagging. Bagging uses random sampling with replacement to build individual models on subsets of the available data and then aggregate the results into one prediction. The repeated sampling leads to an algorithm that is known to reduce variance, as in one of the main disadvantages of the decision tree model. Random forest combines many decision trees into one model by running the individual decision tree models in parallel and then outputting the prediction that is the mode of target classes for a classification problem or the mean prediction for a regression problem . The structure of a random forest model is shown in Figure 4.
• Much like decision tree, gives estimates of most important features.
• Known for high accuracy, low bias.
• Decreased variance in comparison to decision tree.
• Can handle large datasets with high dimensionality.
• Since it identifies most important features, can be used as a feature reduction method.
• Robust to missing data.
• Use of bootstrap sampling allows for successful application when data is limited.
• When classifying categorical data, biased in favor of features with more levels.
• Will overfit data if regularization not used, such as limiting number of features that can be split at each node.
• More difficult to interpret than single decision tree model.
2.3.7. Support Vector Machines (SVM)
SVM with Gaussian kernel is a parametric model that represents instances of data as points in space and then builds a model to assign new instances to one category or another. Each data point is represented as an n-dimensional vector, then SVM constructs an n-1-dimensional separating hyperplane to discriminate 2 classes, with maximized distance between the hyperplane and data points on each side. SVM aims to find the best hyperplane for separation of both classes . Data are represented as:
where is either 1 or −1, indicating to which class belongs. Each is p-dimensional vector representing all of the characteristic values (features) of . The hyperplane that best separates the group of vectors where from the group of vectors where is:
where is the normal vector to the hyperplane and b is the offset of the hyperplane from the origin. If the data points are linearly separable, the hard margin can be represented as
Figure 5 shows a maximum margin separation for linearly separable data. The samples that fall on the margin are known as the support vectors.
Figure 4. Random forest structure .
Figure 5. Maximum margin hyperplane .
The SVM algorithm assumes that data is in a standard range (usually between 0 to 1, or −1 to 1), so it is recommended to scale features before using the algorithm. In fact, when using the Gaussian kernel, if data is normalized between 0 and 1, then the dot product between the feature vectors and the separating hyperplane is the cosine similarity .
• If there is clear separation of the data classes, SVM works very well.
• Effective in high-dimensional data, especially when the number of features is similar or greater than the number of samples.
• Since the samples that make up the support vectors are the only training data used to define the model, SVM is memory efficient.
• Since this model has to calculate the distance between every training point to create a separating hyperplane, it is computationally expensive as the size of the data set increases.
• Noisy data with overlapping target classes are difficult to separate; Kernel functions can be added to transform the data into higher level feature space for improved separation but this adds model complexity.
• Does not directly provide parameter coefficients so it is difficult to interpret.
2.3.8. Gradient Boosting
Gradient boosting is another form of ensemble learning, this time utilizing a technique known as boosting. In a boosting algorithm, predictions are not made in parallel as in the bagging method of random forest. In this case, subsequent prediction models learn from the mistakes of previous models. Observations have an unequal probability of appearing in the subsequent models, with high error observations appearing in the most models. This is contrary to the random forest model where observations are selected for each model via bootstrapping (random selection with replacement) and have equal probability of appearing in each model. Visual comparison of single, bagging, and boosting models is show in Figure 6.
In gradient boosting, an ensemble of weak models, often decision trees, are used to improve the model based off of hard to predict samples. The algorithm leverages patterns in model residuals, such as those from using MSE loss, to build subsequent models from the weak predictions. For example, in a simple linear regression there is the assumption that the sum of the residuals is 0, i.e. spread randomly with no pattern around zero. However, assuming there is some pattern in the residuals for a base model, such as a decision tree, gradient boosting builds sequential models off of these residual patterns until there is no longer a pattern, i.e. average residual is zero or constant. The sequential model predictions are then weighted into a combined prediction. The intuitive idea behind gradient boosting is to combine several weak models, with each additional weak model improving the MSE of the overall model. Advantages of bagging and boosting ensemble techniques are illustrated in Figure 7.
• Focus on difficult to classify cases makes it robust to imbalanced datasets.
• MSE is commonly used loss function, but gradient boosting can be optimized on many objective functions so it can be extended to many different problem spaces.
Figure 6. Bagging (independent models) and boosting (sequential models) .
Figure 7. Ensembling .
• Requires more hyperparameter tuning, and training time to avoid overfitting compared with random forest.
• Sensitive to overfitting if data is noisy, i.e. many hard to classify cases to use in the sequential models.
• Longer training requirements due to sequential nature of algorithm (as opposed to parallel model development in random forest).
2.3.9. Neural Network
In this study, the effects of normalization on various data types are also tested using a multi-layer perception (MLP), also known as the simple form of a neural network. Neural networks are models that learn non-linear function approximations by feeding a set of input features into an output. Although the input and output layers are similar to the linear approximations of generalized linear models, neural networks differ in that there is one or more non-linear hidden layers, as in Figure 8 with one hidden layer.
The first layer, the input layer, contains a set of neurons representing the m input features. The inputs are fed into the hidden layer first with a weighted linear combination, similar to the linear combination of features
Figure 8. One hidden layer neural network.
and βs in a GLM. The combined inputs are then transformed by a non-linear activation, such as a tan function. From the last hidden layer, the input layer then applies an activation function to transform the values into outputs, such as the sigmoid function for a binary classification problem. The weights within each layer of the neural network are learned through a process of backpropagation and gradient descent. This process uses derivatives with respect to each parameter to find the optimal value of the selected loss function. Even though neural network uses non-linear transformations in the hidden layers, the network still uses linear combinations of the features and weights to learn the optimal parameters, as in the GLM, linear-based methods.
• Can learn complex, non-linear models.
• Works well with “big data”; feeding neural networks more data leads to improved training and results.
• Ability to detect all possible interactions between predictor variable.
• MLP with hidden layers have a non-convex loss function where there exists more than one local minimum; different random weight initializations can lead to different validation accuracy.
• Sensitive to feature scaling, due to above disadvantage.
• Requires a lot of tuning (number of hidden neurons, layers, iterations), and regularization to prevent overfitting.
• Requires a lot of data for best training and results.
• Difficult, computationally expensive to train.
• “Black box” algorithm is difficult or not possible to interpret.
2.4. Model Summary
A global model is one in which there is a single predictive formula for the entire data space. It is expected that a linear transformation of data in a linear-based global model (linear regression, logistic regression, Poisson regression, linear SVM) will result in the model parameters (i.e. weights in a neural network, coefficients in regression) adjusting to reach the optimal value of the risk function, such as using the MLE in the GLM class of models. As a result, we expect that choice of normalization method should not affect the risk function value as long as the feature space is a convex function, but it can affect the values and stability of the feature coefficients. In this case, even though normalization may not affect the estimated total average error, it may have effects on the estimated average bias and variance due to model instability.
For non-linear, locally recursive models such as decision tree, random forest, and gradient boosting regression, it is also expected that within-feature global normalization will have little effect on risk function value. These tree-based models optimize by finding the best split-point within each individual feature by the percentage of labels correctly classified using that feature. Since these models are local, recursive models, as long as the ordering within the features is preserved, normalization of the data should not affect the loss function value. However, although we’re using a decision tree-based learning model for the gradient boosting regression, this type of sequential boosting model relies on minimizing the MSE for the global model through subsequent predictions on the individual model residuals. Because of this, it is suspected that the gradient boosting model will exhibit patterns in bias-variance decomposition similar to the linear models. However, since we are using the default hyper-parameters in the gradient boosting model for consistent simulation conditions, it is possible that the bias-variance decomposition results will suffer from over-fitting and have a longer training time.
2.5. Simulation Methods
In order to approximate the bias-variance decomposition we need to approximate the expected value by simulating many variants of the training data sets. We can do this via bootstrap sampling. We take a synthetic input dataset D and create variants of D from of size n (Algorithm 1).
Now for each example we have many predictions and can estimate:
• variance: variance of
bootstrap replicate datasets with 70% training samples and 30% out-of-bag testing samples were selected from simulated bivariate normal data with samples. The simulated features have different means and standard deviations, and an identity covariance matrix. The true target value, Y, was created as a simple linear function of the simulated features plus a random error
Algorithm 1. Bootstrap Sampling.
term. In addition, datasets with binary (logistic) target and continuous target were created for each simulated data distribution. The models described previously, representing varying complexity, were applied using the training data and the bias-variance decomposition of the MSE risk computed on the test set, with total average risk, bias, and variance calculated over all 1000 bootstrap replicates. This process was repeated on the simulated data using the previously described normalization techniques. Initial dataset simulations were completed in R and risk function decompositions with bootstrapping completed in Python. This process was then repeated on additional simulated datasets including rank-based data, categorical data, mixed data, and Poisson data. Default hyperparameters were used for all tested models and data structures, with no additional hyperparameter tuning to allow for consistent comparison between methods. MSE risk decomposition was then performed on several benchmark datasets to assess results on various data characteristics including sparse data, wide data (more features than samples), and imbalanced data. These benchmark datasets were from the UCI Machine Learning Library  and are listed in Table 3. The results of bias-variance decomposition are then used to inform model develop on several existing study applications. Selected applications include the historical data from the NCAA Men’s Basketball Tournament (historical data used due to cancellation of NCAA tournament in 2020; comparing to model results from last years competition), and a credit risk model .
3. Simulation Results
Results are divided into three sections describing results from 1) bias-variance decomposition simulation, 2) bias-variance decomposition on benchmark datasets, and 3) application of findings from Sections 1 and 2 to existing NCAA data and credit risk data. The first section on simulation results is divided by target data type (binary, continuous, Poisson). Results across data structures and models is relatively consistent so summaries were provided to avoid repetition. Complete results of bias-variance decomposition under various data structures,
Table 3. Benchmark datasets and charactertics.
normalization strategies, and models are shown in Appendices A and B. Performance measures for simulated model results are found in Table 4.
3.1. Binary Target
Over all models applied to bivariate normal data with binary target, SVM using within-feature normalization methods (risk = 0.290), and logistic regression with raw data or quantile normalization (risk = 0.290) have best, similar risk function results (TableA1). While SVM has the consistently best results and is normalization-agnostic (FigureB1(d)), logistic regression (FigureB1(a)) with raw data or quantile normalization has similar performance with a faster run time (approximately 5 seconds vs 18 seconds as in Table4). If an analyst would like to use decision tree, random forest, or neural network models instead, it is recommended to use raw data or quantile normalization for best risk function results.
For models applied to rank-based data with binary target, logistic regression using raw data or quantile normalization (risk = 0.444), and SVM with raw data or quantile normalization (risk = 0.437) have best, similar risk function results (TableA4). While SVM has the consistently best results (14d), logistic regression with raw data or quantile normalization (FigureB4(a)) has similar performance with a faster run time (approximately 6 seconds vs 39 seconds as in Table4). If an analyst would like to use decision tree (best risk = 0.489), random forest (best risk = 0.445), gradient boosting (best risk = 0.465), or neural network (best risk = 0.475) models instead, it is recommended to use raw data or quantile normalization for best risk function results.
Over all normalization methods and models applied to categorical data with binary target, risk function estimates ranged from 0.473 to 0.502, indicating that this data is somewhat model- and normalization-agnostic (TableA7). SVM
Table 4. Average model performance (in seconds of processing time) for bias-variance decomposition of simulated data structures.
using z-standardized data, and logistic regression with all methods except z-standardization resulted in best risk function value of 0.473 (FigureB7(d)). However, while SVM and logistic regression have similar results, logistic regression (FigureB7(a)) is more than 3 times faster (7 seconds vs. 21 seconds average processing time as in Table 4). Since the simulated dataset consists of all categorical data, the features are first converted to [0, 1] coded dummy features, effectively “normalizing” the data between 0 and 1, so additional normalization methods are not expected to have an effect on the downstream analysis.
For mixed data types with a binary target, although there are slight deviations between normalization and model performance, risk function values do not vary much between all methods, with a range between 0.479 and 0.507 (TableA10). A decision tree model using raw data leads to the best results (FigureB10(b)), and gradient boosting using raw or quantile normalized data leads to the worst results (FigureB10(e)). However, considering the consistency of performance across normalization methods and models, it is recommended to make selections based on additional criteria, such as processing resources, model interpretation, or another performance measure such as specificity and sensitivity.
3.2. Continuous Target
Over all models applied to bivariate normal data with continuous target, neural network using raw data (risk = 0.342), and linear regression with raw data (risk = 0.338) have best, similar risk function results (TableA2). Quantile normalization method for both models has similar results with MSE risk of 0.344 for linear regression (FigureB2(a)) and 0.356 for neural network (FigureB2(f)). However, while neural network and linear regression have similar results, linear regression is approximately 20 times faster (1.2 seconds vs. 26 seconds average processing time as in Table4). SVM has the most consistent results; even though the within-feature normalization methods all perform worse than raw or quantile normalized data, SVM within-feature normalization results perform better than the same normalization in all other tested models. If normalization and scaling of data is required, as with features measured on highly divergent scales, it is recommended for an analyst to test the SVM model, keeping in mind increased processing requirements.
When applied to rank-based data with continuous target, neural network using raw and quantile normalized data (risk = 0.175), and linear regression with raw and quantile normalized data (risk = 0.174) have best, similar risk function results (TableA5). However, while neural network and linear regression have similar results, linear regression is more than 17 times faster (0.8 seconds vs. 14 seconds average processing time as in Table4). SVM has the most consistent results (FigureB5(d)); even though the within-feature normalization methods all perform worse than raw or quantile normalized data, SVM within-feature normalization results perform better than the same normalization in all other tested models. If normalization and scaling of data is required, as with features measured on highly divergent scales, it is recommended for an analyst to test the SVM model, keeping in mind increased processing requirements. Note, however, that outside of the best-performing linear regression and neural network models, all other methods and models perform significantly worse due to exploding estimates of average bias.
While normalization generally does not improve or worsen results as compared with raw data, it is not recommended to use z-standardization for categorical data with a continuous target, as the simulation results indicate increased risk function values (TableA8). In particular, if using linear regression (FigureB8(a)), z-standardization and quantile transformation should be avoided as these methods used with this model lead to significant explosion in the risk function value. Outside of z-standardization for any tested model, and quantile transformation for linear regression, simulation results indicate that this type of data is both normalization- and model-agnostic. In this case, normalization and model selection can be based off of additional criteria, such as processing requirements or model transparency.
Over all models applied to mixed data with continuous target, neural network using raw data (risk = 0.366) and quantile normalized data (risk = 0.425), and linear regression using raw data (risk = 0.367) and quantile normalized data (risk = 0.363) have best, similar risk function results (TableA11). However, while neural network (FigureB11(f)) and linear regression (FigureB11(a)) have similar results, linear regression is approximately 41 times faster (1.6 seconds vs. 66 seconds average processing time as in Table4). SVM has the most consistent results; even though the within-feature normalization methods all perform worse than raw or quantile normalized data, SVM within-feature normalization results perform better than the same normalization in all other tested models. If normalization and scaling of data is required, as with features measured on highly divergent scales, it is recommended for an analyst to test the SVM model, keeping in mind increased processing requirements and potential for increased bias.
3.3. Poisson Target
Over all models applied to bivariate normal data with Poisson target, random forest using within-feature normalization methods (risk = 1.125), and gradient boosting using within-feature normalization methods (risk = 1.167) have best, similar risk function results (TableA3). Poisson regression using raw or quantile normalized data also has strong results with risk function value of 1.5. Although Poisson regression (FigureB3(a)) results are not as strong as those found using random forest (FigureB3(c)) and gradient boosting (FigureB3(e)), Poisson regression has processing time more than 200 times faster than random forest (0.8 seconds vs. 167 seconds average processing time) and 13 times faster than gradient boosting (0.8 vs. 167 seconds average processing time) as seen in Table4.
For rank-based data with Poisson target, SVM using raw and quantile normalized data (risk = 1.281), and Poisson regression with raw and quantile normalized data (risk = 1.280) have best, similar risk function results (TableA6). However, while SVM and Poisson regression have similar results, Poisson regression is more than 6 times faster (6 seconds vs. 36.6 seconds average processing time), as seen in Table4. Although Poisson regression has the best results for this data, use of the within-feature normalization methods result in unstable, exploding risk function values and should be avoided (FigureB6(a)). Within-feature normalizations should also be avoided if using a neural network on this data for the same reason. If normalization and scaling of data is required, random forest model has the best results for the within-feature methods, although it has the longest processing time.
Although there are slight deviations between normalization and model performance, risk function values do not vary much between all methods when considering categorical data with Poisson target, with a range between 1.499 and 1.783 (TableA9). A decision tree model using min-max, maxAbs, quantile transformation, or quantile normalization lead to the best results (FigureB9(b)), and SVM using z-standardization leads to the worst result (FigureB9(d)). However, considering the consistency of performance across normalization methods and models, it is recommended to make selections based on additional criteria, such as client requested models.
Over all models applied to mixed data with Poisson target, gradient boosting (FigureB12(e)) using raw and quantile normalized data (risk = 1.746), and random forest (FigureB12(c)) using raw data (risk = 1.820) have best results (TableA12). Generally, the best performing risk function values do not vary much between all tested models, with a range between 1.476 for the gradient boosting model and 2.087 for the decision tree model (Table4). The similarity in best performing model results indicates that this data type is somewhat model-agnostic, although normalization methods should be selected carefully if required for analysis. For example, it is not recommended to use any of the tested within-feature normalizations if a neural network is used due to significant increases in bias and variance found in the simulation results.
4. Benchmark Data Results
Benchmark datasets were selected from the UCI Machine Learning Library to cover data types similar to those covered in the simulations. Complete tables of risk function decomposition results are found in Appendix Section A.2, and figures are found in Appendix Section B.2. Binary target datasets with numeric features (wine quality, breast cancer), and categorical features (congressional voting records), have bootstrapped bias-variance decomposition results consistent with those found in the simulated datasets with the same data structure characteristics (see Figures B13-B15). The traditional within-feature normalization methods (z-standardization, min-max, maxAbs, quantile transformation) result in risk function values that are the same or worse than using raw data or quantile normalization. For the wine quality data, using raw or quantile normalized data in logistic regression, linear SVM, or neural network results in best risk function performance, while quantile normalization with logistic regression was best for the breast cancer data. For the congressional voting records data, logistic regression and neural network with raw and quantile normalized data were also found to be the best method-model combinations, consistent with simulated data results. However, it is interesting to note that z-standardization in both of these models resulted in the worst risk function performance among all other method-model combinations applied to this dataset, due to both increased bias and variance.
For the binary target data with mixed data type features (arrhythmia, abalone), raw data and quantile normalization also lead to the best risk function performance. However, the arrhythmia dataset is a relatively more complex dataset in comparison to the others tested. It has missing data, many features in comparison to few instances (i.e. 279 features vs. 452 instances), and imbalanced target data. As a result, logistic regression is not well suited for describing these complex relationships, and has worse risk function performance; decision tree and gradient boosting regression with raw data or quantile normalization have best results (FigureB17). In contrast, while the abalone dataset is also an imbalanced dataset, it has less complex data structures with only 8 features and over 4000 instances. In this case, logistic regression, linear support vector machine, and neural network are well-suited for the less complex data, and result in improved risk function values due to decreased variance in comparison to the more complex models (FigureB16).
For assessing results on continuous target data, the forest fires dataset (numeric features), solar flare dataset (categorical features), and auto MPG dataset (mixed type features) were considered. Once again, in all cases, the within-feature normalization performed the same or worse than using raw data, with the between-feature quantile normalization process being the only method that resulted in same or some improvement to risk function values. For both numeric and categorical only datasets (forest fires and solar flare, respectively), the linear-based logistic regression and neural network models with raw data or quantile normalization resulted in best performance (see FigureB18 and FigureB19), with all other normalization-model combinations resulting in both increased bias and variance. In the case of the more complex mixed feature type auto MPG dataset, gradient boosting regression also has improved performance, but logistic regression has similar performance and is a faster algorithm (FigureB20).
Findings of the empirical study and benchmark data were applied to existing studies, using best-performing normalization-model combinations in the model development process.
5.1. NCAA Tournament Data
For the 2019 NCAA Men’s Basketball Tournament Bracket prediction problem, data was used from over 100,000 NCAA regular season games, with the goal to take information about two teams as input, and output a probability of team 1 winning a game. Motivated by the popular Kaggle competition, models were developed to minimize log-loss between predicted win probabilities and actual game outcomes, as in:
This loss function has high penalty for models that are both confident and wrong (  ). Model development involved:
• Readily available game statistics, provided by Kaggle.
• Commonly used external ratings systems (Massey Ratings).
• No additional feature engineering.
• No domain knowledge.
The analysis value-added in comparison to previous research and public models found on Kaggle was to focus on the comparison of various normalization techniques for model development. In particular, the use of outside domain knowledge (public health, genetics) to apply a technique from one domain (genetic research) to an unrelated domain (sports data) proved advantageous but was not, initially, statistically motivated. However, through simulation of bias-variance decomposition and findings from application to benchmark data, it is expected that improved loss function performance for this type of data (ranked data, balanced target, non-missing data) can be achieved by using a linear-based model with raw or quantile normalized data. Rather than iterating through many normalization-model combinations (which took over 12 hours of computation time when building the model for the 2019 tournament), logistic and linear SVM models with raw and quantile normalized data were trained on NCAA regular season data from 2014-2017 and tested on the 2018 tournament. The same features were used as in the 2019 model, with only the model and normalization selection process updated based on the model development framework findings. From the simulation results, logistic regression and SVM for both raw and quantile normalization provided similar results, although SVM outperformed logistic somewhat due to decreased variance, although it has increased bias. In the updated NCAA application on 2018 data, logistic regression with raw data outperformed the other tested models, with a log-loss score of 0.569 (Figure 9). This log-loss score, in comparison to other Kaggle submissions in the 2018 tournament, would have ranked 23rd out of 933 teams (98 percentile) and required only the original data supplied by Kaggle and no additional feature engineering or model tuning. In addition, the entire model development process and testing took less than 30 minutes. Three out of the four models developed correctly predicted the Final Four including the tournament Champion, Villanova. This is compared with a 2019 bracket that, while it correctly predicted the tournament winner, only predicted two of the Final Four teams, and scored in the 90th percentile of Kaggle Log-loss scores.
5.2. Credit Risk Data
Previous research by Rudd and Priestley (rudd2017comparison) compare the use of logistic regression and decision trees for prediction of commercial credit risk. The dataset, provided by Equifax, included over 11 million records and over 300 features, and involved extensive data preprocesing including imputation, feature reduction, and transformation. The effects of normalization were not considered at the time. Based on findings from simulations and benchmark results, it was found that gradient boosting regression with raw and quantile normalization should also be considered for this type of data. Running the analysis again, this time including gradient boosting regression, found best results for gradient boosting with raw data (AUC = 0.96). A drawback, however, of this result in the context of credit risk analysis is that gradient boosting is much more
Figure 9. 2018 men’s basketball march madness bracket developed using winning model, logistic regression with raw data.
difficult to explain than the logistic regression and decision tree models, and can be problematic in a heavily regulated industry where model interpretability is required (Figure 10).
6. Conclusions and Suggestions
In this study, simulation was used to investigate the effects of normalization on downstream analysis results. Normalization methods were investigated by utilizing a decomposition of the empirical risk functions, measuring effects on model bias, variance, and irreducible error. Estimates of bias and variance were then used as diagnostic procedures for data pre-processing and model development. We used our findings to propose model development and algorithm design choices that best minimize common design effects on bias and variance.
Figure 10. AUC curves for selected model-normalization combinations tested based on model framework results.
Mean square error (MSE) was considered as the evaluation metric, and the effects of a selection of normalization methods were measured on the empirical risk function. Normalization techniques were selected that represent both data invariant as well as data variant normalization strategies. For example, techniques such as z-score standardization (transforms data to have a mean of zero and a standard deviation of 1) and feature scaling (rescaling data to have values between 0 and 1) change the spread and position of data points (all by consistent factors) but do not change the distribution shape of the data, whereas techniques such as quantile normalization, commonly used in genetic differential expression analysis, affect the measures of spread, position, and shape. Through simulation of various data structures and bootstrap sampling of the bias-variance decomposition, best performing model-normalization-data structure combinations were found to illustrate the downstream analysis effects of these model development choices. For example, it was found that for rank-based data with binary target, quantile normalization performed better than the data invariant methods with similar or improved performance over raw data due to decreased variance in the loss function value. In addition, results found from simulations were verified and expanded to include additional data characteristics (imbalanced, sparse) by testing on benchmark datasets available from the UCI Machine Learning Library. Normalization results on benchmark data were consistent with those found using simulations, while also illustrating that more complex and/or non-linear models provide better performance on datasets with additional complexities, such as wide data (large feature to instance ratio) as in the arrhythmia dataset. Finally, applying the findings from simulation experiments to previous applications led to equivalent or improved results with less model development overhead and processing time. Applying the model framework to the 2018 NCAA Men’s Basketball data resulted in a log-loss score that would have been ranked 23 out of 933 teams (98th percentile) and only required 30 minutes of model overhead, as opposed to a 2019 model that required over 12 hours of processing and resulted in a 90th percentile log-loss score.
While this work provides a statistical illustration of the downstream effects of model development choices, it represents a baseline for further research in this area. For example, while the bias-variance decomposition simulations described in this study illustrate that model and normalization method selection do affect downstream results, they are only suggestive of theoretical properties of these specific methods that should be further explored. Also, a researcher’s primary modeling goal (i.e. predictive accuracy vs. explanatory model) will determine both appropriate model and pre-processing technique selection. In addition, the main goal of normalization is to put features on comparable scale for improved model fitting, performance, and interpretability. Considering normalization as a model selection procedure and selecting based on minimized risk function value can potentially lead to overfitting. Finally, this study considers a limited selection of models and model performance measures, while assuming all other proposed aspects of the model development framework are held constant. A more exhaustive study of performance assessments should be considered to better establish the downstream analysis effects of statistical procedures, including coverage probabilities, misclassification rates, sensitivity/specificity, etc. In this study, we selected MSE due to the ability to generalize the risk functions across multiple data types and model applications. In addition, these assessments need to include additional consideration on various combinations of model development strategies within the model development process, i.e. sample selection, feature engineering, model validation, etc.
6.2. Future Research
In this research, an empirical study was used to illustrate the downstream analysis effects of model development choices. Further theoretical work is recommended to connect the empirical findings theoretical properties. For example, since the generalized linear models were most often found to have best risk function results regardless of data structure or selected normalization, additional theoretical study can enhance the explanation of these results. If we assume that MLE is unbiased in GLM then estimates of bias should resolve towards zero. In this case, bias-variance decomposition of the loss will be completely defined by variance. As an extension, the Cramer-Rao lower bound property of MLE (the lower bound of the variance of the estimator) suggests that no other model will achieve a better result of the bias-variance loss decomposition. This potential explanation requires theoretical understanding of at least two questions: 1) Are generalized linear models, in fact, unbiased? and 2) Does the Cramer-Rao lower bound theorem apply to variance of the prediction and not just the parameters? In addition, estimates of predictive risk are only one way to assess performance of a statistical procedure and downstream analysis effects. In order to provide further justification connecting the empirical evidence with the proposed framework, it is recommended to consider additional measures of performance such as coverage of predictive intervals, class probabilities cutoff points, gain and lift charts, etc. The diversity of analytic choices and resulting modeling pipelines leads to a wide range of potential future research required to truly quantify the complete effects of a unified model development framework.
A.1.1. Bivariate Normal Data with Binary Target
Table A1. Bias-variance decomposition results for bivariate normal data with binary target.
A.1.2. Bivariate Normal Data with Continuous Target
Table A2. Bias-Variance decomposition results for bivariate normal data with continuous target.
A.1.3. Bivariate Normal Data with Poisson Target
Table A3. Bias-variance decomposition results for bivariate normal data with poisson target.
A.1.4. Ranked Data with Binary Target
Table A4. Bias-variance decomposition results for ranked data with binary target.
A.1.5. Ranked Data with Continuous Target
Table A5. Bias-variance decomposition results for ranked data with continuous target.
A.1.6. Ranked Data with Poisson Target
Table A6. Bias-variance decomposition results for ranked data with poisson target.
A.1.7. Categorical Data with Binary Target
Table A7. Bias-variance decomposition results for categorical data with binary target.
A.1.8. Categorical Data with Continuous Target
Table A8. Bias-variance decomposition results for categorical data with continuous target.
A.1.9. Categorical Data with Poisson Target
Table A9. Bias-variance decomposition results for categorical data with poisson target.
A.1.10. Mixed Data with Binary Target
Table A10. Bias-variance decomposition results for mixed data with binary target.
A.1.11. Mixed Data with Continuous Target
Table A11. Bias-variance decomposition results for mixed data with continuous target.
A.1.12. Mixed Data with Poisson Target
Table A12. Bias-variance decomposition results for mixed data with poisson target.
A.2. Benchmark Results
A.2.1. Wine Quality Data with Binary Target
Table A13. Bias-variance decomposition results for wine quality data with binary target.
A.2.2. Breast Cancer Data with Binary Target
Table A14. Bias-variance decomposition results for breast cancer data with binary target.
A.2.3. Voting Data with Binary Target
Table A15. Bias-variance decomposition results for congressional voting data with binary.
A.2.4. Abalone Data with Binary Target
Table A16. Bias-variance decomposition results for abalone data with binary target.
A.2.5. Arrhythmia Data with Binary Target
Table A17. Bias-variance decomposition results for arrhythmia data with binary target.
A.2.6. Forest Fires Data with Continuous Target
Table A18. Bias-variance decomposition results for forest fires data with continuous target.
A.2.7. Solar Flares Data with Continuous Target
Table A19. Bias-variance decomposition results for solar flares data with continuous target.
A.2.8. Auto MPG Data with Continuous Target
Table A20. Bias-variance decomposition results for auto mpg data with continuous target.
B.1.1. Bivariate Normal Data with Binary Target
Figure B1. Bias-variance decomposition for bivariate normal data with binary target.
B.1.2. Bivariate Normal Data with Continuous Target
Figure B2. Bias-variance decomposition for bivariate normal data with continuous target.
B.1.3. Bivariate Normal Data with Poisson Target
Figure B3. Bias-variance decomposition for bivariate normal data with poisson target.
B.1.4. Ranked Data with Binary Target
Figure B4. Bias-variance decomposition for ranked data with binary target.
B.1.5. Ranked Data with Continuous Target
Figure B5. Bias-variance decomposition for ranked data with continuous target.
B.1.6. Ranked Data with Poisson Target
Figure B6. Bias-variance decomposition for ranked data with poisson target.
B.1.7. Categorical Data with Binary Target
Figure B7. Bias-variance decomposition for categorical data with binary target.
B.1.8. Categorical Data with Continuous Target
Figure B8. Bias-variance decomposition for categorical data with continuous target.
B.1.9. Categorical Data with Poisson Target
Figure B9. Bias-variance decomposition for categorical data with poisson target.
B.1.10. Mixed Data with Binary Target
Figure B10. Bias-variance decomposition for mixed data with binary target.
B.1.11. Mixed Data with Continuous Target
Figure B11. Bias-variance decomposition for mixed data with continuous target.
B.1.12. Mixed Data with Poisson Target
Figure B12. Bias-variance decomposition for mixed data with poisson target.
B.2. Benchmark Data Results
B.2.1. Wine Quality Data
Figure B13. Bias-variance decomposition for wine quality data with binary target.
B.2.2. Breast Cancer Data
Figure B14. Bias-variance decomposition for breast cancer data with binary target.
B.2.3. Voting Data
Figure B15. Bias-variance decomposition for congressional voting data with binary target.
B.2.4. Abalone Data
Figure B16. Bias-variance decomposition for abalone data with binary target.
B.2.5. Arrhythmia Data
Figure B17. Bias-variance decomposition for arrhythmia data with binary target.
B.2.6. Forest Fires Data
Figure B18. Bias-variance decomposition for forest fires data with continuous target.
B.2.7. Solar Flares Data
Figure B19. Bias-variance decomposition for solar flares data with continuous target.
B.2.8. Auto MPG Data
Figure B20. Bias-variance decomposition for auto MPG data with continuous target.
 Silberzahn, R., et al. (2018) Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Advances in Methods and Practices in Psychological Science, 1, 337-356.
 Dietterich, T.G. and Kong, E.B. (1995) Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms. Technical Report, Department of Computer Science, Oregon State University, Corvallis.
 Evans, C., Hardin, J. and Stoebel, D.M. (2017) Selecting Between-Sample RNA-Seq Normalization Methods from the Perspective of Their Assumptions. Briefings in Bioinformatics, 19, 776-792.
 Rudd, J.M. (2018) Application of Support Vector Machine Modeling and Graph Theory Metrics for Disease Classification. Model Assisted Statistics and Applications, 13, 341-349.
 Forman, G., Scholz, M. and Rajaram, S. (2009) Feature Shaping for Linear SVM Classifiers. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 299-308.