rcentages were ordered from lowest to highest, and a feasible number of 120 predictors was chosen from the top of the list.
Figure 1. The matrix plot of the error term variability in percent of the total. The low values are represented in the plot with darker colors.
Model 1: Logistic Regression with AIC Selection-Gender: After the selection procedure, the resulting model retains 40 predictors. The AIC model correctly classified 384 females as females and 318 males as males. On the other hand, it incorrectly classified 47 females as males and 70 males as females. Based on these results, the model achieved an accuracy of 85.71 percent. We then validated the model's performance with 10-fold cross-validation.
With the optimal threshold but without cross-validation, the model reached an accuracy rate of 86.20%, whereas the cross-validated model reached 85.59%.
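The accuracy figure for Model 1 follows directly from the four reported counts. The short sketch below recomputes it and also derives per-class recall; the variable names are illustrative and not taken from the original analysis.

```python
# Counts reported in the text for Model 1 (AIC logistic regression, Gender).
female_as_female = 384  # females correctly classified as females
male_as_male = 318      # males correctly classified as males
female_as_male = 47     # females misclassified as males
male_as_female = 70     # males misclassified as females

total = female_as_female + male_as_male + female_as_male + male_as_female
accuracy = (female_as_female + male_as_male) / total

# Per-class recall: the share of each true class that the model recovered.
female_recall = female_as_female / (female_as_female + female_as_male)
male_recall = male_as_male / (male_as_male + male_as_female)

print(f"n = {total}, accuracy = {accuracy:.2%}")
print(f"female recall = {female_recall:.2%}, male recall = {male_recall:.2%}")
```

The counts sum to 819 subjects, and 702 correct classifications out of 819 give the 85.71 percent stated above.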
Model 2: Logistic Regression with Ranked Predictors-Gender: The model obtained using logistic regression contains 29 variables that are statistically significant (p-values below 0.05), intercept included. The model achieved an accuracy of 85.71 percent. With the optimal threshold but without cross-validation, the model reached an accuracy rate of 88.64%, whereas the cross-validated model reached 87.67%.
Model 3: Linear Discriminant Analysis with Predictors-Gender: The next statistical technique proposed for this classification problem was Linear Discriminant Analysis (LDA). This method characterizes two or more classes of objects based on their means and variances, and the result is used as a linear classifier. The LDA model correctly classified 402 females as females and 321 males as males. On the other hand, it incorrectly classified 52 females as males and 44 males as females. Based on these results, the model achieved an accuracy of 88.28 percent. The accuracy rates for the trained model and for the model using the best prior are almost the same. The cross-validated model using the best prior, however, shows a much lower accuracy rate of 49.08% (Figure 2).
Model 4: Random Forest with Ranked Predictors-Gender: The last statistical technique applied to this classification problem is Random Forest. This is a more general technique that uses a multitude of decision trees to determine which class best fits the object to be classified. With the 120 ranked predictors and the mtry parameter held constant, the random forest reaches an accuracy rate of 72.80 percent.
When mtry is tuned dynamically, the best configuration for this set of ranked predictors corresponds to a value of 68, where the accuracy rate reaches 73.02 percent.
Figure 2. Graphical representation of the accuracy given different numbers of predictors for Gender.
Model 5: Regression with Ranked Predictors-Age: The second subject characteristic selected as the response variable was Age. As with the previous models, the regression analysis was performed using the ranked predictors, so that the model accuracy can be compared with the others at the same level. The model obtained using regression analysis contains 12 variables that are statistically significant (p-values below 0.05), intercept included. While the r-squared has a value of 0.2881, the adjusted r-squared has a lower value of 0.1658. Based on these results, the model performed predictions with an accuracy of 9.1 percent.
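The gap between the r-squared and the adjusted r-squared reflects the penalty for the number of predictors. The sketch below applies the standard adjustment formula under two assumptions that the text does not state for this model: a sample of 819 subjects (the total implied by the gender classification counts) and 120 predictors. The function name is illustrative.

```python
def adjusted_r_squared(r2: float, n_obs: int, n_predictors: int) -> float:
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# 0.2881 is the r-squared reported in the text.
# n_obs=819 and n_predictors=120 are assumptions, not stated for the Age model.
adj = adjusted_r_squared(0.2881, n_obs=819, n_predictors=120)
print(round(adj, 4))
```

Under these assumptions the result comes out near 0.166, which is close to the reported adjusted value of 0.1658; the small difference would be consistent with rounding of the reported r-squared.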
Model 6: Random Forest with Ranked Predictors-Age: We now apply Random Forest to the same set of ranked predictors and evaluate whether performance improves. The tuning parameter “mtry” was held constant at a value of 11. The Random Forest with constant mtry produced results similar to those of the previous model: its r-squared value is also close to 9 percent.
Model 7: Random Forest with Ranked Predictors-Age as Categorical: Initially, the ordinal response variable was transformed to a numerical type to avoid losing the order information. This was done by taking the mid-point of the range of every category. Because that model did not perform well, the same statistical technique is now applied to the same set of values from a classification perspective. Random Forest can be used for both prediction (regression) and classification. The first configuration keeps mtry constant at a value of 11 throughout the procedure.
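The mid-point transformation mentioned above can be sketched as follows. The category labels in the example are hypothetical, since the actual age ranges are not listed in the text, and the helper name is illustrative.

```python
def range_midpoint(label: str) -> float:
    """Return the mid-point of an age range written as 'low-high'."""
    low, high = (float(part) for part in label.split("-"))
    return (low + high) / 2.0

# Hypothetical category labels; the real age ranges are not given in the text.
example_categories = ["20-24", "25-29", "30-34"]
numeric_age = [range_midpoint(c) for c in example_categories]
print(numeric_age)  # [22.0, 27.0, 32.0]
```

This keeps the ordering of the categories but treats the spacing between mid-points as meaningful, which is the assumption the regression models rely on.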
The model classified the categories with an accuracy of 47 percent. We next examine whether tuning mtry dynamically changes the result. Accuracy was used to select the optimal model, taking the configuration with the largest value. The final value used for the model was mtry = 113.
When mtry is tuned dynamically, the best configuration for this set of ranked predictors corresponds to 113 selected predictors, where the accuracy rate reaches 48.04 percent (Figure 3).
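The selection rule described above (keep the mtry value with the largest accuracy) amounts to an argmax over the candidate values. The sketch below uses only the two accuracies stated in the text for the Age classifier; a real search would evaluate a full grid of candidates, which is not reported here.

```python
# Accuracies stated in the text for the Age classifier, keyed by mtry.
# Only these two values are reported; the full tuning grid is not given.
accuracy_by_mtry = {11: 0.47, 113: 0.4804}

# Select the mtry value whose accuracy is largest.
best_mtry = max(accuracy_by_mtry, key=accuracy_by_mtry.get)
print(best_mtry, accuracy_by_mtry[best_mtry])
```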
5. Accuracy Assessment and Recommendations
Having applied four different statistical methods (Regression, Logistic Regression, Linear Discriminant, Random Forest) to classify or predict two relevant subject traits, it is possible to assess how these models performed based on the accuracy rate obtained with each method. For the sake of comparison, all models were fitted using the same set of ranked predictors, which makes it possible to determine the best choice using a similar amount of computational resources. The following table shows a summary of the accuracy measurement for each technique at every level of optimization.
Figure 3. Graphical representation of the accuracy given different numbers of predictors for Age.
Both the prediction and the classification analyses yield different accuracy measurements at each level of the process. The standard level corresponds to training the model on the entire dataset and predicting on the same values. The second level corresponds to the same standard process with an added optimization step to determine the best threshold. The last level represents a cross-validation procedure using the optimal threshold.
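The second level, choosing the best classification threshold, can be illustrated with a simple scan over candidate cut-offs. This is a minimal sketch and not the procedure used in the study: the function names, the grid of thresholds, and the toy probabilities are all assumptions made for illustration.

```python
def accuracy_at(threshold, probs, labels):
    """Accuracy when class 1 is predicted for probabilities >= threshold."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)

def best_threshold(probs, labels, steps=99):
    """Scan thresholds 0.01..0.99; return the first (threshold, accuracy) with top accuracy."""
    candidates = [(i + 1) / (steps + 1) for i in range(steps)]
    scored = ((t, accuracy_at(t, probs, labels)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])

# Toy fitted probabilities and true labels, for illustration only.
probs = [0.15, 0.30, 0.35, 0.45, 0.55, 0.62, 0.80, 0.90]
labels = [0, 0, 1, 1, 1, 1, 1, 1]

default_acc = accuracy_at(0.5, probs, labels)
threshold, tuned_acc = best_threshold(probs, labels)
print(default_acc, threshold, tuned_acc)
```

On this toy data the default cut-off of 0.5 misclassifies two positive cases, while a lower threshold separates the classes fully. Because the threshold is tuned on the same data it is scored on, the gain is optimistic, which is why the third level re-evaluates it under cross-validation.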
The motivation for using cross-validation is to avoid overfitting. Without cross-validation, the accuracy measure only tells how the model performs in that specific dataset. The main interest in this case is having an accuracy measure that could represent the correctness of the model for any new dataset of this type. For this reason, the goodness of the model will be evaluated based on cross-validated results (Table 1).
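The idea behind the cross-validated level can be sketched in a few lines: each fold is held out once, the model is fitted on the remaining data, and accuracy is averaged over the held-out folds. The helper names and the toy one-feature classifier below are assumptions for illustration; they are not the models used in the study.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(xs, ys, fit, predict, k=10, seed=0):
    """Average held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores)

# Toy classifier: cut-off halfway between the two class means of one feature.
def fit_midpoint(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2.0

def predict_midpoint(cut, x):
    return int(x >= cut)

# Toy data, for illustration only.
xs = [float(v) for v in range(20)]
ys = [0] * 10 + [1] * 10
score = cross_val_accuracy(xs, ys, fit_midpoint, predict_midpoint, k=5)
print(score)
```

Every observation is scored exactly once by a model that never saw it during fitting, so the averaged accuracy is a better guide to performance on new data than the accuracy computed on the training set itself.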
Selecting 120 ranked predictors for each statistical technique was necessary to balance an adequate accuracy rate, viable computation times, and the exclusion of irrelevant predictors. The linear discriminant technique performed well using the optimal prior, but its accuracy rate dropped sharply under cross-validation, from 88.28 to 49.08 percent. For this reason, it was the first of the three techniques used to model gender to be discarded. Random Forest also performed well with mtry held constant, and slightly better when the parameter was tuned dynamically: the accuracy rate went from 72.80 to 73.02 percent. It was the most robust technique, allowing gender to be modeled with over a thousand predictors. The results with more than 200 predictors are not included here, because they had little effect on the accuracy rate (about 1 percent better, but 5 times slower computationally). Although the Random Forest model performed well and was the most robust, it was discarded because the two remaining models outperformed it.
Table 1. Table of the accuracy measurement for each technique at every level of optimization.
Logistic Regression performed best in classifying the subject's gender based on functional connectivity. The AIC Logistic Regression model achieved an accuracy rate of 85.6 percent. Alternatively, the Logistic Regression model that keeps the entire set of ranked predictors achieved an accuracy rate of 87.7 percent. It is interesting to note that the model with the AIC-selected features was better at classifying males, whereas the complete ranked model was better at classifying females.
Even though the Logistic Regression technique was not as robust as the Random Forest, it achieved better accuracy rates after cross-validation. Moreover, because this type of model is based purely on linear relationships, it is easier to explain and can be implemented more easily by other researchers with little or no expertise in statistical analysis.
When considering Age as the response variable, the first technique, regression analysis, failed to capture a pattern that predicts the subject's age. This variable was provided at an ordinal level of measurement. The first approach consisted of converting each category to a continuous value in order to avoid losing the information carried by the order. Random Forest was applied using the same specification and also failed, with an r-squared of 9.75 percent, compared with 9.10 percent for the regression technique.
The results improved when the variable was treated as nominal with five categories. The Random Forest technique with dynamically tuned mtry achieved an accuracy rate of 48.80 percent. Any set of between 200 and 1600 predictors produced similar accuracy rates.