rch, we selected a wrapper FS method called Feature Importance. The Feature Importance is a method that estimates the importance of features using bagged decision trees like random forest and extra trees  .
Our goal in this project is not to improve machine leaning algorithms but compare the performance of different algorithms using a subset of features against the complete set of features. Therefore, we use a simple FS method that serves our goals.
4. Classification Techniques
Classification is a process of finding a model/function that recognizes, describes, and differentiates two or more classes. The purpose of using a classification model is to predict the class of an object where its class label is unknown. Classification has a predictive nature―for a given set of features, the goal is to attempt to predict the value of another feature  . A classifier model is used to classify the actual data into defined classes. Ultimately, patterns need to exist in the data that can be exploited. The classification techniques used in this paper are explained below.
1) Multinomial Logistic Regression:
A Logistic Regression is a binary classifier that can distinguish between two classes. Multinomial Logistic Regression (MLR) is a logistic regression that is designed to solve multiclass/multinomial classification problems. In other words, Multinomial Logistic Regression is a model that predicts the probabilities of different possible outcomes of a categorically distributed dependent variable, given a set of independent variables  . It is used when the dependent variable is nominal and falls into one of many (i.e. three or more) classes/categories (e.g. the MNIST dataset has nine classes).
MLR classifier is commonly used as an alternative to naive Bayes classifier since it does not assume statistical independence of the random features that used as predictors. The MLR classifier is simple and requires a little time to learn, however, it becomes slow when use a very large number of classes to learn.
2) Decision Tree:
A Decision Tree (DT) classifier is a flow-chart-like structure, where the topmost node represents the root node of the tree, each intermediate node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf node indicates a class label. The paths from root to leaf represent classification rules.
DT is a simple model to understand and interpret. However, computation can get very complex if many outcomes are liked (i.e. a decision tree reaches a great depth). The DT classifier becomes biased in favor of attributes with more levels/depth; however, it can be easily combined with other decision techniques in order to make better and accurate decisions  .
3) Random Forest:
A Random Forest (RF) classifier  is one of the most effective techniques that is used for predictive analytics. It is an ensemble learning classifier that combines decisions from a sequence of base classifiers in order to make a decision. RF is correct for decision trees’ habit of overfitting to their training set  . Decision Trees that have great depth tend to overfit their training sets thus have low bias but very high variance. RF classifier reduces the variance by averaging multiple deep decision trees that have trained using different parts of the same training set. Although this process leads to a small increase in the bias, it improves the overall performance of the final model.
4) Boosted Trees:
A Boosted Trees (BT) or Gradient Boosting Machine (GBM) is an additive technique that is developed by Friedman  . It combines predictions from a sequence of base classifiers in order to make a prediction. Typically it is constructed of a number of simple/weak DT models where the output is the sum of the decisions of these base learners. The purpose of BT method is to achieve better predictive performance by minimizing the loss function when adding trees. Generally adding more trees makes the model more resistance to overfitting. Therefore, keep adding trees until no further improvement is observed can lead to build an efficient model that gives very accurate predictions.
5) Convolutional Neural Network:
A Convolutional Neural Network (CNN) is a deep learning method that can be used for the purpose of feature extraction as well as classification. Hence, CNN acts as a feed-forward network that extracts topological properties from images  . It extracts features from raw images (i.e. contain the intended pattern) in its first layers, and then classifies the pattern with its last layers  .
CNN is a powerful technique for image processing as well as natural language processing. The main advantage of CNN is the high accuracy in its results. However, it requires high computational cost. In addition, it needs a lot of data to be trained. The complexity of the CNN slows down the training process thus it is necessary to use a good GPU to overcome this problem.
5. Experimental Results
As previously mentioned, we use the MNIST dataset to perform the experiment of this study. In addition, we use a FS method to be examined with different classifiers: MLR, DT, RF, and BT. At the last stage of this study, we compare the same classifiers used at the previous stage in addition to CNN. In this section, we describe the tools that used throughout the experiment of this study, describe all the experiment stages, and finally present and discuss the obtained results.
1) Hardware and Software Tools
Since we consider some factors (e.g. training time) that are affected by the tools used in the experiment, it is worth mentioning the hardware and software that have been used throughout this research. For the hardware tools, we used a laptop of MacBook Pro, 2.8 GHz Intel Core i7 with 4 GB RAM. All the setup and computations of this research have been performed on top of the CPU. For the software tools, GraphLab Create1 is used for the implementation of this paper. GraphLab Create is a machine learning framework that is python-based libraries. It is designed to handle the major properties of real world data: scalable, variety, and complexity. The biggest advantage of using GraphLab is that it supports scalability, which allows using large datasets. It also supports different data source such as JSON, CSV, HDFS/S3 and many more. Graphlab is a complete framework that has rich libraries for data transformation and manipulation. It is worth mentioning that we import some libraries from scikit-learn for the purpose of implementing the FS method used in this paper.
Since the MNIST dataset is divided into standard training and testing sets, 80% and 20% respectively, we divide the MNIST training set into two proportions, 80% as a training set and 20% as a validation set. Therefore, the MNIST dataset is divided into three sets: training set (60%), validation set (20%), and testing set (20%). The training set is used to train the classifiers for the purpose of recognizing handwritten digits while the test set is used to assess the trained classifiers after fine-tuning them using the validation set.
Before training the classifiers, we select subsets of various sizes (i.e. number of features) using Feature Importance method. Table 1 presents the time required for selecting different subsets using Feature Importance method. Over all the subset sizes, the time required for selecting a subset is reasonable. The required time does not change much when selecting small or large subset (e.g. the required time for selecting subsets of 20% and 80% of features are 37.76 sec and 38.87 sec, respectively). However, few seconds are added to the required time whenever the number of features increases.
3) Study I: subset vs. complete set
At this study, we perform FS on the MNIST dataset in order to select the best subset of features to be compared with the complete set of features.
a) Select the best subset of features
In order to find the best subset, we train the MLR, DT, RF, and BT classifiers using all the subsets selected by the Feature Importance method at the pre-processing stage. The results show that the validation error does not change much after the subset of 60% of features. Moreover, the validation error becomes fairly stable by using subsets of more than 60% of features. Accordingly, the subset of 60% of features is used for the comparison with the complete set of features.
b) Train and Fine-tune classifiers
After we examined different sizes of features, we train the classifiers (MLR, DT, RF, and BT) using the subset of 60% of features as well as complete set of features (784 features).
The last step before we draw the comparison is to fine-tune the classifiers using the validation set in order to improve their performances. The classifiers trained with the subset of features as well as complete set of features are fine-tuned
Table 1. Time required to select various sizes of subset using feature importance method.
using the validation set. The MLR and the RF classifiers have been fine-tuned by increasing the number of iterations while the DT and BT classifiers have been fine-tuned by controlling the tree depth and the number of trees, respectively. As shown in Figure 3, the MLR and RF have the same validation error after 15 iterations. For the comparison, models with 50 iterations are enough since the validation accuracy stabilized in both classifiers. However, the RF has not really improved by increasing the number of iterations. The validation accuracy of the DT classifier stabilizes after depth 30 thus we limit the depth of the DT classifier to 30. Since the validation accuracy of the BT does not improve significantly after using 130 trees, we limit the number of trees to 130.
c) Compare subset against complete set
The last step of this study is to compare the performance of the classifiers using the subset of features (60%) against the complete set of features. As shown in Table 2 and Table 3, training the classifiers using a subset of features improves the training time as well as gives about the same test accuracy as the classifiers trained with the complete set of features. Moreover, the training time of all the classifiers has significantly reduced by using only 60% of the complete set of features.
4) Study II: suitable classification technique
In this study, we examine various classification techniques in order to find the most suitable techniques for handwritten digit recognition. The classification techniques used in this study are MLR, DT, RF, BT and CNN.
Figure 3. Validation error of classification techniques over different number of iterations.
Table 2. Evaluation of different classification techniques using subset of features (60%).
Table 3. Evaluation of different classification techniques using complete set of features.
a) Train and fine-tune the classification techniques
In order to have a fair and valuable comparison, all the classification techniques are trained using the complete set of features. The MLR, DT, RF, and BT classifiers are trained as same as they trained in the Study I. A one layer CNN has been trained to classify an input to one of the ten classes (0 - 9) available in the dataset. Similar to the other classifiers, the CNN is fine-tuned by increasing the number of iterations in order to fulfill the training criteria.
b) Compare the classification techniques
Table 3 presents the training accuracy, validation accuracy, test accuracy, and training time for all techniques after fine-tuning them based on the validation set. In this comparison, we did not consider the time required for testing since we use single classifiers (i.e. not combined classifiers) that do not require more than few seconds to predict any new input. As shown in Table 3, the MLR, DT, and RF have low test accuracy compared with the BT and CNN. However, they outperform the BT and CNN in terms of training time. The DT has the lowest training time (71.11 sec), although it has the lowest test accuracy (90.94%) among the four techniques. The CNN outperforms all the techniques in terms of test accuracy (98.36%). The BT has the highest training accuracy (100%) and the second highest test accuracy (97.91%). However, the BT requires the longest time to be trained among all other techniques (1396.03 sec).
Figure 4 shows the confusion matrix of the BT and the CNN since they have the highest test accuracy compared with other techniques. The confusion matrix shows a more detailed breakdown of correct and incorrect classifications of each class on the MNIST test set. It composed of target (actual) label, predicted label, and the count of predicting the target label as the predicted label. As shown in Figure 4, the BT and CNN has misclassified digit “5” 19 times, mostly it was classified as digit “3”. The CNN outperforms the BT in classifying all the digits except digit “6” and “8”. The highest misclassified digit using the BT classifier is digit “7” that has been classified as digit “2” 15 times, while digit “6” is the most misclassified digit using the CNN, 16 mistakes. Using the CNN, digit “6” was often classified as digit “0”. However, the CNN was the best in classifying digit “0”, only 3 mistakes. On the other hand, the BT classifier was the best in classifying digit “1”, 9 mistakes. In general, we suggest that most of the mistakes made by the BT and CNN were due the same shape that some digits have (e.g. “0” and “6”). Figure 5 shows the common and individual mistakes of the CNN and BT when classify digit “7” as digit “2”.
Figure 4. Confusion matrix: (a) BT confusion matrix; (b) CNN confusion matrix.
(a) (b) (c) (d)
Figure 5. Digit “7” classified as digit “2” by CNN and BT (a) common mistakes between CNN and BT (b) individual mistakes by CNN (c) individual mistakes by BT.
It is worth mentioning that the results obtained in the previous section can be optimized using different methods. We convinced of these results since our target was to find the most suitable classification techniques that can be used to solve the problem of recognizing handwritten digits. From this experiment, we can conclude that the CNN and the BT are the most suitable classification techniques among all other techniques used in this research or similar to them. Since the BT classifier requires a long time to be trained using the complete set of features, we suggest training the BT using a subset of 60% of features using FS method. This reduces the required time, complexity of the trained model, and required storage for digit recognition while still provides very similar performance to the model trained with the complete set of features.
As we discussed in the related work section, Liu et al. in  claim that training classifiers using the complete set of features available in the MNIST dataset gives unsatisfactory performance. However, the obtained results in the implementation section (Study I) show that the models trained with the complete set of features have accuracies not less than the accuracies obtained for the models trained using FS. Therefore, there is about 40% of features exist in the MNIST dataset are irrelevant/redundant features.
It is worthless to compare our results to other results since most of the related work has some limitations. Most of the related work available in the literature does not specify the type of the reported accuracy: training, validation, or testing accuracy. However, it is very important to identify the type of the obtained accuracy since the training accuracy could reach 100% if we kept fine-tuning the model until perfectly fit the training dataset. Although this leads to overfitting problem that causes a reduction in the test accuracy. We can notice from Table 3 that the BT classifier achieves 100% training accuracy without affecting the test accuracy. This is because the BT is an ensemble classifier that tends to be robust to overfitting, however, it will eventually overfit.
Many studies fine-tune and assess models using the same data (testing set). This leads the test results to be overly optimistic, which means that the model is trained to give a good estimation on the test results. Therefore, assessing the models using the test set will be worthless since the model will give bad results once used to predict new data. To avoid this problem, the model should be fine-tuned using the validation set and then assessed using the test set.
The last point of this discussion is that some studies that use un-standard datasets do not present the distribution of the instances/examples available in the dataset. Imbalance datasets lead to have a high test accuracy with many false positive predictions. Therefore, the accuracy metric is not enough to evaluate the classification techniques since it does not capture everything. The confusion matrix is a very useful metric that shows the details of the predictions made by the trained models.
7. Conclusion & Future Work
This paper proposed a subset of 60% of features to train classification techniques for digit recognition. In addition, it has implemented and compared models of different algorithms: linear, non-linear, ensemble, and deep learning to study their suitability for digit recognition. The obtained results show that the proposed subset of features is not only minimized but also reduces the required time for training the model. Moreover, it reduces the computational complexity and lowers the required storage for digit recognition. The BT and CNN proved their suitability for digit recognition in terms of accuracy. However, the BT classifier requires using a subset of features since it needs a long time to be trained. We evaluated the performance in terms of accuracy, training time, and confusion matrix.
In future, we plan to examine more FS methods with different classification techniques. Combining various FS methods is another area to be explored. Based on our observations from the confusion matrices of the BT and CNN, some digits are well recognized by some classifiers while misclassified by other classifiers. This observation motivates us to try FS with a combined machine learning techniques in order to enhance the required time as well as accuracy.