In the last few decades, a large number of classification methods have been developed. Among the most widely used techniques are K-Nearest-Neighbors (KNN), artificial neural networks (ANNs), support vector machines (SVMs) and ensembles of classification trees such as random forest (RF).
Algorithms based on decision trees (DT) are easy to apply because fewer parameters need to be estimated; hence, they lend themselves to a high degree of automation. However, this comparative advantage of DT over ANN can be offset by a tendency to overfit the data. For these reasons, both ANN and DT have, in recent years, been replaced by more advanced machine learning algorithms (MLAs) that are simpler to train. During the past decade, the family of kernel methods such as SVM and ensembles of trees such as RF have emerged as very promising methodologies for classification purposes.
Several studies demonstrate that MLAs are more accurate than statistical techniques such as discriminant analysis or logistic regression, especially when the feature space is complex or the input datasets are expected to have different statistical distributions. As computational power has increased, MLAs have gained greater attention and the quality of pattern recognition systems has increased correspondingly. Thus, in most classification studies, RF, KNN and SVM are reported as the foremost classifiers producing high accuracies.
The choice of algorithm depends on a number of factors, such as the number of examples in the training set, the dimensionality of the feature space, whether there are correlated features and whether overfitting is a problem. Once these concerns have been addressed, the algorithm to use is decided. Using methods of statistical physics, the generalization performance of SVMs, which had been introduced as a general alternative to neural networks (NN), was investigated. The study showed that, for nonlinear classification rules, the generalization error saturates on a plateau when the number of examples is too small to properly estimate the coefficients of the nonlinear part. When trained on simple rules, SVMs were found to overfit only weakly. The performance of SVMs is strongly enhanced when the distribution of the inputs has a gap in feature space.
To avoid human-introduced biases, Raczko and Zagajewski used a 0.632 bootstrap procedure to evaluate three nonparametric classification algorithms (SVM, RF and ANN) in an attempt to classify the five most common tree species. The classification results indicated that ANN achieved the highest median overall classification accuracy (77%), followed by SVM with 68% and RF with 62%. Analysis of the stability of the results showed that RF and SVM had the lowest variance of overall accuracy and κ (kappa) coefficient (12 percentage points), while ANN had a variance of 15 percentage points. Another study showed that there exist some data distributions for which the maximal unpruned trees used in RF do not perform as well as trees with a smaller number of splits and/or a smaller node size. This refined the earlier report that RF does not overfit as the number of trees grows. Thus, application of RF in general requires careful tuning of the relevant classifier parameters. Bosch et al. demonstrated that using random forests/ferns with an appropriate node test reduces training and testing costs significantly compared with a multi-way SVM while achieving comparable performance.
The performance of the various classification methods, however, still depends greatly on the general characteristics of the data to be classified. The exact relationship between the data to be classified and the performance of the various classification methods remains to be determined. Thus far, no classification method has been found that works best on every problem. Various problems have been associated with the classification methods in current use. Therefore, to determine the best classification method for a certain dataset, a trial and error approach is used to identify the method with the best performance.
In this review paper, the performances, strengths and shortcomings of the KNN, SVM, RF and NN classifiers are examined and compared. Answers to the following questions are sought. What are the strengths and weaknesses of these algorithms on a set of classification problems? Which one performs better, and under what conditions does one classifier perform better than the others? The four nonparametric classification methods were therefore evaluated on the following criteria: robustness to training data, sensitivity to changes, data fitting, stability, ability to handle large data sizes, sensitivity to noise, time invested in parameter tuning, and accuracy.
2. Materials and Methods
2.1. Support Vector Machines (SVM)
Support Vector Machines (SVM) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. The SVM algorithm developed by Cortes and Vapnik tries to find the optimal hyperplane in n-dimensional classification space with the largest margin between classes (Figure 1).
The SVM algorithm is often reported to achieve better results than other classifiers, although it has been indicated that the main reason to use an SVM is that the problem might not be linearly separable. In that case, an SVM with a non-linear kernel such as the Radial Basis Function (RBF) would be suitable. Another related reason to use SVMs is a high-dimensional feature space. For example, SVMs have been reported to work better for text classification, although this requires a lot of training time.
Figure 1. A simple illustration of the Support Vector Machine (SVM) algorithm in 2-dimensions.
The SVM is an extension of the support vector classifier, obtained by enlarging the feature space in a specific way using kernels.
The linear support vector classifier can be represented as shown in Equation (1):
$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle \qquad (1)$$
where $\alpha_i$ and $\beta_0$ are parameters that are estimated from the inner products between all pairs of the n training observations. The inner product can be replaced with a generalization of the form $K(x_i, x_{i'})$, where K is some function called the kernel. The linear kernel over p features is represented as shown in Equation (2):
$$K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j} \qquad (2)$$
The polynomial kernel of degree d (where d is a positive integer) can be represented as shown in Equation (3):
$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^{d} \qquad (3)$$
When the support vector classifier is combined with a non-linear kernel such as that in Equation (3), the resulting classifier is known as the SVM.
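As an illustration of the kernels discussed above, the following minimal Python sketch evaluates a linear kernel, a polynomial kernel of the form (1 + inner product) raised to the power d, and a decision value of the form bias plus a weighted sum of kernel evaluations. The function names, the two support vectors and the coefficients `alphas` and `beta0` are hypothetical values chosen by hand; no training is performed.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, z: Vector) -> float:
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x: Vector, z: Vector, d: int = 2) -> float:
    """Polynomial kernel of degree d: (1 + <x, z>) ** d."""
    return (1.0 + linear_kernel(x, z)) ** d


def decision_value(x: Vector, support_vectors, alphas, beta0: float,
                   kernel: Callable[[Vector, Vector], float]) -> float:
    """Evaluate f(x) = beta0 + sum_i alpha_i * K(x, x_i)."""
    return beta0 + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))


# Hypothetical support vectors and coefficients (assumptions, not fitted values).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, -0.5]
beta0 = 0.0

score = decision_value((2.0, 0.5), support_vectors, alphas, beta0, linear_kernel)
label = 1 if score >= 0 else -1  # the sign of f(x) gives the predicted class
```

Passing `polynomial_kernel` instead of `linear_kernel` changes only the similarity measure; the form of the decision function stays the same.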
The SVM classifier, which was originally designed for binary classification, is a kernel-based supervised learning algorithm that can be extended to classify data into more than two classes; it is not recommended when there is a large number of training examples. A kernel function maps the training set into a space in which it more closely resembles a linearly separable data set. The purpose of the mapping is to increase the dimensionality of the data set, and the kernel function allows this to be done efficiently. Some of the commonly used kernel functions are the linear, RBF, quadratic, Multilayer Perceptron and polynomial kernels. The linear kernel function performs well with linearly separable data sets and the RBF kernel function performs well with non-linear data sets. Training the SVM takes less time with the linear kernel function than with the RBF kernel function. The linear kernel function is also less prone to overfitting than the RBF kernel function.
The performance of the SVM classifier relies on the choice of the regularization parameter C, which is also known as the box constraint, and the kernel parameter, which is also known as the scaling factor. Together they are known as the hyperplane parameters. During the training phase, the SVM builds a model, maps the decision boundary for each class and specifies the hyperplane that separates the different classes. Increasing the distance between the classes by increasing the hyperplane margin helps increase the classification accuracy. SVMs can also be used to effectively perform non-linear classification.
SVMs have been successfully applied in many diverse fields, including text and hypertext categorization; image detection, verification and recognition; speech recognition; bankruptcy prediction; remote sensing image analysis; time series forecasting; information and image retrieval; information security; the biological sciences (e.g., bioinformatics and the classification of proteins); and the chemical sciences (e.g., data from spectroscopy, chromatography-mass spectrometry and nuclear magnetic resonance).
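The contrast between the linear and RBF kernels on data that are not linearly separable can be illustrated with a small sketch. Note that the learner below is a kernel perceptron, a simpler algorithm swapped in for illustration; it is not an SVM solver and does not maximize the margin. The RBF form exp(-gamma * squared distance), the parameter name `gamma`, and the XOR toy data are assumptions made for this example.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma is an assumed scaling factor."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def linear_kernel(x, z):
    """Plain inner product."""
    return sum(a * b for a, b in zip(x, z))


def train_kernel_perceptron(X, y, kernel, epochs=20):
    """Return per-example mistake counts (alphas) of a kernel perceptron."""
    alphas = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            f = sum(a * yj * kernel(xj, xi) for a, xj, yj in zip(alphas, X, y))
            if yi * f <= 0:  # misclassified (or on the boundary): strengthen this example
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return alphas


def predict(x, X, y, alphas, kernel):
    f = sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alphas, X, y))
    return 1 if f > 0 else -1


def training_accuracy(X, y, kernel):
    alphas = train_kernel_perceptron(X, y, kernel)
    hits = sum(predict(xi, X, y, alphas, kernel) == yi for xi, yi in zip(X, y))
    return hits / len(X)


# XOR pattern: no straight line separates the two classes.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]
rbf_acc = training_accuracy(X, y, rbf_kernel)
linear_acc = training_accuracy(X, y, linear_kernel)
```

On this toy pattern the RBF kernel fits all four points, whereas the linear kernel cannot, which mirrors the statement that the RBF kernel suits non-linear data sets.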
2.2. K-Nearest Neighbor (KNN)
In pattern recognition, the KNN algorithm is an instance-based learning method used to classify objects based on their closest training examples in the feature space. An object is classified by a majority vote of its neighbors, that is, the object is assigned to the class that is most common amongst its k-nearest neighbors (Figure 2), where k is a positive integer. In the KNN algorithm, the classification of a new test feature vector is thus determined by the classes of its k-nearest neighbors.
The KNN algorithm is implemented using the Euclidean distance metric to locate the nearest neighbors. The Euclidean distance between two points x and y is calculated using Equation (4):
$$d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \qquad (4)$$
where N is the number of features, such that $x = (x_1, x_2, \ldots, x_N)$ and $y = (y_1, y_2, \ldots, y_N)$.
The KNN classifier is one of the many approaches that attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with the highest estimated probability. Given a positive integer K and a test observation, $x_0$, the KNN classifier first identifies the K points in the training data that are closest to $x_0$, represented by $\mathcal{N}_0$. It then estimates the conditional probability for class j as the fraction of points in $\mathcal{N}_0$ whose response values equal j, as indicated in Equation (5):
$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j) \qquad (5)$$
where $I(y_i = j)$ is an indicator variable that equals 1 if $y_i = j$ and zero if $y_i \neq j$.
Figure 2. A simple pictorial overview of the K-Nearest Neighbor (KNN) algorithm.
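The procedure just described (Euclidean distances, selection of the K closest training points, and class probabilities estimated as the fraction of those points carrying each label) can be sketched in a few lines of Python. The function names and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_class_probabilities(train_X, train_y, x0, k):
    """Estimate Pr(Y = j | X = x0) as the share of the k nearest points labelled j."""
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], x0))
    neighbour_labels = [train_y[i] for i in order[:k]]
    return {label: n / k for label, n in Counter(neighbour_labels).items()}


def knn_predict(train_X, train_y, x0, k):
    """Assign x0 to the class with the highest estimated probability."""
    probs = knn_class_probabilities(train_X, train_y, x0, k)
    return max(probs, key=probs.get)


# Toy training set (made up for illustration): two well-separated groups.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 5.0), (7.0, 7.0), (6.5, 6.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

probs_k3 = knn_class_probabilities(train_X, train_y, (2.5, 2.0), k=3)
probs_k5 = knn_class_probabilities(train_X, train_y, (2.5, 2.0), k=5)
prediction = knn_predict(train_X, train_y, (2.5, 2.0), k=5)
```

With K = 3 all neighbours of the query belong to class "a"; with K = 5 two points of class "b" enter the neighbourhood and the estimated probability of "a" drops to 3/5, although the predicted class is unchanged.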
KNN is robust to noisy training data and is effective for large numbers of training examples. For this algorithm, however, the value of the parameter k (the number of nearest neighbors) and the type of distance to be used have to be determined. The computation time can be lengthy, as the distance from each query instance to all training samples has to be computed, and the algorithm becomes significantly slower as the number of examples and/or predictors (independent variables) increases. Nevertheless, there is no need to build a model, tune several parameters or make additional assumptions. KNN is a simple, versatile and easy-to-implement supervised MLA that can be used to solve classification, regression and search problems. The algorithm assumes that similar items exist in close proximity; in other words, that ‘birds of a feather flock together’. The KNN algorithm hinges on this assumption being true enough for it to be useful.
Because KNN becomes significantly slower as the volume of data increases, it is an impractical choice in environments where predictions need to be made rapidly. Moreover, there are faster algorithms that can produce more accurate classification and regression results. However, provided there are sufficient computing resources to handle the data speedily when making predictions, KNN can still be useful in solving problems whose solutions depend on identifying similar objects.
To select the K that is right for a dataset, the KNN algorithm is run several times with different values of K. The K chosen is the one that reduces the number of errors encountered while maintaining the ability of the algorithm to make accurate predictions on data it has not seen before. There are other ways of calculating distance, and one way might be preferable depending on the problem that is being solved. However, the straight-line distance, also called the Euclidean distance, is a popular and familiar choice.
As the value of K decreases to 1, the predictions become less stable. Conversely, as the value of K is increased, the predictions become more stable owing to majority voting/averaging and are thus more likely to be accurate (up to a certain point). Eventually, an increasing number of errors is witnessed; at this point, the appropriate value of K has been exceeded. The value of K is usually an odd number so as to have a tiebreaker in cases where a majority vote among labels is required, for example, when picking the mode in a classification problem. The KNN algorithm can be used for classification, regression and search problems. It is useful in solving problems whose solutions depend on identifying similar objects.
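The selection of K by repeated runs can be sketched as follows. This example assumes leave-one-out cross-validation as the way of measuring errors on unseen data, which is one common choice rather than a procedure prescribed here; the one-dimensional toy data, including one deliberately mislabelled point, are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x0, k):
    """Majority vote among the k training points closest to x0."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x0))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_error(X, y, k):
    """Leave-one-out error rate: predict each point from all the others."""
    errors = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)


def select_k(X, y, candidates):
    """Return the candidate with the lowest error (smallest K wins ties)."""
    errors = {k: loo_error(X, y, k) for k in candidates}
    best_k = min(candidates, key=lambda k: (errors[k], k))
    return best_k, errors


# Two groups on a line, plus one noisy "b" label placed inside the "a" group.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (1.4,), (10.0,), (11.0,), (12.0,), (13.0,)]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

best_k, errors = select_k(X, y, candidates=[1, 3, 5, 7])
```

On this toy set, K = 1 is misled by the noisy point, K = 3 and K = 5 give the fewest errors, and K = 7 starts to produce more errors again because distant points outvote the local neighbourhood.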
2.3. Random Forest
Recently there has been a lot of interest in ensemble learning, that is, methods that generate many classifiers and aggregate their results. Two well-known methods are boosting and bagging of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees; each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction.
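A minimal sketch of bagging, under simplifying assumptions, is given below. Each base learner is a one-feature threshold rule (a decision stump) rather than a full classification tree, the class labels "a" and "b" are hard-coded, and the data and the fixed random seed are chosen only for illustration. Boosting is not shown.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Fit a threshold rule on one feature that minimises training errors."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for left, right in (("a", "b"), ("b", "a")):
            errors = sum((left if x <= threshold else right) != label
                         for x, label in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def bagged_predict(models, x):
    """Simple majority vote over independently built models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [(0.5, "a"), (1.0, "a"), (1.5, "a"), (2.0, "a"),
        (3.0, "b"), (3.5, "b"), (4.0, "b"), (4.5, "b")]

# Each model sees its own bootstrap sample and does not depend on the others.
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
low, high = bagged_predict(models, 0.0), bagged_predict(models, 5.0)
```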
An RF classifier consists of a number of trees, with each tree grown using some form of randomization (Figure 3). The leaf nodes of each tree are labeled by estimates of the posterior distribution over the image classes. Each internal node contains a test that best splits the space of data to be classified . An image is classified by sending it down every tree and aggregating the reached leaf distributions. Randomness can be injected at two points during training: in sub-sampling the training data so that each tree is grown using a different subset, and in selecting the node tests .
The number of trees necessary for good performance grows with the number of predictors. The best way to determine how many trees are necessary is to compare predictions made by a forest with predictions made by a subset of that forest. When the subsets work as well as the full forest, there are enough trees. For selecting mtry (the number of predictors tried at each split), Breiman suggests trying the default, half of the default, and twice the default, and then selecting the best. If one has a very large number of variables but expects only very few to be “important”, using a larger mtry may give better performance. A large number of trees is necessary to get stable estimates of variable importance and proximity. Since the algorithm falls into the “embarrassingly parallel” category, one can run several random forests on different machines and then aggregate the vote components to get the final result.
The RF classifier adds an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, RFs change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables, while in an RF, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared with many other classifiers, including discriminant analysis, SVMs and NNs, and is robust against overfitting. In addition, RF is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values.
Figure 3. A pictorial overview of the random forest (RF) algorithm.
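The difference between a standard tree split and an RF split can be sketched as follows. The example assumes the Gini impurity as the split criterion, which is a common choice but is not specified above; the parameter name `mtry`, the helper names and the toy data are likewise assumptions.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Return (weighted_gini, feature, threshold) of the best split found."""
    best = None
    n = len(rows)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


def random_forest_split(rows, labels, mtry, rng):
    """Draw mtry candidate features at random, then split on the best of them."""
    candidates = rng.sample(range(len(rows[0])), mtry)
    return candidates, best_split(rows, labels, candidates)


rows = [(1.0, 7.0, 0.2), (2.0, 3.0, 0.9), (3.0, 8.0, 0.1), (4.0, 2.0, 0.8)]
labels = ["a", "a", "b", "b"]

standard = best_split(rows, labels, range(3))  # standard tree: all features
candidates, randomized = random_forest_split(rows, labels, mtry=2,
                                             rng=random.Random(42))
```

The randomized split can never be better than the standard one on the same node, but drawing a fresh feature subset at every node makes the individual trees differ from one another.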
RF is essentially a set of DTs combined so that each tree votes on the class assigned to a given sample, with the most frequent answer winning the vote. This algorithm handles categorical features very well, and can also handle high-dimensional spaces as well as a large number of training examples. RFs are quite versatile, hence their popularity and application in diverse fields. A decision tree is a set of conditions organized in a hierarchical structure. It is a predictive model in which an instance is classified by following the path of satisfied conditions from the root of the tree until reaching a leaf, which corresponds to a class label. A DT can easily be converted to a set of classification rules.
The following types of scientific and engineering data are amenable to RF: DNA data; micro-array data; spectral data such as NMR chemical data, and molecular structure prediction; quality assessment of manuscripts published in a particular journal; and finding clusters of patients based on, for example, tissue marker data or symptoms of a particular disease, among others.
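The description of a decision tree as a hierarchy of conditions, and its conversion to classification rules, can be illustrated with a short sketch. The nested-dictionary representation, the feature names and the thresholds below are hypothetical and were written by hand rather than learned from data.

```python
def classify(tree, instance):
    """Follow the satisfied conditions from the root until a leaf is reached."""
    while isinstance(tree, dict):
        branch = "yes" if instance[tree["feature"]] <= tree["threshold"] else "no"
        tree = tree[branch]
    return tree  # a leaf holds the class label


def to_rules(tree, conditions=()):
    """Flatten the tree into (list_of_conditions, class_label) pairs."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    feature, threshold = tree["feature"], tree["threshold"]
    return (to_rules(tree["yes"], conditions + (f"{feature} <= {threshold}",))
            + to_rules(tree["no"], conditions + (f"{feature} > {threshold}",)))


# Hypothetical hand-written tree with two internal nodes and three leaves.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "yes": "setosa",
    "no": {
        "feature": "petal_width", "threshold": 1.7,
        "yes": "versicolor",
        "no": "virginica",
    },
}

label = classify(tree, {"petal_length": 4.8, "petal_width": 1.5})
rules = to_rules(tree)
```

Each root-to-leaf path becomes one rule, so a tree with three leaves yields three rules.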
2.4. Neural Networks
An NN classifier can be described as a parallel computing system consisting of an extremely large number of simple processors with interconnections. One commonly used type of neural network is the multilayered feed-forward perceptron, which consists of several layers of neurons connected with each other (Figure 4). The multilayered perceptron can separate data that are nonlinear and generally consists of three or more types of layers.
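To illustrate how a multilayered feed-forward network separates data that a single linear unit cannot, the sketch below wires three threshold neurons into a network that computes the XOR function. The weights and biases are set by hand for illustration (an assumption); no learning takes place.

```python
def step(z):
    """Threshold activation: the neuron fires (1) when its input is positive."""
    return 1 if z > 0 else 0


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by the threshold activation."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_network(x1, x2):
    """Input layer -> two hidden neurons -> one output neuron."""
    h_or = neuron((x1, x2), (1.0, 1.0), -0.5)    # fires if at least one input is 1
    h_and = neuron((x1, x2), (1.0, 1.0), -1.5)   # fires only if both inputs are 1
    return neuron((h_or, h_and), (1.0, -1.0), -0.5)  # OR and not AND


outputs = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

The hidden layer re-represents the inputs so that the output neuron faces a linearly separable problem, which is the role the hidden layers play in larger networks.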
McCulloch and Pitts are generally credited as the designers of the first neural network and of the earliest mathematical models. Many of their ideas, such as the idea that many simple units combine to give increased computational power and the idea of a threshold, are still used today. The first learning rule for NN was developed on the premise that if two neurons were active at the same time, the strength of the connection between them should be increased. Further improvements and simulations were achieved. During the 1950s and 1960s, many researchers worked on the perceptron amidst great excitement; however, by 1969, enthusiasm for NN research had waned. Interest in NN research was rekindled in the mid-1980s. Because of their ability to reproduce and model nonlinear processes, NN have found applications in a wide range of sectors, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games, medical diagnosis (for example, of lung, prostate and colorectal cancers), data mining and quantum chemistry, among others.
Figure 4. A depiction of a neural network (NN) structure.
3.1. Assessing the Performance of a Model
With classification, accuracy alone is not always an adequate measure of the performance of a model. Consider analyzing a highly imbalanced data set, for example, trying to determine whether a transaction is fraudulent when only 0.5% of the data set consists of fraudulent transactions. One could then predict that none of the transactions is fraudulent and obtain a 99.5% accuracy score, which is very misleading. For this reason, the sensitivity and specificity are usually used. In the fraud detection problem, the sensitivity is the proportion of fraudulent transactions identified as fraudulent, and the specificity is the proportion of non-fraudulent transactions identified as non-fraudulent.
Therefore, in an ideal situation, both high sensitivity and high specificity are required, although that might change depending on the context. For example, a bank might want to prioritize sensitivity over specificity to make sure it identifies fraudulent transactions. The receiver operating characteristic (ROC) curve is a good way to display the two types of error metrics described above. The overall performance of a classifier is given by the area under the ROC curve (AUC). Ideally, the curve should hug the upper left corner of the graph and have an area close to 1.
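The fraud example can be worked through numerically. The sketch below assumes a hypothetical data set of 1000 transactions, of which 5 (0.5%) are fraudulent, and a classifier that labels every transaction as non-fraudulent; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """Proportion of actual positives (fraud) identified as positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Proportion of actual negatives (non-fraud) identified as negative."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# Hypothetical data: 1 = fraudulent, 0 = non-fraudulent.
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000  # a classifier that never flags fraud

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
```

The accuracy is 99.5% and the specificity is 100%, yet the sensitivity is zero: not a single fraudulent transaction is detected.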
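A minimal sketch of how the ROC curve and its area can be computed from classifier scores is given below. It uses the rank interpretation of the AUC, namely the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one; the labels and scores are made up for illustration.

```python
def roc_points(y_true, scores):
    """(false positive rate, true positive rate) pairs for decreasing thresholds."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_auc(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(y_true, scores)
auc_rank = roc_auc(y_true, scores)
auc_area = trapezoid_area(curve)
```

Both routes give the same area for this example, and a classifier that ranks every positive above every negative would reach an area of 1.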
3.2. Attributes of the Classification Algorithms
KNN classifies data based on a distance metric, whereas SVM needs a proper training phase. Because of the optimal nature of SVM, the separated data are guaranteed to be optimally separated. Generally, KNN is used as a multi-class classifier, whereas a standard SVM separates binary data belonging to one class or the other. Although SVMs look more computationally intensive, once training is done the model can be used to predict classes even for new unlabeled data. In KNN, however, the distances are recalculated each time a set of new unlabeled data is introduced. Hence, in KNN the distance metric always has to be defined. SVMs cover two major cases, in which the classes are either linearly separable or non-linearly separable. When the classes are non-linearly separable, a kernel function such as the Gaussian radial basis function or a polynomial is used. Hence, in KNN only the K parameter has to be set and a distance metric suitable for the classification selected, whereas in SVMs the regularization parameter C and, if the classes are not linearly separable, the kernel parameters have to be selected. A main advantage of SVM classification is that it performs well on datasets that have many attributes, even when only a few cases are available for the training process. However, disadvantages of SVM classification include limitations in speed and size during both the training and testing phases of the algorithm, and the difficulty of selecting the kernel function parameters.
KNN is easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data in use increases. KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression). For both classification and regression, the right K for a set of data is chosen by trying several values of K and picking the one that works best. However, KNN is less computationally intensive and easier to implement than SVM; hence, it is mostly used in the classification of multi-class data. Which algorithm guarantees reliable detection in unpredictable situations depends upon the data. If the data points are heterogeneously distributed, both KNN and SVM work well. For homogeneous data, one might be able to classify better by introducing a kernel into the SVM. For most practical problems, KNN is a bad choice because it scales badly: if there are a million labelled examples, it takes a long time (linear in the number of examples) to find the K nearest neighbors.
Different factors affect the capacity of NN to generalize, that is, to predict new data from the learning carried out with training data. The factors intrinsic to network design include the number of neurons and the network architecture. The problem of how to define the most suitable network architecture is related to the nature of the hidden layer. There is no rule for determining the number of hidden layers, but, theoretically, one single hidden layer can represent any Boolean function. In general terms, the higher the number of units in the hidden layer, the greater the capacity of the NN to represent the training data patterns. However, a high number of units in the hidden layer also produces a loss in the network’s generalization power.
Unlike most methods based on machine learning, RF needs only two parameters to be set for generating a prediction model, that is, the number of regression trees and the number of evidential features (m) used at each node to grow the trees. It has been demonstrated that with RF, the generalization error always converges as the number of trees increases; hence, overtraining is not a problem. On the other hand, reducing m reduces the correlation among trees, which increases the model’s accuracy.
Adding more data would lengthen NN training times to unacceptable levels, making it highly impractical to work with them. Larger input datasets lengthen classification times for NN more than for SVM and RF. NN have the potential to become a more widely used classification algorithm, but because of their time-consuming parameter tuning procedure, the numerous types of neural network architectures to choose from, and the high number of algorithms used for training NN, some researchers recommend SVM or RF as easier methods which repeatedly achieve results with high accuracies and are often faster.
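The regression variant mentioned above, in which the labels of the K closest examples are averaged rather than voted on, can be sketched as follows. The function name and the toy data are assumptions made for illustration; the loop over all training examples also makes the linear cost per query visible.

```python
import math


def knn_regress(train_X, train_y, query, k):
    """Average the numeric labels of the k training examples closest to the query."""
    # Every training example is visited once, so one query costs time
    # proportional to the number of examples (plus the sort).
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    nearest = by_distance[:k]
    return sum(label for _, label in nearest) / k


train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 10.0]

estimate = knn_regress(train_X, train_y, (2.2,), k=3)
```

The distant example at 10.0 is ignored because it is not among the three nearest neighbours of the query.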
The performance characteristics and attributes of the four types of non-parametric classification algorithms are summarized in Table 1.
Table 1. Comparison of the four non-parametric classification algorithms.
The assessed algorithms differ in how difficult they are to train. DT-based algorithms are less difficult to train; this applies to both simple regression trees and ensembles of trees (RF). When the data are very scarce, RF shows better performance than NN and SVM, which become more complex. SVMs are based on different kernel types, and the combination of parameters to be optimized differs accordingly. However, it should be strongly emphasized that no broad generalizations can be made about the superiority of any method for all types of problems, as the performance of the methods might vary for other datasets.
RF is too sensitive to small changes in the training dataset, is occasionally unstable and tends to overfit. KNN is easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data in use grows, and the ideal value of K for the KNN classifier is difficult to set. The NN method involves a high level of complexity in computational processing, which has made it less popular in classification applications. SVM and RF are insensitive to noise or overtraining, which shows their ability to deal with unbalanced data. Among the nonparametric methods, SVM and RF are becoming increasingly popular in image classification research and applications. Larger input datasets lengthen classification times for NN and KNN more than for SVM and RF. NN have the potential to become a more widely used classification algorithm, but because of their time-consuming parameter tuning procedure, the numerous types of neural network architectures to choose from and the high number of algorithms used for training NN, most researchers recommend SVM or RF as easier methods which repeatedly achieve results with high accuracies and are often faster.
The idea was developed by EYB and JO. Literature was reviewed by all authors. All authors contributed to manuscript writing and approved the final manuscript.
We thank the anonymous reviewers whose comments made this manuscript more robust.
This study attracted no funding.
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.
 Breiman, L. and Ihaka, R. (1984) Nonlinear Discriminant Analysis via Scaling and ACE. Department of Statistics, University of California, Berkeley.
 McCloskey, M. and Cohen, N.J. (1989) Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation, 24, 109-165.
 Brown, W.M., Gedeon, T.D., Groves, D.I. and Barnes, R.G. (2000) Artificial Neural Networks: A New Method for Mineral Prospectivity Mapping. Australian Journal of Earth Sciences, 47, 757-770.
 Rigol-Sanchez, J.P., Chica-Olmo, M. and Abarca-Hernandez, F. (2003) Artificial Neural Networks as a Tool for Mineral Potential Mapping with GIS. International Journal of Remote Sensing, 24, 1151-1156.
 Boser, B.E., Guyon, I.M. and Vapnik, V.N. (1992) A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, July 1992, 144-152.
 Abedi, M., Norouzi, G.H. and Bahroudi, A. (2012) Support Vector Machine for Multi-Classification of Mineral Prospectivity Areas. Computers & Geosciences, 46, 272-283.
 Rodriguez-Galiano, V.F. and Chica-Rivas, M. (2014) Evaluation of Different Machine Learning Methods for Land Cover Mapping of a Mediterranean Area Using Multi-Seasonal Landsat Images and Digital Terrain Models. International Journal of Digital Earth, 7, 492-509.
 Gutiérrez, S.L.M., Rivero, M.H., Ramírez, N.C., Hernández, E. and Aranda-Abreu, G.E. (2014) Decision Trees for the Analysis of Genes Involved in Alzheimer’s Disease Pathology. Journal of Theoretical Biology, 357, 21-25.
 Al-Anazi, A. and Gates, I.D. (2010) A Support Vector Machine Algorithm to Classify Lithofacies and Model Permeability in Heterogeneous Reservoirs. Engineering Geology, 114, 267-277.
 Waske, B. and Braun, M. (2009) Classifier Ensembles for Land Cover Mapping Using Multitemporal SAR Imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 64, 450-457.
 Chen, W., Xie, X.S., Wang, J.L., Pradhan, B., Hong, H.Y., Bui, D.T., Duan, Z. and Ma, J.Q. (2017) A Comparative Study of Logistic Model tree, Random Forest, and Classification and Regression Tree Models for Spatial Prediction of Landslide Susceptibility. Catena, 151, 147-160.
 Ghimire, B., Rogan, J., Galiano, V.R., Panday, P. and Neeti, N. (2012) An Evaluation of Bagging, boosting, and Random forests for Land-Cover Classification in Cape Cod, Massachusetts, USA. GIScience & Remote Sensing, 49, 623-643.
 Rodriguez-Galiano, V.F., Ghimire, B., Rogan, J., Chica-Olmo, M. and Rigol-Sanchez, J.P. (2012) An Assessment of the Effectiveness of a Random Forest Classifier for Land-Cover Classification. ISPRS Journal of Photogrammetry and Remote Sensing, 67, 93-104.
 Pierdicca, R., Malinverni, E.S., Piccinini, F., Paolanti, M., Felicetti, A. and Zingaretti, P. (2018) Deep Convolutional Neural Network for Automatic Detection of Damaged Photovoltaic Cells. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, 42, 893-900.