As shown in an action plan shown by World Health Organization in 2015, antimicrobial resistance is a really serious problem in today’s infectious disorder. Due to the fast evolution of bacteria, they obtain the ability of resisting antimicrobial drugs. Among various single- or multi-drug resistant bacteria, Methicillin-resistant Staphylococcus aureus (MRSA) is one of the most popular and serious infectious microbial. The ability of surviving under the treatment by methicillin might come from the genomic sequence of microbial, however, still it is unclear in which mutation of genome causes it. Especially, reason of phenotypic difference between MRSA strains has not been well studied.
To analyze the relationship between genotypes and phenotypes, we have red whole genome sequences of 96 MRSA strains by a next generation sequencer. After mapping the short reads from sequencer onto a reference genome sequence of MRSA, thousands of mutations called insertion or deletion (Indel) have been detected in the whole genome sequences of 96 MRSA strains. On the other hand, we prepared two phenotypes of these MRSA strains. The first phenotype is about pathogeny, and the second is about drug resistance. Then, next problem to be solved is finding the relationship between mutations and phenotypes. Actually, most of the mutations might be irrelevant from these phenotypes and only a small subset of them might cause the difference in the phenotypes of the strains. In this paper, we applied machine learning algorithms, which is classification by support vector machine and feature selection for the improvement of classification accuracy. Through the process of finding a feature subset for higher classification accuracy, features (i.e. mutations) irrelevant to a phenotype are naturally removed. As a result, we could achieve highly accurate classification by less than 10 features in average. For effectively selecting features from high dimensional binary vectors representing the existence of mutations in MRSA strains, we used cross entropy based sparse logistic regression. Since our MRSA mutation data show a certain level of sparsity (around 80% of values are zero), the algorithm is expected to improve the performance of classification.
2.1. Sparse Logistic Regression
Sparse model is an approach to reduce complexity by neglecting less influential features in the model. The sparse model leads to achieve more interpretable selected features for a typical sparse dataset. The sparsity concept has been gaining attention in many fields of application such as statistics, data mining, and signal processing . Sparsity also plays an important role in genomic studies especially in term of reducing fraction of genes in order to accurately predict disease out of non-disease samples . The idea of this technique is simply achieved by modifying the model (e.g. by adding regularization on the model) that simultaneously produces good predictions and sparse (i.e. zero coefficient for less influential features) .
One of the applications of sparse model is that in logistic regression which is a statistical model to handle a binary (dichotomous) classification problem. Many researchers have been dealing with sparse model along with logistic regression [2,4-6]. This binary response can be viewed as a nonlinear function of features. Let be a sample vector of size , be a vector of features, and be the vector of probability estimates. Logistic regression generates the coefficients of features to predict a logit transformation of the probability of sample cases:
where is the intercept and is the j-th coefficient of j-th feature. The log-likelihood function of above equation is defined as:
where ,vector of coefficients.
The advantage of logistic regression is its possibility to estimate the probabilities and simultaneously for each class and making classification.
To construct a sparse logistic regression is simply by adding a nonnegative penalty or constrain term to the logistic regression model in order to reduce the dimension of features. The most well-known penalty is one proposed by Tibshirani , the L1-penalty, also known as LASSO (least absolute shrinkage and selection operator). The L1-penalty equals to the sum of all absolute coefficients, . This penalty constrain simultaneously performs feature selection and coefficient estimation. The penalized logistic regression then becomes . The estimate of coefficient then defined as
The scalar λ is a tuning parameter. Choosing this parameter is crucial in the feature selection to successfully reach high accuracies .
2.2. Cross Entropy Method
The Cross Entropy Method (CEM) was originally introduced by Rubinstein  as an adaptive algorithm to estimate probabilities of rare events in complex stochastic network by involving minimization of variance. By then a simple modification of the original work could be used for solving a complex combinatorial optimization problem . Many applications have been shown to testify the power of the CE method for solving NP-hard problems. The advantages of the cross entropy method in comparison to other optimization methods, such as simulated annealing, tabu search, and genetic algorithm, are fast and optimal updating rules [11,12].
Briefly, the CEM involves two iterative phases: 1) Generating random data sample as candidate parameters using a specified mechanism, 2) Updating the parameters based on the data to produce “better” solution in the next iteration . The details of CEM are as follows. Suppose that we aim to minimize (or maximize) the target function S over the solution space Ω and let the corresponding minimizer. For simplicity let assume we have only one single variable to minimize S(x). Denote the minimum fitness by ,so that
So, we wish to minimize the function S(x) over all x in the solution space Ω. Based on CEM, initially we generate n samples as the candidate solution to minimize S(x). The samples can be generated according to a uniformly random distribution or a certain normal random distribution. Let’s assume, for instance, a normal random distribution. Next, we calculate the fitness of each sample and sort the samples based on the fitness. Then calculate the mean and the standard deviation only on the best fitness proportion or the elite samples. Based on this pair of mean and standard deviation we generate samples in the next iteration. These iterative procedures are conducted until the stopping criteria satisfied.
The tutorial on CEM is given in  and some examples of CEM application can be found in . The following figure shows pseudo-code of the cross-entropy algorithm for normal distribution.
Input: and % initial distribution parameters
n % sample size
ρ % elite sample size
ε % stopping criterion
d % initial criterion
t = 0% iteration
while d ≥ ε
· set t = t + 1
· generate matrix B of size n × p based on vector β0 and the corresponding standard deviation σ0
· for each column of B, fit the values to the objective function
· evaluate the fitness
· sort each column of B partially based on the fitness
· take the best ρ sample of each column of B
· calculate the means and deviations to get new parameters βt+1 and σt+1
· calculate d
3. RESULTS AND DISCUSSIONS
From male and female patients mainly more than 60 years old at Kanazawa University Hospital from 1998 to 2015, 96 strains of MRSA were taken and their DNA sequences have been extracted. DNA sequences from a strain are red by Hiseq 2500, one of the most popular and reliable next generation sequencers. The output of the sequencer is a huge amount of fixed length subsequences called short reads. In this experiment, for each strain we obtained around 6.5 million of short reads with the length of 150 base. Using Bowtie 2 software, they are mapped to HO 5096 0412, the reference genome sequence which we chose. After that, 5587 Indels were detected by using VarScan software. It means that we prepared 96 feature vectors with binary values for 5587 features. At this point, a feature corresponds to a specific insertion or deletion at a position of reference genome sequence. For instance, a feature “581245:T->TTCAGAC” corresponds to an insertion of “TCAGAC” right after the “T” at position 581245 of the reference genome sequence. Since two or more features sharing completely the same pattern of occurrence in 96 strains are harmful for classification accuracy, such redundant features are unified. Finally, 96 feature vectors with 1978 unified features have been prepared. About the phenotypes used as class labels to be predicted, a pathogenic phenotype (1: developed, 0: latent) is identified for all 96 strains. For another phenotype about drug resistance, resistance of each strain to four kinds of antimicrobial drugs (Piperacillin (PIPC), Sulbactam/Ampicillin (S/A), Cefazolin (CEZ), and Clindamycin (CLDM)) has been tested (1: PIPC, S/A, and CEZ resistant, 0: PIPC, S/A, CEZ, and CLDM resistant). Since we could not precisely identify this phenotype for two strains, 94 strains were used for predicting the drug resistance phenotype. In this context, two features became meaningless and were removed since they showed the same value (all zero or all one) for 94 strains. Therefore, 1976 unified features were used for predicting drug resistance phenotype. Details of two datasets are summarized in Table 1.
In our experiment we adopted leave one out cross validation (LOOCV) to ensure our methods work well. As a result, we have training-testing pair datasets as the number of samples, i.e. 96 for Pathogenicity dataset and 94 for Drug resistance dataset. Dealing with high dimensional dataset in some circumstance is time consuming, especially when adopting LOOCV. To avoid spending too much time for eliminating so many unimportant features, we applied random forest method to remove zero-importance features from training datasets. After removing these features, the sparse logistic regression took place to select and estimate the features and their coefficients simultaneously. We employed CE algorithm to estimate the coefficients. At the end we used support vector machine (SVM) classifier for sample classification.
As shown in Table 2 and Table 3, the performance of our proposed method applied to both datasets based on classification accuracy was almost perfect. In Pathogenicity dataset we reached 100% accuracy, while in Drug resistance dataset the accuracy was 97.87%. They outperformed the accuracies 92.1% and 95.5% achieved by the similar experiments we did with correlation-based feature selection with grid search.
Since the set of selected features are different in every iteration of cross-validation, we counted each feature’s occurrence (i.e. inclusion in feature set) through the cross validation. Relative frequency indicates the percentage of occurrence, and features with high relative frequency are shown in Figure 1 and Figure 2. For the top 4 selected features in Pathogenicity datasets, the occurrences of features are relatively to have the same level when LOOCV to apply. In contrast, we can easily see the disparity among the top 4 selected features in Drug resistance dataset. This implies that feature selection in Pathogenicity dataset might be more stable than in Drug resistance dataset.
Feature selection to achieve high accuracy classification is one of the main focuses in the data analysis for high-dimensional datasets. A powerful technique to a typical dataset does not guarantee to be appropriate for another dataset. We have shown that our method can perform good results in tackling feature
Table 1. Brief information for the datasets.
Table 2. Results for Pathogenicity dataset. Best performance is shown in bold face.
Best performance shown in boldface.
Table 3. Results for Drug resistance dataset. Best performance is shown in bold face.
Figure 1. Top features selected in pathogenicity dataset. These features are frequently included in the set of selected features through cross-validation.
Figure 2. Top features selected in Drug resistance dataset. These features are frequently included in the set of selected features through cross-validation.
selection for sparse datasets especially for the datasets we used in this work. In addition, this result could be utilized to develop a novel rapid test method in the future for checking MRSA phenotypes, if the feature set is further tested by real experiments.
We would like to express our gratitude to Professor Takashi Wada and Dr. Yasunori Iwata at Division of Infection Control, Kanazawa University for providing MRSA data. In this research, the super-computing resource was provided by Human Genome Center, the Institute of Medical Science, the University of Tokyo. Additional computation time was provided by the super computer system in Research Organization of Information and Systems (ROIS), National Institute of Genetics (NIG). This work was supported by JSPS KAKENHI Grant Number 26330328.
 Liang, Y., Liu, C., Luan, X.-Z., Leung, K.-S., Chan, T.-M., Xu, Z.-B. and Zhang, H. (2013) Sparse Logistic Regression with a L1/2 Penalty for Gene Selection in Cancer Classification. BMC Bioinformatics, 14, 198-210.
 Liu, Z., et al. (2007) Sparse Logistic Regression with Lp Penalty for Biomarker Identification. Statistical Applications in Genetics and Molecular Biology, 6, Article 6.
 Tibshirani, R. (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58, 267-288.
 Rubinstein, R.Y. (1999) The Cross-Entropy Method for Combinatorial and Continuous Optimization. Methodology and Computing in Applied Probability, 1, 127-190.
 Kroese, D.P., Porotsky, S. and Rubinstein, R.Y. (2006) The Cross-Entropy Method for Continuous Multi-Extremal Optimization. Methodology and Computing in Applied Probability, 8, 383-407.
 Kroese, D.P., Rubinstein, R.Y. and Taimre, T. (2007) Application of the Cross-Entropy Method to Clustering and Vector Quantization. Journal of Global Optimization, 37, 137-157.