Many features of amino acids and of whole proteins influence the process of protein crystallization. As science and technology advance, more such features will doubtless be found. Each feature offers an insight from a viewpoint different from the others, and every new feature may have a certain relationship with the crystallization propensity of proteins.
The most notable are the physicochemical features of amino acids, which have been repeatedly correlated with the propensity of protein crystallization. Subsequently, protein features such as protein length, isoelectric point, percentage of charged residues and hydrophobicity were also correlated with the propensity of protein crystallization. With the compilation of amino acid features, efforts were once again made to correlate the propensity of protein crystallization with amino acid features that had not been used in previous studies [2, 4].
Apparently, all known features of amino acids and proteins have been tested. However, several features that we developed have not yet been widely tested against the crystallization propensity of proteins. Each feature needs to be tested against the crystallization propensity of as many different proteins as possible before a solid scientific conclusion can be drawn on whether it is suitable for predicting the crystallization propensity of proteins.
In this context, we tested three features, which combine features of both amino acids and the whole protein, against the crystallization propensity of proteins from Mycobacterium tuberculosis, and compared the results with those obtained from each of 530-plus amino acid features.
2. MATERIALS AND METHODS
In total, 428 proteins from Mycobacterium tuberculosis were found in TargetDB [5, 6] under the criterion of purified proteins, of which 277 were also found under the criterion of crystallized proteins. These two criteria were used in previous studies [7 - 15]. This database and others offer many other criteria, but this study focuses on the step between purified and crystallized proteins.
2.2. Features Possessed by Amino Acid and Protein
The first feature is the amino acid distribution probability, which is based on the occupancy of subpopulations and partitions. Occupancy describes the distribution of elementary particles over energy states under three assumptions about whether each particle and energy state is distinguishable, i.e. the Maxwell-Boltzmann, Fermi-Dirac and Bose-Einstein assumptions in statistical mechanics. As an example of its application to proteins, Rv1875 protein has 3 tyrosines, and the simplest question is how probable it is that the 3 tyrosines are clustered together or scattered along the protein sequence. This probability can be computed according to the equation
$$\frac{r!}{q_0! \times q_1! \times \cdots \times q_n!} \times \frac{r!}{r_1! \times r_2! \times \cdots \times r_n!} \times n^{-r},$$
where ! is the factorial, r is the number of a type of amino acid, r_i is the number of that amino acid in the i-th partition, q_k is the number of partitions with the same number k of amino acids, and n is the number of partitions in the protein for a type of amino acid. Each type of amino acid has only one distribution probability in a given protein. Because amino acid composition differs, each type of amino acid has its own distribution probability. Two worked examples are listed in columns 8 and 9 of Table 2 to show the distribution probability related to each type of amino acid in proteins.
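As a minimal sketch of this computation, the Python function below evaluates the classical occupancy formula n!/(q_0! × ... × q_n!) × r!/(r_1! × ... × r_n!) × n^(-r) for one arrangement of residues. The function name, the `occupancy` argument and the choice of three partitions for three tyrosines (n = r, so that n! = r!) are illustrative assumptions, not details taken from this study.

```python
from collections import Counter
from math import factorial, prod


def distribution_probability(occupancy):
    """Probability of one occupancy pattern of residues over partitions.

    occupancy[i] is the number of residues of a single amino acid type
    that fall in partition i (hypothetical input format).
    """
    n = len(occupancy)   # number of partitions
    r = sum(occupancy)   # number of residues of this amino acid type
    # q[k] = number of partitions that hold exactly k residues
    q = Counter(occupancy)
    # Ways to assign the occupancy numbers to partitions (n! equals r! when n == r)
    partition_ways = factorial(n) // prod(factorial(c) for c in q.values())
    # Ways to place r distinguishable residues with these occupancy numbers
    residue_ways = factorial(r) // prod(factorial(k) for k in occupancy)
    return partition_ways * residue_ways / n ** r


# Three tyrosines in three partitions: scattered, partly clustered, fully clustered
for pattern in ([1, 1, 1], [2, 1, 0], [3, 0, 0]):
    print(pattern, round(distribution_probability(pattern), 4))
```

For three residues in three partitions, the three possible patterns give 0.2222, 0.6667 and 0.1111, which sum to one.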
The second feature is the amino acid future composition, which comes from the observation that there are 64 RNA codons but only 20 types of amino acids, so each type of amino acid corresponds to a different number of RNA codons. For example, methionine corresponds to one RNA codon (AUG) and phenylalanine corresponds to two RNA codons (UUC and UUU), whereas leucine corresponds to six RNA codons (CUA, CUC, CUG, CUU, UUA and UUG). This naturally leads to different translation probabilities when a single nucleotide in an RNA codon mutates, and consequently the probability that an amino acid mutates to another amino acid differs (Table 1). For instance, when a mutation occurs in alanine, there is a 12/36 chance that it remains alanine, a 2/36 chance each that it mutates to aspartic acid or glutamic acid, and a 4/36 chance each that it mutates to glycine, proline, serine, threonine or valine. Two worked examples are listed in columns 10 and 11 of Table 2 to show the characteristic of this feature.
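The alanine figures can be reproduced by enumerating every single-nucleotide substitution in each codon of an amino acid under the standard genetic code. The sketch below does this in Python; the function and variable names are illustrative assumptions, and it only counts mutation outcomes rather than computing the future composition itself.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def mutation_spectrum(amino_acid):
    """Count the outcomes of all single-nucleotide substitutions
    over every codon that encodes `amino_acid` (one-letter code)."""
    outcomes = Counter()
    for codon, encoded in CODON_TABLE.items():
        if encoded != amino_acid:
            continue
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue  # not a mutation
                mutant = codon[:position] + base + codon[position + 1:]
                outcomes[CODON_TABLE[mutant]] += 1
    return outcomes


alanine = mutation_spectrum("A")
print(sum(alanine.values()), dict(alanine))
```

Alanine has four codons with nine possible substitutions each, so the counts sum to 36: 12 outcomes remain alanine, 2 each give aspartic acid and glutamic acid, and 4 each give glycine, proline, serine, threonine and valine.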
Table 1. Amino acids and their translated amino acids.
The third feature is the amino acid pair predictability, which is based on permutation. For instance, there are 15 leucines (L), 17 alanines (A), and 9 isoleucines (I) in Rv1155 protein. According to the permutation, the amino acid pair LA would appear twice (15/147 × 17/146 × 146 = 1.73), and there are indeed two LAs in reality, so the pair LA is predictable. However, the amino acid pair IA would appear once (9/147 × 17/146 × 146 = 1.04), but it actually appears three times in this protein, so the pair IA is unpredictable. In this way, the amino acid pairs in Rv1155 protein are classified as 72.5% predictable and 27.5% unpredictable.
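The expected-count arithmetic for the LA and IA pairs can be sketched as follows. The function names are hypothetical, and rounding the expected count to the nearest integer before comparing it with the observed count is an assumption inferred from the worked example, not a rule stated explicitly in the text.

```python
from collections import Counter


def expected_pair_count(n_first, n_second, length):
    """Expected occurrences of an ordered pair of two different amino acids
    in a protein of `length` residues, which has length - 1 adjacent pairs."""
    return n_first / length * n_second / (length - 1) * (length - 1)


def is_predictable(expected, observed):
    """A pair counts as predictable when the rounded expectation matches
    the observed count (assumed rule; Python rounds exact .5 ties to even)."""
    return round(expected) == observed


def observed_pair_counts(sequence):
    """Count every adjacent amino acid pair in a one-letter sequence."""
    return Counter(a + b for a, b in zip(sequence, sequence[1:]))


# Rv1155 figures from the text: 147 residues, 15 L, 17 A, 9 I
expected_la = expected_pair_count(15, 17, 147)
expected_ia = expected_pair_count(9, 17, 147)
print(round(expected_la, 2), is_predictable(expected_la, observed=2))
print(round(expected_ia, 2), is_predictable(expected_ia, observed=3))
```

With the counts quoted for Rv1155, LA has an expectation of 1.73 and is predictable with two observed pairs, whereas IA has an expectation of 1.04 and is unpredictable with three observed pairs.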
Because all three features are computed from individual amino acids together with their composition and/or distribution in a protein, they possess characteristics of both the individual amino acid and the whole protein.
2.3. Amino Acid Features
Amino acid features are the characteristics possessed by individual amino acids. Currently a database, AAindex, contains 540-plus amino acid features describing various aspects of amino acids, including physicochemical features, spatial features, electronic features, hydrophobic features, predictors for secondary structures, etc.
Table 2. Features for two proteins (FINA770101 is an amino acid feature that describes the helix-coil equilibrium constant).
Amino acid features are measured through experiments and documented, so they need not be computed for each protein, whereas the features described in the previous section must be computed for each protein. An amino acid feature is therefore a constant, i.e. each feature has a fixed value for each type of amino acid. In fact, only 531 amino acid features have values for all 20 types of amino acids. In this study, each amino acid feature served as a benchmark against which the results obtained from the features described in the previous section were compared.
Logistic regression was a major tool in previous studies because it models the relationship between a yes-no event and continuous numeric values, i.e. between the propensity of protein crystallization and proteins encoded with either amino acid features or protein features. In this study, an attempt was made to correlate each of the three protein features with the crystallization propensity of proteins from Mycobacterium tuberculosis through logistic regression and neural network models, and the results were compared with those obtained from modeling each of 531 amino acid features against the crystallization propensity of the same proteins.
The results were classified as true positive (TP), true negative (TN), false positive (FP) and false negative (FN), so the accuracy, sensitivity and specificity can be calculated as follows [9 - 15]: Accuracy = (TP + TN)/(TP + FP + TN + FN) × 100, Sensitivity = TP/(TP + FN) × 100, and Specificity = TN/(TN + FP) × 100, respectively. MATLAB was used to perform both logistic regression and neural network modeling [23, 24]. McNemar's test was used to compare the classified results. The sensitivity and specificity were compared using receiver operating characteristic (ROC) analysis [25 - 28]. The Mann-Whitney U-test was used to compare predicted accuracies at different cutoff values.
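The three performance measures follow directly from the four classification counts, as the short Python sketch below illustrates. The function name is an assumption and the counts in the example are hypothetical, chosen only so that they add up to 277 crystallized and 151 non-crystallized proteins; they are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity as percentages."""
    accuracy = (tp + tn) / (tp + fp + tn + fn) * 100
    sensitivity = tp / (tp + fn) * 100   # share of crystallized proteins found
    specificity = tn / (tn + fp) * 100   # share of non-crystallized proteins found
    return accuracy, sensitivity, specificity


# Hypothetical counts: 277 crystallized (tp + fn), 151 non-crystallized (tn + fp)
accuracy, sensitivity, specificity = classification_metrics(tp=200, tn=100, fp=51, fn=77)
print(f"{accuracy:.1f} {sensitivity:.1f} {specificity:.1f}")

# A classifier lies above the ROC diagonal when sensitivity > 100 - specificity
print(sensitivity > 100 - specificity)
```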
3. RESULTS AND DISCUSSION
Table 2 shows the differences between amino acid features and combined features. As can be seen, the amino acid feature FINA770101, which describes the helix-coil equilibrium constant, has a constant value for each type of amino acid (columns 4 and 5) regardless of the amino acid’s location, composition (columns 2 and 3) and neighboring amino acids. A simple remedy is to multiply this amino acid feature by the corresponding composition (columns 6 and 7, Table 2). By contrast, the two combined features have different values for different amino acids in those two proteins (last four columns, Table 2). This is an important distinction between combined features and amino acid features, and a rationale for correlating the combined features with the crystallization propensity of proteins from Mycobacterium tuberculosis.
Figure 1 showed the comparisons of accuracy, sensitivity and specificity obtained using logistic regression to correlate the propensity of protein crystallization with each of the features. In this figure, each bar represented how many features resulted in a similar accuracy, sensitivity or specificity. For example, the first bar from the left in the upper panel indicated that three amino acid features (CHAM830108, FAUJ880111 and MITS020101) had similar accuracies (0.643 ± 0.003). Similarly, the second bar indicated that three other amino acid features (CHAM830105, GOLD730101 and MIYS990101) had similar accuracies (0.657 ± 0.004). Figure 1 clearly showed that the two combined features had a relatively good relationship with the propensity of protein crystallization. In particular, the prediction using amino acid distribution probability was the best in terms of accuracy and sensitivity.
Figure 2 displayed the comparisons of accuracy, sensitivity and specificity obtained using a neural network to correlate the propensity of protein crystallization with each of the features. The presentation in this figure follows that of Figure 1. Clearly, the neural network could further distinguish the differences between features. Compared with amino acid features, Figure 1 and Figure 2 suggested that the two combined features not only were involved in the crystallization process but also served better for predicting protein crystallization. Also, many amino acid features gave similar results, which is consistent with the study that demonstrated the abundance in amino acid features. In particular, Figure 2 showed that the prediction using amino acid distribution probability was the best in terms of accuracy and specificity.
In Figure 1 and Figure 2, the database was not divided, i.e. the model parameters obtained from all 428 Mycobacterium tuberculosis proteins were used for predictions on the same proteins. This is generally considered the first stage in modeling, after which the database should be divided into two groups, one for generating model parameters and the other for validation. Figure 3 displayed the accuracy, sensitivity and specificity obtained from delete-1 jackknife validation, which further demonstrated that the predictions using combined features were not worse than those using amino acid features. In fact, Figure 3 showed that amino acid distribution probability and future composition gave the best predictions in terms of accuracy and specificity.
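The delete-1 jackknife procedure can be illustrated with a small Python sketch: each protein is held out in turn, the model is fitted to the remaining proteins, and the held-out protein is then predicted. The nearest-class-mean classifier and the toy values below are stand-ins chosen for brevity; they are assumptions for illustration and replace, rather than reproduce, the 20-1 feedforward backpropagation neural network used in this study.

```python
def fit_class_means(values, labels):
    """Stand-in model: the mean feature value of each class."""
    means = {}
    for label in set(labels):
        members = [v for v, y in zip(values, labels) if y == label]
        means[label] = sum(members) / len(members)
    return means


def predict_nearest_mean(means, value):
    """Assign the class whose mean is closest to the value."""
    return min(means, key=lambda label: abs(value - means[label]))


def jackknife_predictions(values, labels):
    """Delete-1 jackknife: predict each sample from a model
    fitted to all the other samples."""
    predictions = []
    for i in range(len(values)):
        train_values = values[:i] + values[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = fit_class_means(train_values, train_labels)
        predictions.append(predict_nearest_mean(model, values[i]))
    return predictions


# Hypothetical one-dimensional feature values; 1 = crystallized, 0 = not
toy_values = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
toy_labels = [0, 0, 0, 1, 1, 1]
print(jackknife_predictions(toy_values, toy_labels))
```

Because no sample is ever predicted by a model that has seen it, the resulting accuracy, sensitivity and specificity estimate performance on unseen proteins rather than goodness of fit.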
Table 3 listed the predictive performance with respect to each feature in terms of accuracy, sensitivity and specificity. As can be seen, the best results were obtained using amino acid distribution probability, physicochemical features and secondary structure features.
Figure 1. Accuracy, sensitivity and specificity obtained from logistic regression between the crystallization propensity of proteins from Mycobacterium tuberculosis and each of 535 features. The 535 features are grouped according to their similarity in accuracy, sensitivity and specificity.
Figure 2. Accuracy, sensitivity and specificity obtained from fitting the relationship between the propensity of protein crystallization from Mycobacterium tuberculosis and each of 535 features using 20-1 feedforward backpropagation neural network. The 535 features are grouped according to their similarity in accuracy, sensitivity and specificity.
Figure 3. Accuracy, sensitivity and specificity of delete-1 jackknife validation obtained from modeling the relationship between crystallization propensity of proteins from Mycobacterium tuberculosis and each of 535 features using 20-1 feedforward backpropagation neural network. The 535 features are grouped according to their similarity in accuracy, sensitivity and specificity.
Table 3. Predictive performance with respect to concrete features.
Figure 4 displayed the results of ROC analysis with respect to logistic regression, and to fitting and delete-1 jackknife validation using the 20-1 feedforward backpropagation neural network. Two points could be drawn: 1) all the features gave classifications distributed above the diagonal, i.e. the predictions were better than random chance, and McNemar’s test showed that the classified results were significantly different from those of a random guess (P < 0.01); and 2) the two combined features worked quite well in comparison with the others.
Figure 4. Comparison of sensitivity versus specificity obtained from logistic regression and from fitting and delete-1 jackknife validation in neural network in ROC analysis. Each gray circle is a result obtained using an individual amino acid feature while each black circle is a result obtained using one of two combined features. The diagonal line is the line of indiscrimination indicating a completely random guess. The text labels are the combined features.
Furthermore, the third combined feature, the percentage of predictable/unpredictable amino acid pairs, was used to compare the accuracy of predicting protein crystallization. Figure 5 and Figure 6 showed this analysis for neural network fitting and delete-1 jackknife validation, respectively. First, a cutoff value of accuracy was set at the 0.75, 0.80, 0.85 and 0.90 levels; second, the 428 Mycobacterium tuberculosis proteins were divided into two groups according to each cutoff value; third, the predictable portions of proteins were compared between the two groups. Figure 5 and Figure 6 showed that proteins with a large predictable portion gave a high accuracy in predicting their crystallization propensity.
Figure 5. Accuracy from fitting in crystallization prediction of Mycobacterium tuberculosis proteins (upper panel) and statistical comparison of their predictable portion of amino acid pairs at different cutoff values used to separate proteins by accuracy (lower panel, the Mann-Whitney U-test). The data are presented as median with interquartile range.
Figure 6. Accuracy from delete-1 jackknife validation in crystallization prediction of Mycobacterium tuberculosis proteins (upper panel) and statistical comparison of their predictable portion of amino acid pairs at different cutoff values used to separate proteins by predicted accuracy (lower panel, the Mann-Whitney U-test). The data are presented as median with interquartile range.
Table 4 showed the third combined feature, the unpredictable portion of amino acid pairs, and the predictive accuracy in all, crystallized and non-crystallized proteins from Mycobacterium tuberculosis. As can be seen in Table 4, this feature differed between crystallized and non-crystallized proteins, and so did the predictive accuracy. In particular, the unpredictable portion was significantly higher in crystallized proteins than in non-crystallized ones (65.25% vs. 61.50%), and the accuracy of predictions was also higher in crystallized proteins than in non-crystallized ones. However, we could not find a direct correlation between the unpredictable portion and the prediction accuracy.
Table 4. Unpredictable portion of amino acid pairs and accuracy of crystallization prediction in proteins from Mycobacterium tuberculosis. The data were presented as median with 25% - 75% interquartile range, and the Mann-Whitney U-test was used to determine the difference between crystallized and non-crystallized groups.
The issue of whether an amino acid or protein feature can be correlated with the propensity of protein crystallization has been tested through modeling [1, 4, 6, 7, 22, 31 - 39]. This is because it is impossible to conduct a control experiment on a protein without either amino acid or protein features. In this study, three new features, which combine the features of individual amino acids and the whole protein, were correlated with the crystallization propensity of proteins from Mycobacterium tuberculosis. The results demonstrate that these three combined features can be considered factors that affect the propensity of protein crystallization. Among the three combined features, the amino acid pair predictability uses a single value, the unpredictable portion, to represent a protein, while the other two features, amino acid distribution probability and future composition, have one value for each type of amino acid. In this view, the amino acid distribution probability and future composition are somewhat similar to the 540-plus amino acid features. However, the two combined features do not have constant values as those amino acid features do, so they more efficiently reflect certain features of amino acids in a whole protein. Clearly, more studies are needed to extend these three protein features to the analysis of the crystallization process in proteins from other organisms.
This study was supported by the National Natural Science Foundation of China (31560315) and the Key Project of Guangxi Scientific Research and Technology Development Plan (AB17190534).
 Kurgan, L. and Mizianty, M.J. (2009) Sequence-Based Protein Crystallization Propensity Prediction for Structural Genomics: Review and Comparative Analysis. Natural Science, 1, 93-106.
 Canaves, J.M., Page, R., Wilson, I.A. and Stevens, R.C. (2004) Protein Biophysical Properties That Correlate with Crystallization Success in Thermotoga maritima: Maximum Clustering Strategy for Structural Genomics. Journal of Molecular Biology, 344, 977-991.
 Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T. and Kanehisa, M. (2008) AAindex: Amino Acid Index Database, Progress Report 2008. Nucleic Acids Research, 36, D202-D205.
 Overton, I.M., Padovani, G., Girolami, M.A. and Barton, G.J. (2008) ParCrys: A Parzen Window Density Estimation Approach to Protein Crystallization Propensity Prediction. Bioinformatics, 24, 901-907.
 Chen, L., Oughtred, R., Berman, H.M. and Westbrook, J. (2004) TargetDB: A Target Registration Database for Structural Genomics Projects. Bioinformatics, 20, 2860-2862.
 Berman, H.M., Gabanyi, M.J., Kouranov, A., Micallef, D.I., Westbrook, J. and Protein Structure Initiative Network of Investigators. (2017) Protein Structure Initiative—TargetTrack 2000-2017—All Data Files (Data Set). Zenodo.
 Slabinski, L., Jaroszewski, L., Rodrigues, A.P.C., Rychlewski, L., Wilson, I.A., Lesley, S.A. and Godzik, A. (2007) The Challenge of Protein Structure Determination—Lessons from Structural Genomics. Protein Science, 16, 2472-2482.
 Slabinski, L., Jaroszewski, L., Rychlewski, L., Wilson, I.A., Lesley, S.A. and Godzik, A. (2007) XtalPred: A Web Server for Prediction of Protein Crystallizability. Bioinformatics, 23, 3403-3405.
 Yan, S. and Wu, G. (2012) Correlating Dynamic Amino Acid Properties with Success Rate of Crystallization of Proteins from Bacteroides vulgatus. Crystal Research and Technology, 47, 511-516.
 Yan, S. and Wu, G. (2013) Association of Combined Features of Amino Acid and Protein with Crystallization Propensity of Proteins from Cytophaga Hutchinsonii. Zeitschrift fur Kristallographie, 228, 250-254.
 Yan, S.M., Wang, H.J. and Wu, G. (2013) Correlation of Combined Features of Amino Acid and Protein with Crystallization Propensity of Proteins from Caenorhabditis elegans. Guangxi Sciences, 20, 234-243.
 Yan, S. and Wu, G. (2019) Correlation of Combined Characters of Amino Acid and Whole Protein with Success Rate of Crystallization of Lactobacillus Proteins. Journal of Biomedical Science and Engineering, 12, 245-256.
 Darby, N.J. and Creighton, T.E. (1993) Dissecting the Disulphide-Coupled Folding Pathway of Bovine Pancreatic Trypsin Inhibitor. Forming the First Disulphide Bonds in Analogues of the Reduced Protein. Journal of Molecular Biology, 232, 873-896.
 Chou, P.Y. and Fasman, G.D. (1978) Prediction of Secondary Structure of Proteins from Amino Acid Sequence. Advances in Enzymology and Related Subjects of Biochemistry, 47, 45-148.
 Shaw, P.A., Pepe, M.S., Alonzo, T.A. and Etzioni, R. (2009) Methods for Assessing Improvement in Specificity when a Biomarker is Combined with a Standard Screening Test. Statistics in Biopharmaceutical Research, 1, 18-25.
 Pepe, M., Longton, G. and Janes, H. (2009) Estimation and Comparison of Receiver Operating Characteristic Curves. The Stata Journal: Promoting Communications on Statistics and Stata, 9, 1-16.
 Cai, T.X., Pepe, M.S., Zheng, Y.Y., Lumley, T., and Jenny, N.S. (2006) The Sensitivity and Specificity of Markers for Event Times. Biostatistics, 7, 182-197.
 Atchley, W.R., Zhao, J., Fernandes, A.D. and Druke, T. (2005) Solving the Protein Sequence Metric Problem. Proceedings of the National Academy of Sciences of the United States of America, 102, 6395-6400.
 Chen, K., Kurgan, L. and Rahbari, M. (2007) Prediction of Protein Crystallization Using Collocation of Amino Acid Pairs. Biochemical and Biophysical Research Communications, 355, 764-769.
 Kurgan, L., Razib, A.A., Aghakhani, S., Dick, S., Mizianty, M.J. and Jahandideh, S. (2009) CRYSTALP2: Sequence-Based Protein Crystallization Propensity Prediction. BMC Structural Biology, 9, 50.
 Elbasir, A., Moovarkumudalvan, B., Kunji, K., Kolatkar, P.R., Mall, R. and Bensmail, H. (2019) DeepCrystal: A Deep Learning Framework for Sequence-Based Protein Crystallization Prediction. Bioinformatics, 35, 2216-2225.
 Meng, F., Wang, C. and Kurgan, L. (2018) fDETECT Webserver: Fast Predictor of Propensity for Protein Production, Purification, and Crystallization. BMC Bioinformatics, 18, 580.
 Derewenda, Z.S. and Godzik, A. (2017) The “Sticky Patch” Model of Crystallization and Modification of Proteins for Enhanced Crystallizability. In: Wlodawer, A., Dauter, Z. and Jaskolski, M., Eds., Protein Crystallography. Methods in Molecular Biology, Humana Press, New York, 77-115.
 Wang, H., Feng, L., Webb, G.I., Kurgan, L., Song, J. and Lin, D. (2018) Critical Evaluation of Bioinformatics Tools for the Prediction of Protein Crystallization Propensity. Briefings in Bioinformatics, 19, 838-852.
 Wang, H., Feng, L., Zhang, Z., Webb, G.I., Lin, D. and Song, J. (2016) Crysalis: An Integrated Server for Computational Analysis and Design of Protein Crystallization. Scientific Reports, 6, 21383.