Protein crystallization is a delicate and time-consuming process, because many known and unknown factors influence it. It is therefore hoped that a rule relating these factors to crystallization can be found to facilitate the process, i.e., to predict whether a protein is likely to crystallize. Over recent years, intensive efforts have been made to identify such factors and to correlate them with the success rate of protein crystallization [1-8]. Technically, these factors must be numeric in order to be correlated with the success rate of crystallization. These efforts have become less productive recently, because almost all known factors have been tested without much improvement in prediction. Of the tested factors, many relate exclusively to characters of individual amino acids, for example, the molecular weight of an amino acid, whereas only a small number relate to characters of the whole protein, for example, the length of a protein.
It is therefore necessary to correlate factors that combine both individual amino-acid characters and whole-protein characters with the rate of protein crystallization. This is because 1) an individual amino-acid character, for example, molecular weight, is a fixed number regardless of whether the amino acid is in a protein or exists on its own, and 2) the protein characters used in previous studies appear simple. In fact, one combined character already exists, i.e., the amino acid composition, which represents a very basic character of proteins and has been widely used in various analyses. However, new combined characters are needed in order to understand the nature of proteins from different angles.
Over the last decade, we have developed three combined characters that characterize the individual amino acid and the protein together, and we have applied them in many different studies, for example, protein evolution, drug target design, determination of mutation patterns, analysis of genetic disorders, protein structure and function, and prediction of mutations in influenza A viruses [9-12]. The results demonstrate the applicability and advantage of the combined characters; thus we wish to correlate these combined characters with the success rate of protein crystallization.
Technically, the relationship between various factors and the success rate of protein crystallization has been established via modeling, because it is impossible to run a control experiment without individual amino-acid characters and protein characters. So far, logistic regression has been the major tool for modeling this relationship, because whether a protein can be crystallized is a yes-no event, with protein sequences encoded using individual amino-acid characters [4-6]. In this study, an attempt was made to test the role of combined characters in the crystallization of Lactobacillus proteins via logistic regression and a neural network model, and the results were compared with those obtained from each of 531 individual amino-acid characters.
We chose Lactobacillus not only because it is important for human health and for the food industry [13-15], but also because great efforts have been made to crystallize its proteins. The data sample is relatively larger than that for proteins from other species of interest.
2. Materials and Methods
A total of 314 proteins from Lactobacillus were found in TargetDB under the criterion of purified proteins before 2011, of which 141 were found under the criterion of crystallized proteins. These two criteria were used in previous studies [17-22].
2.2. Combined Characters
A combined character is a character that combines, as a numerical value, a character of an individual amino acid with a character of a whole protein. For example, the molecular weight of an amino acid is a character of an individual amino acid and does not change no matter where the amino acid is located in a protein. Although the molecular weight is unchangeable, the amino acid should affect the crystallization of a protein differently when it is located at different positions. Similarly, the length of a protein is a character associated with the whole protein; however, it loses the individuality of the constituent amino acids, because proteins of the same length are not guaranteed to have the same crystallization propensity, as they can have different amino acid compositions. It is therefore important to have a combined character formed from both a character of an individual amino acid and a character of the whole protein.
The first combined character is the amino acid distribution probability, which is based on the occupancy of subpopulations and partitions and for which online computation is available. Two worked examples are listed in columns 8 and 9 of Table 1 to show how this combined character differs from protein to protein.
Table 1. Comparison of characters of individual amino acid and combined character of individual amino acid and of a whole protein.
OOBM850101 is a character of an individual amino acid that describes the optimized beta-structure-coil equilibrium constant. P1 and P2 are two proteins with accession numbers LdR34 and LpR114. The amino acid distribution probability was computed according to the equation $\frac{r!}{q_0! \times q_1! \times \cdots \times q_n!} \times \frac{r!}{r_1! \times r_2! \times \cdots \times r_n!} \times n^{-r}$, where ! is the factorial, r is the number of a type of amino acid, q is the number of partitions with the same number of amino acids and n is the number of partitions in the protein for a type of amino acid.
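The distribution probability can be illustrated with a minimal Python sketch. It assumes the classical occupancy formula r!/(q0! × q1! × … × qn!) × r!/(r1! × r2! × … × rn!) × n^(-r), that the number of partitions n equals the number r of amino acids of the given type, and that r1, …, rn are the counts of that amino acid in each partition. The function name and the input layout are hypothetical and are not taken from the online tool.

```python
from collections import Counter
from math import factorial, prod


def distribution_probability(counts):
    """Probability of one occupancy pattern for a single amino-acid type.

    counts[i] is the number of residues of that type found in partition i.
    Assumption: the number of partitions n equals the number of residues r.
    """
    n = len(counts)
    r = sum(counts)
    if n == 0 or n != r:
        raise ValueError("this sketch assumes as many partitions as residues")
    # q[k] = number of partitions that hold exactly k residues
    q = Counter(counts)
    # First factor: r! / (q0! * q1! * ... * qn!)
    partition_arrangements = factorial(r) // prod(factorial(v) for v in q.values())
    # Second factor: r! / (r1! * r2! * ... * rn!)
    residue_arrangements = factorial(r) // prod(factorial(c) for c in counts)
    # Third factor: n ** (-r)
    return partition_arrangements * residue_arrangements / n ** r


# Three residues of one type spread over three partitions
for pattern in ([1, 1, 1], [2, 1, 0], [3, 0, 0]):
    print(pattern, round(distribution_probability(pattern), 4))
```

Under these assumptions the three possible patterns for r = 3 have probabilities that sum to one, so the same amino acid type receives a different value depending on how its residues are spread along the protein.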
The second combined character is the amino acid future composition, which is based on the relationship between RNA codons and their translated amino acids [25-27] and for which online computation is available. Two worked examples are listed in columns 10 and 11 of Table 1 to show how this combined character differs from protein to protein.
2.3. Characters for Comparison
A database called AAIndex contains more than 530 different individual amino-acid characters. Some are quite familiar, for example, physicochemical properties, spatial properties, electronic properties, hydrophobic properties, predictors of secondary structure, and so on. These individual amino-acid characters are constants, i.e., each character generally has an unchangeable value for an amino acid; for example, the molecular weight of alanine is 89.09. Each individual amino-acid character was put into the model in turn to predict the success rate of crystallization of Lactobacillus proteins, for comparison with the results obtained from the combined characters.
Logistic regression and an 18-1 neural network were used, because the success of protein crystallization is a yes-no event while any character is a number for a type of amino acid; i.e., the model outcome is defined as unity when a protein can be crystallized and as zero when it cannot.
MATLAB was used to perform both logistic regression and neural network modeling [33, 34]. The results obtained from each predictor were classified as true positive (TP), true negative (TN), false positive (FP) and false negative (FN). The accuracy, sensitivity and specificity were calculated as follows: Accuracy = (TP + TN)/(TP + FP + TN + FN) × 100, Sensitivity = (TP)/(TP + FN) × 100, and Specificity = (TN)/(TN + FP) × 100. McNemar’s test was used to compare the classified results. Sensitivity and specificity were compared using receiver operating characteristic (ROC) analysis [35-37].
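The yes-no outcome coding can be sketched with a small logistic regression written in plain Python. This is only an illustration under stated assumptions, not the MATLAB routine used in this study: the gradient-descent fit, the function names, the learning rate and the single-feature toy data are all hypothetical.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def predict_probability(model, x):
    """Modeled probability that a protein with features x is crystallized."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict_probability((weights, bias), x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def predict_label(model, x):
    """Outcome is 1 (crystallized) or 0 (not crystallized)."""
    return 1 if predict_probability(model, x) >= 0.5 else 0


# Hypothetical toy data: one numeric character per protein, outcome 1 or 0
toy_rows = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_model = fit_logistic(toy_rows, toy_labels)
print(toy_model, [predict_label(toy_model, x) for x in toy_rows])
```

In practice each protein would be represented by a vector of character values rather than a single number, but the outcome coding (unity for crystallized, zero otherwise) is the same.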
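The three performance measures can be written directly from their definitions. The sketch below is in Python rather than MATLAB, and the confusion-matrix counts in the example are hypothetical, chosen only so that they add up to 141 crystallized and 173 non-crystallized proteins.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity as percentages."""
    accuracy = (tp + tn) / (tp + fp + tn + fn) * 100
    sensitivity = tp / (tp + fn) * 100
    specificity = tn / (tn + fp) * 100
    return accuracy, sensitivity, specificity


# Hypothetical counts: 100 + 41 = 141 crystallized, 90 + 83 = 173 not crystallized
accuracy, sensitivity, specificity = classification_metrics(tp=100, tn=90, fp=83, fn=41)
print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```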
3. Results and Discussion
Table 1 compares an individual amino-acid character, OOBM850101, with the combined characters. No matter what an amino-acid character describes, its value for each type of amino acid is unchangeable (columns 4 and 5). This appears counter-intuitive when we use it to describe an amino acid in a protein, because intuitively an amino acid should have different values depending on its position, its neighboring amino acids and the composition of the protein. On the other hand, we can weight an individual amino-acid character by the amino acid composition (columns 6 and 7). By contrast, the combined characters do have different values for the same type of amino acid when the amino acids are located at different positions, when their neighboring amino acids are different and when their number in a protein is different (last four columns). Therefore, the combined characters are more meaningful, but their values have to be computed for each type of amino acid in each protein.
Figure 1 shows the accuracy, sensitivity and specificity obtained using logistic regression to correlate the success rate of protein crystallization with each of the two combined characters and each of 533 individual amino-acid characters. In this figure, each bar represents how many of the characters used in predictions resulted in a similar accuracy, sensitivity or specificity. For example, the rightmost bars in the upper, middle and lower panels indicate that the predictions using each of 483 individual amino-acid characters produced a similar accuracy of 0.6, that the predictions using each of 488 individual amino-acid characters produced a similar sensitivity of 0.6, and that the prediction using one individual amino-acid character produced the highest specificity. As another example, VENT840101 and FAUJ880112 had accuracies of 0.53 and 0.55 in the first and second bars from the left in the upper panel, while the third bar indicates that three individual amino-acid characters, FAUJ880111, CHAM830107 and NOZY710101, had similar accuracies (0.58 ± 0.01). Figure 1 clearly shows that the two combined characters, distribution probability and future composition, had a relatively good relationship with the success rate of crystallization of Lactobacillus proteins.
Figure 1. Accuracy, sensitivity and specificity of predictions using logistic regression to model the success rate of crystallization of proteins from Lactobacillus and each of 535 characters. The text labels are the combined characters introduced in this study.
A frequent question in modeling is whether the predictors yield no better than a random prediction, which is especially relevant for yes-no events because such events are easily likened to tossing a coin. As good performance requires a high true positive rate and a low false positive rate, ROC (receiver operating characteristic) analysis is appropriate. Figure 2 compares sensitivity with 1-specificity obtained from logistic regression, where the x-axis represents 1-specificity (the false positive rate) and the y-axis represents sensitivity (the true positive rate). As can be seen, the points lie in the upper-left area above the diagonal, indicating that these characters give a good prediction. McNemar’s test shows that the classified results are significantly different from those of a random guess (P < 0.05). However, one circle is located near the lower-left corner, which resulted from an individual amino-acid character, FAUJ880112, reflecting negative charge. Thus, FAUJ880112 is not suitable for predicting the success rate of crystallization of Lactobacillus proteins.
Figure 3 shows the accuracy, sensitivity and specificity obtained using an 18-1 feedforward backpropagation neural network to correlate the success rate of protein crystallization with each of the two combined characters and each of 533 individual amino-acid characters. Figure 3 has similar explanations and implications to those of Figure 1. Clearly, the neural network can further distinguish between characters in predicting the success rate of protein crystallization. Compared with the individual amino-acid characters, Figures 1-3 suggest that the two combined characters are sensitive to the crystallization process of Lactobacillus proteins. Not surprisingly, many individual amino-acid characters generated similar results, which is consistent with the study showing the abundance of individual amino-acid characters.
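The two checks described above can be sketched in Python. The ROC point follows directly from the confusion-matrix counts. For McNemar's test, the sketch assumes the continuity-corrected chi-square form with one degree of freedom applied to the two discordant counts b and c of a paired comparison; how the pairs were formed in this study is not stated here, so the function names and the example counts are hypothetical.

```python
from math import erfc, sqrt


def roc_point(tp, tn, fp, fn):
    """Return (1 - specificity, sensitivity) as fractions between 0 and 1."""
    false_positive_rate = fp / (fp + tn)
    true_positive_rate = tp / (tp + fn)
    return false_positive_rate, true_positive_rate


def is_above_diagonal(false_positive_rate, true_positive_rate):
    """A point above the diagonal does better than a completely random guess."""
    return true_positive_rate > false_positive_rate


def mcnemar_p_value(b, c):
    """Continuity-corrected McNemar test on the discordant pair counts b and c."""
    if b + c == 0:
        return 1.0
    statistic = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom
    return erfc(sqrt(statistic / 2))


# Hypothetical counts for one character
fpr, tpr = roc_point(tp=100, tn=90, fp=83, fn=41)
print(round(fpr, 3), round(tpr, 3), is_above_diagonal(fpr, tpr))
print(mcnemar_p_value(b=25, c=10) < 0.05)
```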
Figure 2. Comparison of sensitivity versus specificity obtained from logistic regression in ROC analysis. Each yellow circle is a result obtained using an individual amino-acid character while each pink circle is a result obtained using one of two combined characters. The diagonal line is the line of indiscrimination indicating a completely random guess. The text labels are the combined characters introduced in this study.
Figure 3. Accuracy, sensitivity and specificity obtained using neural network to model the success rate of protein crystallization from Lactobacillus and each of 535 characters. The text labels are the combined characters introduced in this study.
For the results in Figures 1-3, the database was not divided, i.e., the model parameters obtained from the 314 Lactobacillus proteins were used for predictions. This is generally considered the first stage in modeling, after which the database should be divided into two groups, one for the generation of model parameters and the other for validation. Figure 4 displays the accuracy, sensitivity and specificity obtained using delete-1 jackknife validation, which further demonstrates that the predictions using combined characters were not worse than those using individual amino-acid characters.
Figure 5 displays the results of ROC analysis with respect to fitting and delete-1 jackknife validation using the 18-1 feedforward backpropagation neural network. Although McNemar’s test shows that the classified results are significantly different from those of a random guess (P < 0.05), a cluster of circles appears at the lower-left corner and near the diagonal, indicating that 152 individual amino-acid characters result in a sensitivity smaller than 0.5 in the fitting (upper panel of Figure 5); therefore these characters cannot be used as predictors. In contrast, the two combined characters and the other individual amino-acid characters can be used to predict the success rate of crystallization of Lactobacillus proteins.
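The delete-1 jackknife procedure can be sketched generically in Python: each protein is left out in turn, the model is fitted to the remaining proteins, and the left-out protein is then predicted. The midpoint-threshold classifier and the toy data below are hypothetical stand-ins used only to keep the sketch self-contained; they are not the logistic regression or the 18-1 neural network used in this study.

```python
def jackknife_predictions(features, labels, fit, predict):
    """Delete-1 jackknife: predict each sample with a model fitted to all others."""
    predictions = []
    for i in range(len(features)):
        training_features = features[:i] + features[i + 1:]
        training_labels = labels[:i] + labels[i + 1:]
        model = fit(training_features, training_labels)
        predictions.append(predict(model, features[i]))
    return predictions


def fit_midpoint_threshold(values, labels):
    """Hypothetical stand-in model: threshold halfway between the two class means."""
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def predict_with_threshold(threshold, value):
    """Predict 1 (crystallized) above the threshold and 0 otherwise."""
    return 1 if value > threshold else 0


# Hypothetical toy data: one numeric character value per protein
toy_values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 1, 1]
validated = jackknife_predictions(
    toy_values, toy_labels, fit_midpoint_threshold, predict_with_threshold
)
print(validated)
```

The validated predictions can then be compared with the true outcomes to obtain TP, TN, FP and FN for the validation stage, in the same way as for the fitting stage.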
Figure 4. Accuracy, sensitivity and specificity of delete-1 jackknife validation obtained using neural network to model the success rate of crystallization of proteins from Lactobacillus and each of 535 characters. The text labels are the combined characters introduced in this study.
Figure 5. Comparison of sensitivity versus specificity obtained from neural network in ROC analysis. Each yellow circle is a result obtained using an individual amino-acid character while each pink circle is a result obtained using one of two combined characters. The diagonal line is the line of indiscrimination indicating a completely random guess. The text labels are the combined characters introduced in this study.
The workload in this study was substantial, because the proposed combined characters were checked against each of 532 individual amino-acid characters in order to reach a solid conclusion.
The current practice in predicting the success rate of crystallization employs as many characters as possible, as in the hybrid crystal growth predictive model, the “sticky patch” model, and the theoretical underpinning using a solubility phase diagram. Therefore, we would expect our proposed combined characters to be included among the factors that influence the success rate of crystallization of Lactobacillus proteins.
At present, it is still difficult to build a predictive relationship between an individual protein and its crystallization propensity using either the logistic model or the neural network model. This suggests that a more sophisticated model, for example, a deep learning model, could be more suitable for such studies in the future. On the other hand, the introduction of cryo-electron microscopy to determine the 3-dimensional structure of proteins reduces the demand for protein crystallization for X-ray crystallography; however, the relationship between an individual protein and its crystallization propensity is still important.
This study was supported by the National Natural Science Foundation of China (31460296 and 31560315), the Key Project of Guangxi Scientific Research and Technology Development Plan (AB17190534) and the Special Funds for Building of Guangxi Talent Highland.