Received 21 December 2015; accepted 14 March 2016; published 17 March 2016
A protein is a linear chain of 20 amino acids, which starts with a start codon ATG, which corresponds to the amino acid methionine, followed by a sequence of amino acids and ends with a stop codon. The amino acid sequence that makes a protein is called its primary structure. Protein sequence analysis means analysis of its primary structure. It provides important insight into the structure of proteins, which in turn, greatly facilitates the understanding of its biochemical and cellular function. Efforts to use computational methods in predicting protein structure, which are based only on sequence information, started several years ago. Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences  . Existing methods for sequence comparison can be classified into alignment based methods and alignment-free methods. Alignment- based methods use dynamic programming, a regression technique that finds an optimal alignment by assigning scores to different possible alignments and picking the alignment with the highest score. Dynamic programming method is an accurate method for comparison of two sequences. There are two types of dynamic programming: global dynamic programming which is applied for comparing sequences as a whole   and local dynamic programming which is applied to compare selected portions of the sequences  . For multiple sequence alignments (MSA) there are many methods, some of them are given in  - . In parallel there are many algorithms available for multiple sequence comparison; some of them are listed in  - . But the main difficulty in multiple sequence alignment is the computational complexity. In fact a naïve MSA takes O(lN) time for completion of the program, where N is the number of sequences for comparison, each having a length l. Therefore as an alternative to sequence based alignment, alignment free methods are chosen. Such alignment free methods are already known for comparison of DNA/RNA sequence comparison  - . Alignment free methods are based on graphical representation, which is obtained from the numerical values given to the nucleotides. Next step is to obtain the descriptors for obtaining distance matrices. One way is to characterize the graphs by obtaining directly some invariants like geometrical centres, graph radii, variances etc.  and to use these invariants as descriptors to obtain distance matrices for constructing phylogenetic trees of comparison. The other way is to transform the graphical representation into another mathematical object, a real symmetric matrix. The matrices may be of L/L, M/M or J/J types  . J/J matrix is applicable only for 3D representation. Once a real symmetric matrix M is obtained, there are different invariants associated with this matrix such as the average matrix element, the average row sum, the leading eigen value, Weiner number and the Alley index/normalized Alley index  . These may be used to obtain the distance matrix to obtain phylogenetic trees for comparison of DNA/RNA sequences. Now protein sequences are to some degree similar to DNA/RNA sequences in the sense that DNA/RNA sequences contain four nucleotides, whereas protein sequences contain twenty amino acids. Thus the graphical representation methods for comparison of DNA/RNA may be extended to protein sequences as well.
Currently, many researchers have proposed different methods for the graphical representation of protein sequences  - . In most of these existing methods, the main drawbacks are that the higher the dimension of the protein sequence graphs, the heavier the computation complexity of the methods or the lower the recognition degree of the protein sequence graphs   . In the methods proposed in  - , the main drawbacks are that the 3D graphics seem to be more complex and have lower visibility than the 2D graphics, and, in addition, to obtain the sequence invariants from the graphics, complex matrices are required to be constructed, which need much computation and storage.
Lei Wang, Hui Peng and Jinhua Zheng  proposed a novel method for analyzing the similarity/dissimilar- rity by combining the idea of the sequence alignment and the graphical representation methods to avoid the weakness of both of these two methods to some degree. Principal components analysis (PCA) is a standard tool in multivariate data analysis to reduce the number of dimensions, which has been proved to be effective in the process of protein sequence analysis  - . They used 29 different spike proteins, which are widely used as the test data  -  .
Anyway it is very difficult to make direct comparison of protein sequences owing to their very long sizes and unequal lengths. Actually a direct comparison between protein sequences would not only be tedious but would involve steps not yet fully resolved, such as how to proceed when comparing sequences of different lengths. A possible strategy to avoid such difficulties is to represent the protein sequences by suitable condensed matrices. In this way, comparison between sequences reduces to comparison between matrices. Similar reduction of DNA sequence to 4 × 4 matrices is already known  . The case was simple in the sense that it was needed to consider representation of four nucleotides only, where as in the case of protein sequence it is much more difficult to handle twenty amino acids simultaneously. To make the problem manageable, we start with a suitable numerical representation of the sequence and finally reduce it in the form of a condensed 20 × 20 matrix of unique size.
2. Methods of Constructing 20 × 20 Matrices of Protein Sequences
Construction of 20 × 20 matrix of protein sequences involves four primary steps. For the sake of convenience, we explain the steps with reference to the small sample shown in Table 1. As it contains only 9 distinct amino acids, so in this case, the final reduction is a 9 × 9 matrix.
2.1. First Step: To Calculate Distance of Each Label from the Neighboring Labels
In the first step we construct “distance” of each label from the neighboring labels of the same and different kind of amino acid. It is calculated by numbering the amino acids in the protein sequence starting from 0 (zero) for the first amino acid and starting from 1 (one) for the other amino acids. Thereby we get a 12 × 12 matrix, where 12 is the length of the protein sequence (as shown in Table 2). The entries of the matrix represent frequencies of occurrences of amino acids.
2.2. Second Step: To Group Together All Similar Amino Acids
Second step involves grouping together all similar amino acids. First of all, the amino acids are taken alphabetically as D, E, K, L, M, P, R, V, W. Next the bases of the same kind are grouped together. The elements of the matrix correspond to the serial distance of the first 12 amino acids. The rearranged 12 × 12 matrix is shown in Table 3.
2.3. Third Step: To Obtain Sub-Matrices
From Table 3, we obtain forty five sub-matrices (DD, DE, DK, DL, DM, DP, DR, DV, DW, EE, EK, EL, EM, EP, ER, EV, EW, KK, KL, KM, KP, KR, KV, KW, LL, LM, LP, LR, LV, LW, MM, MP, MR, MV, MW, PP, PR, PV, PW, RR, RV, RW, VV, VW, WW). Some of them are shown in Tables 4-6.
2.4. Fourth Step: To Obtain Final Reduced 9 × 9 Condensed Matrix
All the sub-matrices are not square matrices. So we cannot get Eigen values of all the sub-matrices. Mathematically it may be shown that the average of all the elements of a square matrix nearly approximates the highest Eigen value of that matrix. So in the third step we consider the average of all the elements of each sub-matrix
Table 1. Small sample of protein sequences.
Table 2. 12 × 12 matrix obtained by considering the 12 amino acids from the beginning of the sequence of Table 1.
Table 3. Rearranged 12 ×12 matrix.
Table 4. DE sub-matrix.
Table 5. EP sub-matrix.
Table 6. PR sub-matrix.
and finally get the 9 × 9 condensed matrix given in Table 7.
3. Construction of 20 × 20 Condensed Matrix for (HIV 1) Tat Protein
We consider the sequence of Human Immunodeficiency Virus 1 (HIV 1) Tat Protein, which has 86 amino acids. By following steps one to three as above, we get 86 × 86 rearranged matrix and calculate the sub-matrices Two such sub-matrices AA and CG are given in Table 8 and Table 9 respectively;
The final 20 × 20 condensed matrix is given in Table 10.
4. Sensitivity of the 20 × 20 Condensed Matrix
Now we change the protein sequence of Human Immunodeficiency Virus 1 a little bit. We interchange the 5th and 56th amino acid i.e. we take the 5th amino acid as R instead of D and take the 56th amino acid as D instead of R and we get the following (Table 11).
Table 7. 9 × 9 Condensed matrix.
Table 8. AA sub-matrix.
Table 9. CG sub-matrix.
The result shows that our method of constructing the 20 × 20 Condensed Matrix is highly sensitive. Little bit of change in the sequence of a protein affects the content of the final 20 × 20 condense matrix.
5. Comparison of Protein Sequences
As we have already illustrated how to reduce a protein sequence to a condensed 20 × 20 matrix, so the problem of comparison of two protein sequences reduces to the problem of comparison of two 20 × 20 matrices. In this paper we solve this problem by ALE index  .
ALE index for a matrix M is defined by
Table 10. 20 × 20 condense matrix of the protein sequence of Human Immunodeficiency Virus 1 (HIV 1) tat protein.
Table 11. Modified or changed protein sequence of Human Immunodeficiency Virus 1 (HIV 1) tat protein (length 86 amino acids).
The ALE-index is very simple for calculation so that it can be directly used to handle long sequences. If desired, one can introduce weighting procedure that will normalize magnitudes of the ALE-indices to reduce variations caused by comparison of matrices of different sizes. For instance, one can consider instead of χ a normalized ALE-index χ' = χ/n, where n is the length of the sequence and the order of the corresponding matrix as well.
6. Sequences for Comparison
We have used the NADH dehydrogenase subunit 3 (ND3), subunit 4 (ND4) and subunit 5 (ND5) protein sequences of nine species for comparison as shown in Table 14.
7. Measures of Comparison of Sequences from Reduced Matrices
First we construct 20 × 20 matrices for nine protein sequences of ND3, ND4 and ND5. Then we calculate the differences of each pair of protein sequences. For example, the difference of 20 × 20 matrices of Human and Gorilla for ND5 protein sequences is shown in Table 15. In this way we get 36 matrices for each type of protein (ND3, ND4 and ND5). Then we calculate the χ' values of 36 matrices for each type of protein (ND3, ND4 and
Table 12. 20 × 20 Condense matrix of modified or changed protein sequence of Human Immunodeficiency Virus 1 (HIV 1).
Table 13. 20 × 20 matrix of the differences of Table 8 and Table 10.
Table 14. List of nine species with their versions and lengths.
Table 15. Difference of 20 × 20 matrices of human and gorilla (ND5).
ND5). The results for ND3, ND4 and ND5 are shown in Tables 16-18 respectively. Then we construct the phylogenetic trees for each type of proteins (ND3, ND4 and ND5) for nine different species (Human, Gorilla, Common Chimpanzee, Pygmy Chimpanzee, Fin Whale, Blue Whale, Rat, Mouse and Opossum). The results are shown in Figures 1-3 respectively.
Table 16. ALE index of pair of nine species (ND3).
Table 17. ALE index of pair of nine species (ND4).
In this paper we introduce a novel characterization for Protein Sequence using condensed matrices that are based on average distances for pairs of bases obtained as quotients of sequential numbers and serial numbers in primary sequences. Such matrices not only offer some insight into the nature of the protein sequence but also allow one to make qualitative and quantitative comparisons between different sequences of proteins, whether within the same species or between different species.
The method of construction of 20 × 20 condensed matrix of the protein sequence reveals that
・ The representation of the protein sequence in 20 × 20 condensed matrix is unique.
・ The condensed form of representation may help in comparing two protein sequences of unequal lengths.
・ It is applicable to sequence of any finite length, however large it may be.
・ The phylogenetic trees (Figures 1-3) of nine species of three different types of proteins (ND3, ND4 and ND5) agree with the standard phylogenetic tree of the same species.
・ It is comparatively an easier form of comparison of protein sequences.
Table 18. ALE index of pair of nine species (ND5).
Figure 1. Phylogenetic tree of nine species (ND3).
Figure 2. Phylogenetic tree of nine species (ND4).
Figure 3. Phylogenetic tree of nine species (ND5).
Condensed matrix representation of protein sequences is a useful tool. It is applicable to comparison of protein sequences of equal or unequal lengths and of any finite size, however large it may be. It is also an accurate one in comparing the protein sequences of the aforesaid types.