Received 14 February 2016; accepted 27 June 2016; published 30 June 2016
Of the many known amino acids, only 20 are generally found in living beings, and every protein sequence is built from these 20 amino acids. The representation of a protein in terms of its amino acids is called its primary sequence. Based on this primary sequence, protein sequence comparison involves two types of methods: 1) alignment-based methods and 2) alignment-free methods. Protein sequence comparison was initially carried out by various alignment-based methods  -  . However, mainly because of their execution time and comparatively involved procedures, alignment-free methods subsequently came to be preferred. A good survey of the literature on alignment-free methods up to 2003 is available in  . We therefore begin by highlighting some of the most important contributions to alignment-free protein sequence comparison from 2003 onwards  -  . In most cases, protein sequence comparison follows an approach similar to that used in genome sequence analysis, because the 20 amino acids play the same role in a protein sequence as the four nucleotides do in a genome sequence. In detail, numerical representations of the protein sequences are first obtained from numerical values assigned to the individual amino acids; graphical representations of the sequences are then constructed, and descriptors are derived from these graphs. The descriptors are finally used to compare the protein sequences. All the papers from  -  involve graphical representations. A completely different approach is also followed in protein sequence comparison, based on classifying the amino acids into groups of different cardinality    . The Discrete Fourier Transform (DFT) is also well known in bioinformatics. It is widely used in signal and image processing  -  , and its main applications in DNA research include gene prediction, hierarchical analysis and others  -  . 
The DFT is used effectively to identify protein coding regions, because the DFT spectrum of a DNA sequence reflects the distribution and periodic pattern of the sequence  . The DFT is applied to binary sequences in  , where the binary sequences are generated from genome sequences by a Voss-type representation. To use the DFT in a similar way in protein sequence analysis, the corresponding Voss-type representation of amino acids must be known a priori. Fortunately, the Voss representation of DNA sequences, which involves 4 nucleotides, has already been generalized to a Voss-type representation of the 20 amino acids in protein sequences  . This representation has already been used to obtain a fuzzy representation of amino acids  , which was found to be effective in classifying the amino acids into 6 groups. Protein sequence classification has in turn been obtained from these groups of amino acids   -   . The Voss-type representation of amino acids is thus an important contribution to protein sequence analysis. However, the FFT has not yet been applied to the binary representations of protein sequences generated by this Voss-type representation for the purpose of protein sequence comparison. This motivates the present paper, which uses such binary sequences to compare protein sequences.
2.1. Voss Type Binary Representation of Amino Acid
The 20 amino acids are taken in the following order:
Alanine (A), Cysteine (C), Aspartic acid (D), Glutamic acid (E), Phenylalanine (F), Glycine (G), Histidine (H), Isoleucine (I), Lysine (K), Leucine (L), Methionine (M), Asparagine (N), Proline (P), Glutamine (Q), Arginine (R), Serine (S), Threonine (T), Valine (V), Tryptophan (W) and Tyrosine (Y).
Each amino acid is represented by a 20-component vector in which exactly one bit is 1 and the others are 0; the position of the 1 follows the order given above. For example, Alanine (A) is represented by 10000000000000000000. The same rule applies to the other amino acids, so the last amino acid, Tyrosine (Y), is represented by 00000000000000000001.
From each protein sequence S we obtain 20 binary sequences, one for each amino acid, by writing 1 at every position of S occupied by the amino acid under consideration and 0 at every other position. This gives the 20 binary representations $U_A$, $U_C$, $U_D$, $U_E$, $U_F$, $U_G$, $U_H$, $U_I$, $U_K$, $U_L$, $U_M$, $U_N$, $U_P$, $U_Q$, $U_R$, $U_S$, $U_T$, $U_V$, $U_W$ and $U_Y$.
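The construction above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `one_hot` and `voss_indicators` and the toy sequence are hypothetical and do not come from the paper.

```python
# One-letter codes in the order used in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """20-component vector with a single 1 at the residue's position."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]


def voss_indicators(sequence):
    """Map each amino acid to its binary indicator sequence over `sequence`."""
    return {aa: [1 if r == aa else 0 for r in sequence] for aa in AMINO_ACIDS}


toy = "MACAY"  # hypothetical toy sequence, not taken from the paper's data
u = voss_indicators(toy)
print(u["A"])  # [0, 1, 0, 1, 0]
print(u["M"])  # [1, 0, 0, 0, 0]
```

Each position of the sequence carries a 1 in exactly one of the 20 indicator sequences, so the indicators together hold the same information as the primary sequence.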
2.2. ICD Method for Protein Sequence Analysis
The ICD method is essentially the same for DNA sequences and protein sequences, as both cases deal only with binary sequences. We therefore describe the ICD method as given in  . First, the FFT is applied to each binary representation of a protein sequence of length N, say. The amplitudes of the Fourier spectrum are taken, which gives N/2 distinct numbers. These N/2 components are normalized by their lengths. From the N/2 normalized components we compute the inter-coefficient differences (ICD), i.e. the absolute values of the differences between each term and the term that follows it. This gives (N/2 − 1) elements for each amino acid. The 20 sets of (N/2 − 1) components are then concatenated to give a descriptor of length 20*((N/2) − 1). From these descriptors a distance matrix is formed using the Euclidean distance measure, as follows.
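A minimal sketch of the descriptor computation is given below, under several assumptions that the text does not settle: the N/2 amplitudes are taken at frequencies k = 0, ..., N/2 − 1; "normalized by their lengths" is read as division by the Euclidean norm of the amplitude vector; and a direct DFT from the standard library stands in for an FFT routine (it yields the same coefficients, only more slowly). All function names are hypothetical.

```python
import cmath
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dft_amplitudes(bits):
    """Amplitudes |U[k]| for k = 0 .. N//2 - 1, computed by a direct DFT."""
    n = len(bits)
    amps = []
    for k in range(n // 2):
        coeff = sum(b * cmath.exp(-2j * math.pi * k * t / n)
                    for t, b in enumerate(bits))
        amps.append(abs(coeff))
    return amps


def icd(bits):
    """Absolute differences between successive normalized amplitudes."""
    amps = dft_amplitudes(bits)
    # Assumption: normalize by the Euclidean norm of the amplitude vector.
    norm = math.sqrt(sum(a * a for a in amps))
    if norm > 0:  # an absent amino acid gives an all-zero spectrum
        amps = [a / norm for a in amps]
    return [abs(amps[k + 1] - amps[k]) for k in range(len(amps) - 1)]


def icd_descriptor(sequence):
    """Concatenate the ICD vectors of the 20 indicator sequences."""
    descriptor = []
    for aa in AMINO_ACIDS:
        bits = [1 if r == aa else 0 for r in sequence]
        descriptor.extend(icd(bits))
    return descriptor


toy = "MACAYGLA"  # hypothetical toy sequence with N = 8
d = icd_descriptor(toy)
print(len(d))  # 60, i.e. 20 * (8 // 2 - 1)
```

For real data an FFT routine such as `numpy.fft.fft` would be the natural replacement for `dft_amplitudes`; the rest of the sketch would stay the same.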
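If $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are the descriptors of two proteins X and Y, where $n = 20((N/2) - 1)$, then the distance between X and Y is given by

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.$$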
This is the Euclidean distance between X and Y. The smaller the distance, the more similar the protein sequences. Using this formula, the distances between all pairs of proteins are calculated and used to form the distance matrix. Because the matrix is symmetric, only its lower half is taken. Applying the UPGMA software to this matrix gives the phylogenetic tree for all the species. The FFT itself does not require the sequences being compared to have the same length. If necessary, however, the lengths can be adjusted manually by appending zeros. For example, suppose two protein sequences have lengths N and M. The descriptors of the first and second sequences then have lengths 20*((N/2) − 1) and 20*((M/2) − 1) respectively. As the descriptors have unequal lengths, they cannot be compared directly. Hence if M = N − 2, say, we first make both sequences of length N by appending two zeros to the second sequence. This causes no problem, as the Fourier transform of zeros gives a zero spectrum.
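The distance and clustering steps can be sketched as follows. This is a hedged illustration, not the UPGMA software used in the paper: `pad_bits`, `euclidean`, `lower_triangular` and `upgma` are hypothetical helper names, the descriptors are made-up toy vectors, and the tree is returned as nested tuples without branch lengths.

```python
import math


def pad_bits(bits, n):
    """Append zeros so that a binary sequence has length n."""
    return bits + [0] * (n - len(bits))


def euclidean(x, y):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def lower_triangular(descriptors):
    """Row i holds the distances from item i to items 0 .. i-1."""
    return [[euclidean(descriptors[i], descriptors[j]) for j in range(i)]
            for i in range(len(descriptors))]


def upgma(names, lower):
    """Average-linkage (UPGMA) clustering; returns nested tuples."""
    trees = {i: name for i, name in enumerate(names)}
    sizes = {i: 1 for i in trees}
    dist = {(j, i): lower[i][j] for i in range(len(names)) for j in range(i)}
    next_id = len(names)
    while len(trees) > 1:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        merged = {}
        for c in trees:
            if c in (a, b):
                continue
            dac = dist[(min(a, c), max(a, c))]
            dbc = dist[(min(b, c), max(b, c))]
            # Size-weighted average of the distances to the merged clusters.
            merged[(c, next_id)] = ((sizes[a] * dac + sizes[b] * dbc)
                                    / (sizes[a] + sizes[b]))
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(merged)
        trees[next_id] = (trees.pop(a), trees.pop(b))
        sizes[next_id] = sizes.pop(a) + sizes.pop(b)
        next_id += 1
    return next(iter(trees.values()))


names = ["P1", "P2", "P3"]  # hypothetical protein labels
descriptors = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]  # toy values only
lower = lower_triangular(descriptors)
tree = upgma(names, lower)
print(tree)  # ('P2', ('P1', 'P3'))
```

In practice `pad_bits` would be applied to the binary indicator sequences of the shorter protein before the Fourier step, so that all descriptors passed to `lower_triangular` have the same length.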
3. Sequences for Comparison
We have used the NADH dehydrogenase subunit 5 (ND5) and subunit 6 (ND6) protein sequences of nine species for comparison as shown in Table 1.
4. Results and Discussions
The distance matrices obtained by applying our method to the 9 protein sequences of the ND5 and ND6 categories are presented in Table 2 and Table 3 respectively. The phylogenetic trees obtained from these data are presented in Figure 1 and Figure 2 for the ND5 and ND6 categories respectively.
The ICD method, which depends on the Voss-type representation of DNA sequences, is already known to be very successful in comparing DNA sequences. The Voss-type representation of protein sequences is a comparatively newer concept. As it has recently been applied in different areas and found to be very successful there, it is expected to be useful in protein sequence comparison as well. For this reason, an ICD method based on the Voss-type representation of protein sequences has been developed in this paper and used for protein sequence comparison. The present method is thus a new contribution to the literature on protein sequence comparison.
Table 1. List of nine species with their versions and lengths.
Figure 1. Phylogenetic tree obtained for 9 protein sequences of ND5 category.
Figure 2. Phylogenetic tree obtained for 9 protein sequences of ND6 category.
Table 2. Distance matrix (lower triangular) for 9 protein sequences of ND5 category.
The ICD method, whether for DNA sequence comparison or protein sequence comparison, is comparatively easy and straightforward to apply.
To compare our results with those obtained earlier by other methods on the same species, we first present the earlier results as far as possible. The phylogenetic tree obtained in  for 9 species of the ND5 category is given in Figure 3. Similarly, the phylogenetic trees obtained in  for 9 species of the ND5 and ND6 categories are given in Figure 4 and Figure 5 respectively, and the phylogenetic trees obtained in  for 9 species of the ND5 and ND6 categories are given in Figure 6 and Figure 7 respectively.
Figure 3. Phylogenetic tree obtained in  for 9 species of ND5 category.
Figure 4. Phylogenetic tree obtained in  for 9 species of ND5 category.
Table 3. Distance matrix (lower triangular) for 9 protein sequences of ND6 category.
Figure 5. Phylogenetic tree obtained in  for 9 species of ND6 category.
Figure 6. Phylogenetic tree obtained in  for 9 species of ND5 category.
Figure 7. Phylogenetic tree obtained in  for 9 species of ND6 category.
The phylogenetic trees above for the ND5 and ND6 categories of protein show that, in both cases, the trees obtained by our method largely agree with those obtained earlier by other methods.
Our method is effective and easier to apply in protein sequence comparison.