CMB  Vol.10 No.3 , September 2020
Similarity Studies of Corona Viruses through Chaos Game Representation
Abstract: The novel coronavirus (SARS-COV-2) is generally referred to as Covid-19 virus has spread to 213 countries with nearly 7 million confirmed cases and nearly 400,000 deaths. Such major outbreaks demand classification and origin of the virus genomic sequence, for planning, containment, and treatment. Motivated by the above need, we report two alignment-free methods combing with CGR to perform clustering analysis and create a phylogenetic tree based on it. To each DNA sequence we associate a matrix then define distance between two DNA sequences to be the distance between their associated matrix. These methods are being used for phylogenetic analysis of coronavirus sequences. Our approach provides a powerful tool for analyzing and annotating genomes and their phylogenetic relationships. We also compare our tool to ClustalX algorithm which is one of the most popular alignment methods. Our alignment-free methods are shown to be capable of finding closest genetic relatives of coronaviruses.

1. Introduction

Deoxyribonucleic Acid (DNA) is a molecule that encodes the genetic instructions used in the development and functioning of all known living organisms. As such, DNA has become a subject of both theoretical and applied studies for the last decades. DNA is a polymer of nucleotides. Nucleotides are the building blocks of DNA. The four different nucleotides of DNA are: adenine (A), cytosine (C), guanine (G), and thymine (T).

DNA sequences analysis, as one of the most important parts of bioinformatics, which was considered to reveal the essence of all life phenomenon, has been developing rapidly in recent years. Sequence comparison is crucial to understand the evolutionary relationships among organisms. Many methods have been proposed to compare genetic sequences. Traditionally, most of these approaches are the widely used alignment-based methods. In these methods, molecular sequences are optimally aligned based on selected scoring systems. The alignment-based methods often give high accuracy and may reveal the relationships among sequences. Some algorithms have been established and incorporated into software for sequence alignments. However, one of the main drawbacks of these techniques is that they are very time-consuming and expensive in memory usage. As a result, alignment-free approaches such as in [1] [2] [3] have attracted more and more attention and have been applied to biological sequence comparison as well as phylogeny analysis.

Chaos Game Representation (CGR) is an iterative system method originally proposed by Jeffery [4]. CGR is a two-dimensional plot, where the primary sequence organization of DNA is mapping using iterative function. CGR patterns of DNA segments have been proposed as a method for the classification and Identification of genomic sequences [4] - [10]. The use of CGR has mostly been restricted to a visualization tool representing nucleotide sequences, in which patterns like over-or underrepresentation of nucleotides, dinucleotides, trinucleotides, etc. can be visually ascribed. Goldman concluded that the patterns exhibited by CGR are sufficient to evaluate word length composition of three, i.e., the frequencies of nucleotides, dinucleotides and trinucleotides [5]. However, it was shown later that longer oligonucleotide frequencies also influence the patterns seen in CGR [9]. Later, a spectrum of word lengths, in addition to nucleotide and dinucleotide, in CGRs were identified as factors that can differentiate between genomes of different species. Several distance measures were proposed to compare two or more CGRs and it was employed for studying phylogenetic relationships among diverse species [7] [8] [9] [10] [11]. However, it is not clear if intra-species genomic variability, which is much less than between-species variation, can be resolved using CGRs with similar word lengths. Later it was found in [12], the value k = 7 achieved the highest accuracy scores for HIV-1 subtypes classification.

The motivation of writing this paper came from the recent outbreak of novel coronavirus (SARS-Cov-2) now known as Covid-19. SARS-CoV-2 is the third pathogenic novel coronavirus to emerge over the past two decades. The first, discovered in 2003 and named SARS-CoV, caused SARS, a serious and atypical pneumonia. The second, MERS-CoV, emerged a decade later in the Middle East and caused a similar respiratory ailment called Middle East respiratory syndrome (MERS). Since its identification, 2494 cases of MERS-CoV infection and nearly 900 deaths have been documented [13]. The SARS-CoV epidemic proved larger but less deadly, with approximately 8000 cases and nearly 800 deaths [14]. There are other four coronaviruses that cause colds in humans—known as HCoV-229E, HCoV-NL63, HCoV-OC43 and HCoV-HKU1 [15].

In this paper, we proposed two methods, i.e., probability matrix method and centroid matrix method combining with CGR to construct distancematrix between two genomes, and then create dendrogram using Hierarchical Agglomerative Clustering (HAC) analysis. Our dendrogram can accurately identify the genetic relationship of different biology, and this method is generally applicable to various organisms.

2. Methods

In this section we first describe the dataset used for our analysis, then present an overview of the three main steps of the method and conclude with a description of the two distances that we considered.

2.1. Dataset

Data acquisition: All viral sequences downloaded in FASTA format from two databases for our analysis: NCBI ( and GISAID (

For our experiment, we used only complete genomes of 15 corona viruses as it is given in Table 1.

2.2. Overview

The method we used to analyze and classify the 15 sequences of the dataset has three steps: 1) generate graphical representations (images) of each DNA sequence using CGR and define FCGR probability matrix and CGR centroid method using the features of CGR; 2) compute all pairwise distance to obtain two distance matrices; and 3) create the dendrogram of the distance matrix using Hierarchical Agglomerative Clustering (HAC) analysis.

Table 1. Dataset for the experiment.

CGR is an iterative method introduced by Jeffery [4] to visualize the structure of a DNA sequence. A CGR associates an image to each DNA sequence as follows: starting from a square with corner labeled four nucleotides C, G, A and T, and the center of the square as the starting point, the image is obtained by successively plotting nucleotide as the middle point between the current point and the corner labeled by the nucleotide to be plotted. If the generated square image has a size of 2 k × 2 k pixels, then every pixel represents a distinct k-mer: A pixel is color red if the k-mer it represents appears in the DNA sequence, otherwise it is white. CGR images of generating DNA sequences coming from various species show pattern such as squares, parallel lines, rectangles, triangles, and also complex fractal patterns. We have created CGR of all 15 virus genomes and visually they look similar (see Figure 1 below).

For step (1), we will use a slight modification version of the original CGR, introduced in [9]: a k-th order FCGR (Frequency Chaos Game Representation) is a 2 k × 2 k matrix that can be constructed by dividing the CGR plot into a 2 k × 2 k grid, and defining the element | a i j | is as the number of points that are situated in the corresponding grid square. A first-order FCGR and a second-order FCGR have the structure shown below, where N w is the number of occurrences of the k-mer w, in the sequence s is

F C G R 1 ( s ) = ( N C N G N A N T ) and F C G R 2 ( s ) = ( N C C N G C N C G N G G N A C N T C N A G N T G N C A N G A N C T N G T N A A N T A N A T N T T ) .

The (k + 1)th order F C G R k + 1 ( s ) can be obtained by replacing each element N X in F C G R k ( s ) with four elements ( N C X N G X N A X N T X ) where X is a sequence

of length k over the alphabet {A, C, G, T}. For each k 1 , we can define a probability matrix of F C G R k ( s ) by taking each entry of F C G R k ( s ) dividing by the total counts of all k-mers. We denote the FCGR probability matrix by ( P i j ) , 1 i , j 2 k . Note that i , j P i j = 1 . Probability matrix can be interpreted as probability of distribution.

Since the CGR captures the information of the whole genome data, extracting the global features from the CGR may not be efficient enough to distinguish the genomes. In CGR Centroid method, we concentrate on extracting the local features as shown in [11]. We partition the CGR into sub-regions so that it reveals local information of the interested areas. If two dots are within the same quadrant, they correspond to sequences with the same last mononucleotide; if they are in the same sub-quadrant, the sequences have the same last dinucleotides; and so on. This can demonstrate the structure of the sequences yielding the points in the CGR. Chaos Centroid method utilizes this biological significance by computing the centroid of the distributed points of each sub-region.

Figure 1. CGR images of all fifteen coronaviruses listed in Table 1.

For Chaos Centroid method, the CGR is partitioned into 10 × 10 equal subregion. The choice of 10 is to minimize the computation time. For each partition, we compute the centroidas follows. Let ( x k , y k ) be the coordinates of a point in the CGR. We define the centroid in each of the 10 × 10 grid as follows:

c i j = ( k = 1 | a i j | x k | a i j | , k = 1 | a i j | y k | a i j | ) , 1 i , j 10 .

For step (2), after computing FCGR probability matrices and computing centroid for each of the sequences in the dataset, the goal was to measure “distance” between two CGR images. There are many distances as it is given in [10] [11] that can be defined for our purpose. One of the goals of this study was to identify what distance is better able to differentiate the structural differences of various genomic DNA sequences. In this paper we use two different distances: FCGR Probability Matrix distance and CGR Centroid distance. Both use the Euclidean distance. For step (3), after computing all pairwise distances we obtained two different distance matrices. Then, we created the dendrogram of the distance matrices using Hierarchical Agglomerative Clustering (HAC) analysis.

2.3. Distances

In this section we formally define each of the two distances. For two FCGR probability matrices ( p i j ) and ( p i j ) we define d i j = | p i j p i j | . The distance

between the two probability matrices denoted by D P M = i = 1 2 k j = 1 2 k d i j . For two

genomes, we calculate 100 centroids c i j = ( x i j , y i j ) and c i j = ( x i j , y i j ) respectively for 1 i , j 10 . Then we found Euclidean distance between them

d i j = ( x i j x i j ) 2 + ( y i j y i j ) 2 . Then calculated the centroid distance between two genomes denoted by D c d = i = 1 10 j = 1 10 d i j .

3. Results

For our dataset we used k = 7, that is, each DNA sequence represented as a 2 7 × 2 7 FCGR matrix. In [12], it was found highest accuracy in HIV-1 classification and this value is being used here as it is relevant for our viral analysis. Table 2 display the pairwise distance among 15-virus genomes in the dataset using probability matrix distance while Table 3 display the same using centroid distance.

Table 2. Probability distance matrix of 15 viruses listed in Table 1.

Table 3. CGR Centroid distance matrix of 15 viruses listed in Table 1.

Figure 2 shows the phylogenetic tree obtained using Table 2 distances by python Hierarchical Agglomerative Clustering (HAC) analysis. Similarly Figure 3 shows the phylogenetic tree using Table 3. Figure 4 is the Neighbor Joining Phylogenetic tree using traditional Clustal X method.

From Figure 2 and Figure 4, we can see that the cluster results between Clustal X method and probability distance method are essentially same. Similar Phylogenetic analysis of bat coronaviruses with other coronaviruses and the phylogenetic tree was constructed using Clustal W also done in [15].

Figure 2. HAC phylogenetic tree using probability matrix distance.

Figure 3. HAC phylogenetic tree using CGR centroid distance.

Figure 4. Phylogenetic Tree was created by Clustal X by aligning 15 DNA sequences using Neighborhood Joining Method.

All sequence data contain inherent information that can be measured by Shannon’s uncertainty theory. Measuring uncertainty may be used for rapid screening for sequences with matches in available database, prioritizing computational resources, and indicating which sequences with no known similarities are likely to be important for more detailed analysis as seen in [16]. We started with 57 genome sequences and then reduced to 15 based on the Shannon Entropy and Shannon Entropy of 7-mers of the sequences, see Figure 5 and Figure 6. All Covid-19 sequences have entropy close to 1.957. We choose only six Covid-19 sequences in the dataset along with all other sequences with deviated entropy from 1.957 for our analysis of corona viruses.

Figure 5. Shannon Entropy of 57-virus genomes.

Figure 6. 7-mers Shannon Entropy of 57 virus sequences.

4. Discussion and Conclusions

Our methods are comparable to many other alignment-free methods as shown in [1] [2] [3] [11]. The proposed methods i.e. FCGR Probability and Chaos Centroid, are based on Chaos game representation, which provides a unique and scale-independent representation of DNA sequences through the statistical distribution of k-mers along DNA sequences. An advantage of CGR over alignment is that it has the potential to reveal the evolutionary and/or functional relationships between the sequences having no significant homology, as explained in [17]. Furthermore, it does not require prior knowledge of consensus sequences, nor does it involve exhaustive searches for sequences in databases. The limitation of CGR is that it takes a computational time to generate the representations from DNA sequences.

In conclusion, results show that our method can accurately classify different genomic sequences. In terms of classification accuracy, our method is basically the same as the state-of-the art Clustal X and compare with the traditional Clustal X phylogenetic tree construction method [18], our method is much faster. Furthermore, our dendrogram construct method can be widely applicable for various kinds of organisms. This research may contribute to reveal the biological evolution process to some extent, as well as promote the further development of bioinformatics. We may make efforts in our future work to provide a webserver for the methods presented in this paper. All the codes in this paper are written in python and can be available upon request.


This work was done while D.C.S. mentored two undergraduate students M. D. H. and K. R. B. However, M. D. H. was partially funded by NIH-Minority Access to Research Careers (MARC) program (Grant #T34-GM10083) in Spring 2020 and NSF-Louis Stokes Alliances for Minority Participation (LSAMP) Program (Grant #NSF-1712724) in Summer 2020.

Cite this paper: Sengupta, D. , Hill, M. , Benton, K. and Banerjee, H. (2020) Similarity Studies of Corona Viruses through Chaos Game Representation. Computational Molecular Bioscience, 10, 61-72. doi: 10.4236/cmb.2020.103004.

[1]   Ni, H.M., Qi, D.W. and Mu, H. (2018) Applying MSSIM Combined Chaos Game Representation to Genome Sequences Analysis. Genomics, 110, 180-190.

[2]   Stan, C., Cristescu, C.P. and Scarlat, E.I. (2010) Similarity Analysis for DNA Sequences Based on Chaos Game Representation. Case Study: The Albumin. Journal of Theoretical Biology, 267, 513-518.

[3]   Li, Y., He, L., et al. (2017) A Novel Fast Vector Method for Genetic Sequence Comparison. Scientific Reports, 7, Article No. 12226.

[4]   Jeffrey, H.J. (1990) Chaos Game Representation of Gene Structure. Nucleic Acids Research, 18, 2163-2170.

[5]   Goldman, N. (1993) Nucleotide, Dinucleotide and Trinucleotide Frequencies Explain Patterns Observed in Chaos Game Representations of DNA Sequences. Nucleic Acids Research, 21, 2487-2491.

[6]   Kari, L., Hill, K.A., Sayem, A.S., Karamichalis, R., Bryans, N., et al. (2015) Mapping the Space of Genomic Signatures. PLoS ONE, 10, e0119815.

[7]   Almeida, J.S., Carriço, J.A., Maretzek, A., Noble, P.A. and Fletcher, M. (2001) Analysis of Genomic Sequences by Chaos Game Representation. Bioinformatics, 17, 429-437.

[8]   Wang, Y., Hill, K., Singh, S. and Kari, L. (2005) The Spectrum of Genomic Signatures: From Di-Nucleotides to Chaos Game Representation. Gene, 346, 173-185.

[9]   Deschavanne, P., Giron, A., Vilain, J., Fagot, G. and Fertil, B. (1999) Genomic Signature: Characterization and Classification of Species Assessed by Chaos Game Representation of Sequences. Molecular Biology and Evolution, 16, 1391-1399.

[10]   Karamichalis, R., Kari, L., Konstantinidis, S., et al. (2015) An Investigation into Inter- and Intra-Genomic Variations of Graphic Genomic Signatures. BMC Bioinformatics, 16, Article No. 246.

[11]   Tanchotsrinon, W., Lursinsap, C. and Poovorawan, Y. (2015) A High Performance Prediction of HPV Genotypes by Chaos Game Representation and Singular Value Decomposition. BMC Bioinformatics, 16, 71.

[12]   Solis-Reyes, S., Avino, M. and Poon, A. (2018) An Open-Source k-mer Based Machine Learning Tool for Fast and Accurate Subtyping of HIV-1 Genomes. PLoS ONE, 13, e0206409.

[13]   WHO (2019) Middle East Respiratory Syndrome Coronavirus (MERS-CoV).

[14]   WHO (2020) Summary Table of SARS Cases by Country, November 1, 2002-August 7, 2003.

[15]   Hu, B., Ge, X., Wang, L., et al. (2015) Bat Origin of Human Coronaviruses. Virology Journal, 12, 221.

[16]   Akhter, S., Bailey, B., Salamon, P., et al. (2013) Applying Shannon’s Information Theory to Bacterial and Phage Genomes and Metagenomes. Scientific Reports, 3, Article No. 1033.

[17]   Basu, S., Pan, A., Dutta, C. and Das, J. (1997) Chaos Game Representation of Proteins. Journal of Molecular Graphics and Modelling, 15, 279-289.

[18]   Larkin, M.A., et al. (2007) Clustal W and Clustal X Version 2.0. Bioinformatics, 23, 2947-2948.