There is a long-held paradigm that the bacterial genome is comprised of a single circular chromosome and additional dispensable non-essential plasmid(s). This view has been revised because of the existence of multiple chromosomes in several bacterial species including Rhodobacter sphaeroides 2.4.1   , Brucella melitensis 16M  , Agrobacterium tumefaciens C58  , Leptospira interrogans Verdun and RZ11  , Paracoccus denitrificans Pd1222  , and Vibrio cholerae AI1837  . The explosion and impact of genome sequencing technology   further illuminated the presence of accessory chromosomes in ten percent of the total sequenced bacterial genomes in the NCBI database, and these species were widely distributed throughout different lineages  . Therefore, the existence of multiple chromosomes is now an accepted paradigm of the bacterial genome structure.
Three strains of R. sphaeroides (2.4.1, ATCC 17025, and ATCC 17029) have been previously examined for their sequence divergences and the result indicated that genes on CIIs have lower nucleotide identity, which suggests the rapid evolution of CIIs in R. sphaeroides  . The study was further expanded among strains of both within-species and between-species of Proteobacteria, including species of α-, β-, and γ-Proteobacteria  . Out of the ten within-species comparisons, eight organisms displayed higher levels of CII-specific DNA sequence divergences. Also, trans-genera pairwise comparisons indicated an average of ~42% nucleotide identity among CIs and a significantly lower nucleotide identity of ~27% among CIIs. The previous studies further demonstrated a high divergence and rapid evolution of CII-specific sequences compared with CI-specific sequences in a majority of bacterial species  .
The current study tests the following four hypotheses. 1) CIIs exhibit lower sequence conservation and sequence divergence compared to their corresponding CIs across species of Proteobacteria. 2) The differential sequence divergence of CI and CII depends on pathogenic and non-pathogenic lifestyles. 3) CIIs harbor a higher level of HGTs than CIs; thus they diverge more rapidly. 4) Orthologs located on CIIs experience less purifying selection than their corresponding those on CIs.
A total of 83 strains were analyzed within 16 individual species comparisons to test the above hypotheses. This study employs a variety of genomic and bioinformatics tools and methods, which include Mauve  , Island Viewer  , GEMINI  , BLAST   , KaKs_Calculator  , MUSCLE  , and PAL2NAL  .
2. Materials and Methods
2.1. Comparative Genome Alignment
One hundred bacterial strains (see Supplementary Table S1) with multipartite genomes were identified using the National Center for Biotechnology Information (NCBI) database, of which 83 of the strains belong to the Proteobacterial phylum. A total of 16 species that contain two or more strains and represent α-, β-, γ-subgroups of Proteobacteria were selected for whole genome alignment. They include 11 pathogenic and 5 non-pathogenic species (see Supplementary Table S2).
DNA sequence files (in FASTA format) for CI and CII from 50 strains were downloaded from the NCBI database. Mauve 2.3.1 was utilized to obtain whole genome alignments of the 16 Proteobacterial species CIs and CIIs, separately. Both percentage conservation from the local collinear blocks (LCBs) and percentage nucleotide identity from the entire conserved regions were extracted. A paired t-test was performed to assess the significant difference in percent conservation and percent nucleotide identity between CIs and CIIs. These results were further classified by subgroup and lifestyle. Again, a paired t-test was used to estimate the significant difference in percent conservation and percent nucleotide identity between CIs and CIIs.
2.2. Estimation of HGT
HGTs in CIs and CIIs were estimated using Island Viewer  which identifies genomic islands (GI) using three methods. Two of the three are sequence composition-based methods, namely, SIGI-HMM and Island Path-DIMOB. The third one, Island Pick, is a comparative genomics method that assesses putative islands by performing alignment of closely related genomes using Mauve  . SIGI-HMM, based on a hidden Markov model (HMM), exploits codon usage bias to discriminate between the native and GI sequences. Island Path-DIMOB recognizes GIs by common characteristics, such as mobility genes (e.g. integrases, transposases) and anomalous sequence composition. Island Viewer takes an integrated approach to increase the accuracy; it also allows access to results from each component method.
A recently published genome-mining tool, called GEMINI  , based on a recursive segmentation and clustering procedure, was utilized to quantify HGTs. The framework recursively segments a genome sequence into two parts at each position where the compositional difference between the resulting sequence segments is maximized in terms of information-theoretic measure. The tool separates the horizontally acquired DNA segments from the vertically inherited DNA segments in a genome, performs a hypothesis testing of whole genome segmentation, and recursively applies the process for each resulting segment, until each segment becomes compositionally homogeneous. CIs and CIIs in the 83 Proteobacterial species were evaluated independently for the HGT segments and a paired t-test was performed to assess any significant difference in the HGT estimates between CIs and CIIs.
2.3. Selective Constraints on Genes in Multipartite Genomes
To quantify the evolutionary divergence of orthologs shared by multiple strains of the same species, BLASTP was employed to compare genes of CI and CII separately within and across species. A total of the 50 strains representing the 16 species within α-, β-, and γ-subgroups were analyzed for further determination of the selective pressures on CI and CII, separately. Each comparison began by pre-processing which gives specific instruction for file extraction, preparation, processing, and storage. The query and target sequences in GenBank flat file format (.gbk) were downloaded from the NCBI database and converted to FASTA (.faa) format. The target files in FASTA format were organized into a local database for sequence comparison. Then, BLASTP was initiated and the query sequence was compared to all sequences in the target local database. Post-processing converted aligned results from the previous step into files available for further analysis. Next, orthologs for CIs and CIIs were identified with BLAST E-values < 10ˉ20 and aminoacid identity > 30%. Orthologs satisfying these criteria were used to determine whether CI and CII have different modes of selection.
The rates of non-synonymous substitution (Ka), synonymous substitution (Ks), and the selective constraints (ω = Ka/Ks) were calculated using KaKs_Calculator  , which requires DNA sequences and computes Ka, Ks, and ω, considering biases in transition to transversion ratio and the codon redundancy. The amino acid sequences corresponding to their nucleic acid sequences of orthologs were aligned using MUSCLE  and then reverse translated to DNA sequences using PAL2NAL  . The resulting nucleic acid sequences were used as an input to the KaKs_Calculator, and then Ka, Ks, and ω were calculated. Negative selection is inferred if ω < 0.3, implying that the orthologs likely have maintained identical function. Neutral selection is inferred if 0.3 < ω < 3, suggesting orthologues are nearly equivalent and more likely to perform similar functions. Positive selection is inferred if ω > 3, allowing new gene functions to evolve.
3.1. Multipartite Genome Evolution in Prokaryotes
Chromosome size, gene count, and gene density (genes/kb) on CIs, CIIs, and CIIIs of the 100 bacterial strains are shown in Supplemental Table S1. The α-proteobacterial subgroup has 27 strains representing 17 species of eight different genera. The β-subgroup has 35 strains distributed over 19 species within four genera. The γ-subgroup has 21 strains consisting of 13 species within three genera. Of the 83 strains in the three subgroups of Proteobacteria, 50 strains, including 35 pathogenic strains representing 11 species and 15 non-pathogenic strains representing five species, are listed in Supplemental Table S2(a) and Supplemental Table S2(b).
Whole genome alignments for the related strains of the species within each α-, β-, and γ-subgroup in Proteobacteria were examined. As an example of visual representation, the whole genome alignment of CIs and CIIs for the four strains of R. sphaeroides is depicted in Figure 1. The colored blocks represent local collinear blocks (LCBs), which are areas of conserved regions between four strains and the vertical lines link each matching LCB displaying large-scale chromosomal alterations, such as large deletions, insertions, translocations, and inversions. The alignment results indicate 80% conservation of CIs, while 27% sequence conservation of CIIs among all the four strains of R. sphaeroides. Also, inversions, deletions, and translocations are more abundant on CIIs, possibly contributing to the rapid divergence of CIIs in R. sphaeroides.
The whole genome alignment results, including the percent conserved regions and percent nucleotide identity for CIs and CIIs in the 50 strains within 16 species, are depicted in Figure 2(a) and Figure 2(b). Also, the data associated with
Figure 1. A visual representation of CIs (top panel) and CIIs (bottom panel) alignment of the four strains (2.4.1, ATCC 17029, KD131, and ATCC 17025) of R. sphaeroides. Conserved local collinear blocks (LCBs) are displayed as colored boxes among the four strains and the vertical lines link each matching LCB displaying large-scale deletions, insertions, translocations, and inversions.
Figure 2. (a) Percent conservation of CI (in blue) and CII (in green) among the 11 pathogenic species (left) and the 5 non-pathogenic species (right). The whole genome alignment for the pathogenic strains was completed using 35 sequenced multipartite genomes within 4 genera of α-, β-, and γ-subgroups of Proteobacteria. Also, the whole genome alignment for the pathogenic strains was completed using 15 sequenced multipartite genomes within 4 genera. The number of strains used in each species is specified in parentheses.; (b) Percent nucleotide identity of CI and CII within the total conserved regions (Local Collinear Blocks, LCBs) among the 11 pathogenic species (left) and the 5 non-pathogenic species (right). The whole genome alignment for the pathogenic strains was completed using 35 sequenced multipartite genomes within 4 genera of α-, β-, and γ-subgroups of Proteobacteria. Also, the whole genome alignment for the pathogenic strains was completed using 15 sequenced multipartite genomes within 4 genera. The number of strains used in each species is specified in parentheses.
the figures are given in Supplemental Table S2(a) and Table S2(b). Figure 2 exhibits that of the 16 whole genome alignments, 11 CI alignments show higher % conservation than CII. To be more specific, 4 out of 5 species in α-subgroup have higher % conservation in CI than in CII; 4 out of 7 species in β-subgroup have higher % conservation in CI than in CII, and 3 out of 4 species in γ-subgroup have higher % conservation on CI than CII. A paired t-test was performed on all the 16 alignments, and there was a significant difference (P-Value = 0.037) in the total conserved regions on CIs compared to those on CIIs.
3.2. Sequence Conservation in Nucleotide Identity between Pathogenic and Non-Pathogenic Life-Styles
The impact of lifestyle on sequence conservation was also examined by classifying the species as pathogens and non-pathogens. The 11 pathogenic strains in Figure 2(a) show that sequence conservation between CIs and CIIs was not significantly different (Paired t-test, P-Value = 0.690). However, among the five non-pathogenic strains also shown in Figure 2(a), sequence conservation for CIs were significantly higher than that for CIIs (Paired t-test, P-Value = 0.016). These results indicate that CIIs in non-pathogenic strains within α-, β-, and γ-subgroups in Proteobacteria have diverged more rapidly than CIs.
All the 16 whole genome alignments also revealed that the percent nucleotide identity of the conserved regions in CII was significantly lower than that of the CI conserved regions (Paired t-test, P-Value = 0.036), as shown in Figure 2(b). The percent nucleotide identity between pathogenic and non-pathogenic species was shown in Figure 2(b). In the pathogenic bacterial strains, percent nucleotide identity for CIs is not significantly different from that for CIIs (Paired t-test, P-Value = 0.437). However, percent nucleotide identity for CIs is significantly higher than that for CIIs in the non-pathogenic bacteria (Paired t-test, P-Value = 0.005).
3.3. Horizontal Gene Transfer Estimates for CIs and CIIs
Island Viewer estimates of HGT among the 83 strains within the α-, β-, and γ-subgroups of Proteobacteria are listed in Supplemental Table S3. There is no significant difference in HGT estimates between CIs and CIIs (Paired t-test, P-Value = 0.790). In each subgroup, the following number of strains had higher HGT estimates on CI compared to CII: 17 out of 27 strains in α-subgroup, 19 out of 35 strains in β-subgroup, and 12 out of 21 strains in γ-subgroup. The result of paired t-test indicated no significant difference in the HGT levels between CIs and CIIs within every subgroup as follows: α-subgroup (P-Value = 0.707), β-subgroup (P-Value = 0.793), and γ-subgroup (P-Value = 0.857). For both α- and β-subgroups, the average HGT estimates are slightly higher in CIs than in CIIs. However, in γ-subgroup, the average HGT estimate for CIs is slightly lower than that for CIIs, but not significantly different. Furthermore, HGT estimates for the 83 strains were compared between the two subcategories: pathogens and non-pathogens. There was no significant difference in the HGT estimate between CIs and CIIs for the pathogenic strains (P-Value = 0.407) and the non-pathogenic strains (P-Value = 0.544).
The HGT estimates for CI and CII of R. sphaeroides were obtained from a recently proposed integrated segmentation and clustering method, GEMINI and they are schematically represented in Figure 3. Furthermore, the different HGT estimates of α-, β-, and γ-subgroups of Proteobacteria are shown in Figure 4. A total of 48 out of 76 strains had higher HGT estimates (% nucleotides labeled “alien”) on CIs than those on CIIs as follows: α-subgroup (16 out of 24 strains), β-subgroup (21 out of 34 strains), and γ-subgroup (11 out of 18 strains). A paired t-test result indicated no significant difference in HGT estimates between CI and CII (P-Value = 0.180), although this difference was more pronounced than that observed using Island Viewer (P-Value = 0.790). As observed with Island Viewer, the difference within every subgroup was also not significant as P-values for α-subgroup, β-subgroup and γ-subgroup were 0.4231, 0.4064 and 0.4354, respectively. Together, these results suggest that HGT events are not a significant player in the rapid evolution of accessory chromosomes in Proteobacteria.
Figure 3. A schematic representation of HGT estimates within chromosomes I (CI) and II (CII) of Rhodobacter sphaeroides 2.4.1. HGT estimates are obtained by employing an integrated segmentation and clustering method. Each chromosome contains both native sequence segment in red color and alien sequence segments in blue, green, and yellow colors. Gene clusters are organized with their corresponding coordinates in CI and CII. CI contains 153 segments, of which 67 segments represented a single large native cluster in red color and the remaining 86 segments represented 16 different gene clusters in different colors. CII contains 45 segments, of which 22 segments are in a single large native gene cluster in red color and the remaining 23 segments representing four alien gene clusters in different colors.
Figure 4. Proportion of vertically inherited (“Native”) and horizontally acquired (“Alien”) DNAs within the chromosomes of α-, β-, and γ-subgroup Proteobacteria, deciphered using an integrative method of segmentation and clustering. Both segmentation and clustering were performed within a statistical hypothesis framework with significance levels for segmentation and two-stage clustering set at 10−5, 10−6 and 10−7, respectively. Both segmentation and clustering procedures employed Markovian Jensen-Shannon divergence measure for quantifying difference between DNA sequences. Homogeneous Markov models of order 2 were used for analysis.
3.4. Selective Constraints of CI and CII
Orthologs present on CI and CII within the 50 strains were identified and selective constraint analysis of CIs and CII of the 16 species was performed. The number of orthologs between each species, their multi-way comparisons, and their average Ka, Ks, and ω values are presented in Supplemental Table S4. In all the 16 species, the majority of the orthologs on CI and CII are under negative selection as shown in Figure 5; however, the average selective constraint value for CI (ω = 0.124) was lower than that for CII (ω = 0.196). T-test indicated a significant difference between CI and CII selective pressures among the 16 species (P-Value = 0.038).
4.1. Differential Evolution of CI and CII in Prokaryotes
Mauve analysis revealed that sequence conservation and DNA sequence divergence levels for CIs and CIIs are significantly different in within-species comparisons. Thirteen out of sixteen such comparisons revealed that CIIs have a higher nucleotide divergence than their corresponding CIs, and thus CIIs evolved faster than CIs. However, in species Ralstonia pickettii CII is highly conserved and CII might have originated more recently in this species and may not have had enough time to diverge. Alternatively, genes on CII in this species
Figure 5. Average selective constraint (ω) value of CI (in blue) and CII (in green) in the α-, β-, and γ-subgroups of Proteobacteria. The number of strains used in each species comparison is specified in parentheses. Pairwise comparisons of orthologs located on the corresponding CI and CII across species compute the rates of non-synonymous substitution (Ka), synonymous substitution (Ks). The selective constraints (ω = Ka/Ks) are averaged.
are functionally constrained and possibly involved with the species’ pathogenic lifestyle. Also, no significant differences in CI and CII sequence divergence exist in other pathogenic strains examined in this study. This finding suggests that a higher level of purifying selection operates on the genomes of pathogenic strains, allowing organisms to retain CII encoded essential functions that aid in pathogenicity.
In contrast, there is a significant difference in genetic divergences between CI and CII in non-pathogenic bacteria. The rapid evolution of CII in the majority of bacterial species, especially in non-pathogenic strains, is possibly due to relaxed selection on CII compared to CI and/or more frequent HGTs on CII than on CI.
4.2. Similar HGT Estimates in CI and CII
HGT plays an important role in the evolution of bacterial genomes  , metabolic functions  , and re-patterning of the regulation mechanisms  . Since CIs are longer than CIIs, we presume that CIs may have higher HGT estimates than CIIs. Alternatively, as CIs possess more essential genes than CIIs, CIs are evolutionarily more constrained than CIIs. Therefore, in spite of the greater length of CIs, smaller CIIs may have a higher number of HGT estimates. Results of this study do not exhibit a significant difference of HGT estimates between CIs and CIIs across the bacterial species examined. Therefore, HGT may not appear to be a major mechanism contributing to the rapid evolution of CII in Proteobacteria. Even when the genomes were examined separately by pathogenic and non-pathogenic group, the results still indicated no difference in HGT levels between CI and CII. The role of HGT in the initial phase of origin and development of the accessory chromosome may be important; however, it may not contribute much through remaining evolutionary processes.
Since HGT is a continuing process, the amount of HGTs present in replicon is possibly indicative of the relative evolutionary history. The high HGT estimates in certain CIIs could suggest a recent origin of the accessory chromosome, while similar HGT estimates on CI and CII may suggest ancient origin of CII. In the case of R. sphaeroides, CI and CII have similar HGT estimates. However, three out of the four strains have slightly higher HGT estimates on CII. The oldest strain ATCC 17025 has higher HGT estimates on CI, which supports the theory of the ancient origin of CII   , while the other three strains that emerged later display higher HGT estimates on CII than on CI. The older CIIs may have lost a substantial fraction of earlier acquired alien DNAs. It is also possible that the old alien genes may have ameliorated their composition to that of the native genome and therefore may not be detected by Island Viewer. On the other hand, foreign genes in recently originated CIIs might not have substantially ameliorated yet and therefore they may be identified by Island Viewer. Furthermore, among all the three subgroups, α-subgroup displayed slightly higher HGT estimate for CIthan CII.
4.3. CI and CII Are under Purifying Selection
Selective constraint (ω) is a measure of the ratio of the synonymous substitution rate (Ks) and nonsynonymous substitution rate (Ka). It signifies the mode of selective pressures  : negative, neutral, or positive selection. The majority of orthologous gene pairs located on both chromosomes have selective constraint (ω) values < 0.3, which demonstrates that regardless of their locations, these orthologs were maintained under purifying selection. Although both chromosomes are under purifying selection, selective constraints on CII are more relaxed than that on CI. The differential selective pressure on multipartite genomes supports the rapid divergence of CII. As genes on the accessory chromosomes experience weaker negative selection because they are possibly semi-essential, not always expressed, redundantly connected within gene network, and/or more susceptible to mutation  ; thus, they diverge more rapidly than their corresponding primary chromosomes. Previous studies have demonstrated that the synonymous codon usage orderliness (SCUO) was significantly less on CII than on CI  . This codon usage bias is reflective of the decreased purifying selection on CII. Additionally, codon usage bias varied greatly among genomes and was highly dependent on their G + C composition. Furthermore, as the reduced SCUO is an inherent attribute of genes on CII, they experience reduced selection for translation efficiency because of either their reduced expression or greater protein dispensability  . CIIs of most multipartite genomes harbor more nonessential genes including genes of hypothetical protein and unknown functions. This suggests that the nonessential genes possibly evolve faster and therefore they are much more suited for biodiversity including in the emergence of new strains and species.
4.4. Evolutionary Mechanisms for the Origin of CIIs
Two different models have been previously invoked to explain the origin of CIIs in bacteria  . In the first model, when a native or a newly captured plasmid from another species secures some essential genes via intra- or inter-genomic gene transfer from CI, this plasmid becomes non-dispensable for bacterial survival an growth under all growth conditions. In this scenario, the presence of a plasmid type of replication-origin and some essential genes for the cell survival will be preserved in the cell. In the second model, CI in the ancestral cell breaks apart to form two chromosomes of unequal sizes. As the two newly formed chromosomes still maintain a chromosomal origin of replication and many essential or important gene functions, they are stably retained in the cell. For example, the origin of second linear chromosome of Agrobacterium tumefaciens C58 as well as CII (pSymA) and CIII (pSymB) of Sinorhizobium meliloti is supported by the plasmid origin model, while the second circular chromosome of Brucella melitensis 16M and Rhodobacter sphaeroides 2.4.1 revealed that CIIs originated from their ancestral primary chromosomes.
4.5. Evolutionary Role of CIIs
The multipartite genome structure in bacteria has three advantages, which include shortening the genome replication time, specializing the CII for a specific or pathogenic environment, and evolving new metabolic functions for generating species diversity. The first advantage is to reduce the duration of the genome replication and cell doubling or cell cycle, when a large primary chromosome splits into two chromosomes; therefore, the multipartite genome structure helps bacteria to cope with the nutrient-rich environments. The second advantage is to differentially regulate the expression of genes on different chromosomes under a specific host or free-living condition. It was validated that in Vibrio cholera, a pathogenic bacterium, CII-specific genes involved in various metabolisms and pathogenicity are highly expressed under pathogenic conditions, while CI-specific genes are expressed at similar levels under both free-living and pathogenic conditions  . The current study has shown that the orthologs located on CII compared to the corresponding orthologs present on CI are structurally and functionally less constrained. Therefore, CII-specific genes may evolve more rapidly and thus generate high level of strain or species diversity within a specific group of bacteria.
This study concludes that CII-specific sequences have higher sequence divergence than their corresponding CI-specific sequences, and therefore CII evolved rapidly. The results of previous and current studies suggest that both differential selective constraint and pathogenic/non-pathogenic lifestyle seem to be the major driving force for the genetic divergence between the primary and the accessory chromosomes. Although HGTs do not seem to be a major factor for the differential sequence divergences of CIs and CIIs, the role of HGT for genome evolution should not be underestimated based on the limited number of species examined. Future studies focused on transcriptome and proteome analyses will help establish if gene orthologs located on primary and accessory chromosomes are differentially expressed for adaptation in specific host-environments or free-living growth conditions.
We thank the Department of Biological Sciences of Sam Houston State University for financial support. We also thank Lauren Flores for proofreading the manuscript.
Conceptualization, M. C., H. C., and R. K. A.; methodology and investigation, C. T., R. S. P., H. C., M. C., and R. K. A.; formal analysis, C. T., H.C., M. C., and R. K. A.; manuscript preparation, M. C., H. C., U. S., and R. K. A
Table S1. Chromosomes size, gene count, and gene density (genes/kb) on CI, CII, and CIII of 100 bacterial strains. All chromosomes are circular in structure unless denoted with an (L) which are linear in structure.
Table S2. (a) Percent conservation and percent nucleotide identy of CI and CII among pathogenic species; (b) Percent conservation and percent nucleotide identy of CI and CII among non-pathogenic species.
Table S3. HGT estimates of CI and CII in 83 bacterial strains.
Table S4. Average Ka, Ks, and ω values of CI and CII in16 species within α-, β-, γ-subgroup of Proteobacteria. The number of strains used in each comparison is indicated in parentheses.