The definition of heavy metals has changed over the years: they were first defined as metals with a density five times greater than that of water and later as metals with densities above 4 to 5 g/cm³. The heavy metals group comprises about 30 metals and metalloids, including zinc (Zn), mercury (Hg), gold (Au), lead (Pb), cadmium (Cd), copper (Cu), silver (Ag), platinum (Pt), arsenic (As), and chromium (Cr), which have densities greater than 5 g/cm³. Heavy metals such as zinc, magnesium, copper, chromium, or nickel may benefit the organism nutritionally as cofactors, while other metals, such as lead, cadmium, mercury, arsenic, and gold, have not yet been identified as beneficial to the organism. Regardless of nutritional benefit, all metals become toxic when they accumulate to high concentrations in the cell. The toxicity of a metal depends not only on its concentration but also on its chemical structure, the time of exposure, and the source of the contamination. Heavy metal contamination, particularly of air, soil, and water, is a major problem worldwide because of these toxic effects. The toxic contaminants come from a variety of sources, including industrial effluents, gold mines, acid rain, and metal ions leaching into the soil. Each metal has a different concentration at which it is deemed toxic to the environment and to the human body. These toxic pollutants pose serious health risks to humans, including bone loss, kidney damage, neurological damage, skin cancer, and lung cancer. Some of these metals, including chromium, cobalt, and nickel, play vital roles in metabolic processes: they serve as essential micronutrients, stabilize molecules, catalyze enzymatic reactions, help regulate osmotic balance, and take part in redox reactions.
Whether essential or non-essential, heavy metals become toxic to organisms at high concentrations, resulting in bioaccumulation, changes in the conformational structure of nucleic acids and proteins, damage to DNA and the cell membrane, and interference with oxidative phosphorylation and osmotic balance. The resistance mechanisms identified so far include intracellular and extracellular sequestration, exclusion by a permeability barrier, efflux pumps, active transport, reduction of heavy metal ions and cellular targets, and enzymatic detoxification. Proteobacterial species are metabolically versatile, and several have previously been shown to resist heavy metals, including gold, silver, and platinum. In addition, Rhodobacter capsulatus, a species closely related to R. sphaeroides, has demonstrated considerable gold resistance and bioaccumulation of gold bio-nanoparticles. The explosive growth of biological data, including but not limited to genome and transcriptome data, has made computational tools necessary for managing and analyzing large genome databases efficiently. One of the most widely used approaches in bioinformatics is the analysis of large numbers of gene or protein sequences from fully annotated genomes.
To better understand the mechanisms of tolerance, analysis of heavy metal resistance genes across genomes can show how these genes are distributed among specific groups of microorganisms and thus which groups have the ability to tolerate metal contamination. In this study, bioinformatics approaches are used to analyze the gene and protein sequences of Proteobacteria and to examine their potential use for the bioremediation of heavy metals. The protein table files (.ptt) were downloaded from the National Center for Biotechnology Information (NCBI) database, and the distribution of the heavy metal tolerance genes identified within the different subclasses of Proteobacteria (α Proteobacteria, β Proteobacteria, γ Proteobacteria, δ Proteobacteria, and ε Proteobacteria) was studied. Two hypotheses are tested in the current study. First, heavy metal related genes are more abundant in the Proteobacteria than in other bacterial groups. Second, the genome of R. sphaeroides, which belongs to the Proteobacteria, contains heavy metal resistance genes representing cellular metabolism, including gene functions such as transport, energy production, and macromolecular biosynthesis. If these two hypotheses are validated, future studies will use R. sphaeroides as a model species for studying bacterial tolerance to other toxic heavy metals.
2. Materials and Methods
2.1. NCBI Database
A total of 82,895 genomes were available in the NCBI database prior to our use of the NCBI FTP site, including Eukaryotes (3494 genomes), Prokaryotes (73,708 genomes), and Viruses (5673 genomes), as well as some organelles and synthetic plasmids. The completely sequenced genomes within the database include 5654 bacterial genomes, representing 2526 unique species. The genome database also contains many partially sequenced genomes that have not yet been annotated; these were excluded from the current study.
2.2. Identification of Heavy Metal Resistance Genes
The tarball file, all.ptt.tar.gz, at the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/all.ptt.tar.gz) contains all the NCBI Protein Table (.ptt) files for bacteria. Each .ptt file is a tab-delimited file listing all the proteins of a genome together with the information collected for them from the genome's GenBank file. In particular, the CDS annotation (i.e., the Product field) in each .ptt file was searched for heavy metal related terms. A keyword, for example "heavy metal," was used as a search query to recognize genes whose annotation contains the searched keyword(s). Both the names and the symbols of the 18 most common metals were included in the search command in order to identify any possible combination of annotations within the database. The complete list of keywords used in the search is given in Table 1.
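The keyword search described above can be illustrated with a minimal Python sketch. It is not the command used in this study: the keyword list is a hypothetical three-term stand-in for Table 1, the sample records are invented, and the sketch assumes the usual nine-column .ptt layout in which Product is the last column and PID the fourth.

```python
import re

# Hypothetical stand-in for the Table 1 keywords (metal names and symbols).
KEYWORDS = ["heavy metal", "cadmium", "Cd"]


def build_pattern(keywords):
    """Compile one case-insensitive, whole-word pattern from all keywords."""
    # Word boundaries stop a short symbol such as "Cd" matching inside a longer word.
    # Caveat: case-insensitive matching would let a symbol like "As" hit the word "as".
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def find_metal_genes(ptt_lines, pattern):
    """Return (PID, Product) pairs whose Product annotation matches the pattern."""
    hits = []
    for line in ptt_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the title line, the protein-count line, and the column header.
        if len(fields) < 9 or fields[0] == "Location":
            continue
        product = fields[8]  # assumed column order: ..., COG, Product
        if pattern.search(product):
            hits.append((fields[3], product))
    return hits


# Invented records that mimic the layout of a .ptt file.
SAMPLE = [
    "Example genome, complete genome - 1..5000\n",
    "3 proteins\n",
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n",
    "1..300\t+\t99\t1001\t-\tX_0001\t-\tCOG2217P\theavy metal translocating P-type ATPase\n",
    "400..900\t-\t166\t1002\t-\tX_0002\t-\t-\thypothetical protein\n",
    "1000..1500\t+\t166\t1003\tczcD\tX_0003\t-\tCOG1230P\tcobalt-zinc-cadmium resistance protein\n",
]

if __name__ == "__main__":
    for pid, product in find_metal_genes(SAMPLE, build_pattern(KEYWORDS)):
        print(pid, product)
```

Matching on the Product column alone mirrors the text's focus on the CDS annotation; a search over whole lines would also pick up gene names and synonyms.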
2.3. Distribution of Heavy Metal Genes across Bacterial Species
The total number of heavy metal genes identified in each bacterial group was divided by the total number of genes within the group to obtain the frequency of heavy metal related genes. The distribution was calculated for each bacterial group in which heavy metal related genes were identified. Additionally, the heavy metal gene distribution of the Proteobacteria, out of all bacteria analyzed (2489 in total), was determined by counting only those bacteria belonging to the Proteobacteria group. Furthermore, the number of species, as well as the number of genes related to heavy metal tolerance, function, or annotation within these species, was counted in each of the five subgroups of Proteobacteria (α Proteobacteria, β Proteobacteria, γ Proteobacteria, δ Proteobacteria, and ε Proteobacteria).
Table 1. Key words used in heavy metal gene identification. YES designates a result retrieved. N/A designates no result retrieved.
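The frequency calculation of Section 2.3 can be sketched as follows. The group names and counts are hypothetical placeholders, not values from this study, and the second function (each group's share of all hits) is an assumption about how group percentages could be derived from the same counts.

```python
def gene_frequencies(metal_counts, total_counts):
    """Heavy metal genes in a group divided by all genes in that group."""
    return {
        group: metal_counts[group] / total_counts[group]
        for group in metal_counts
        if total_counts.get(group, 0) > 0  # skip groups with no gene total
    }


def share_of_hits(metal_counts):
    """Fraction of all heavy metal genes that falls in each group."""
    total = sum(metal_counts.values())
    return {group: count / total for group, count in metal_counts.items()}


if __name__ == "__main__":
    # Hypothetical counts for two made-up groups.
    metal = {"GroupA": 30, "GroupB": 10}
    total = {"GroupA": 1000, "GroupB": 500}
    print(gene_frequencies(metal, total))
    print(share_of_hits(metal))
```

The two measures answer different questions: the first normalizes by group size, while the second only describes where the identified genes are concentrated.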
2.4. Analysis of R. sphaeroides Genome
The amino acid FASTA files for Chromosome I, Chromosome II, Plasmid A, Plasmid B, Plasmid C, Plasmid D, and Plasmid E of R. sphaeroides were downloaded from the NCBI FTP site. The amino acid FASTA sequences of all the heavy metal genes were retrieved from NCBI through the Entrez efetch function, with the accession numbers of the heavy metal genes supplied as UIDs. A BLAST (Basic Local Alignment Search Tool) search was performed with each replicon of R. sphaeroides as the query and the heavy metal related genes as the database. The following criteria were applied to filter the search results: amino acid identity >50%, E-value <0.001, and bit score >100.
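The three filtering criteria can be applied programmatically, as in the sketch below. It assumes the BLAST results were saved in the standard 12-column tabular format (identity in column 3, E-value in column 11, bit score in column 12); the output format used in this study is not stated, and the sample rows are invented.

```python
def filter_blast_hits(lines, min_identity=50.0, max_evalue=1e-3, min_bitscore=100.0):
    """Keep hits with identity > 50%, E-value < 0.001, and bit score > 100."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        f = line.split("\t")
        # Assumed tabular columns: qseqid sseqid pident length mismatch gapopen
        # qstart qend sstart send evalue bitscore
        pident, evalue, bitscore = float(f[2]), float(f[10]), float(f[11])
        if pident > min_identity and evalue < max_evalue and bitscore > min_bitscore:
            kept.append((f[0], f[1], pident, evalue, bitscore))
    return kept


# Invented rows: only the first satisfies all three thresholds.
SAMPLE_ROWS = [
    "q1\ts1\t62.5\t200\t75\t0\t1\t200\t1\t200\t1e-50\t250",
    "q2\ts2\t45.0\t180\t99\t2\t1\t180\t5\t184\t1e-20\t120",
    "q3\ts3\t80.0\t40\t8\t0\t1\t40\t1\t40\t0.01\t35.0",
]

if __name__ == "__main__":
    for hit in filter_blast_hits(SAMPLE_ROWS):
        print(hit)
```

All three comparisons are strict, matching the ">" and "<" signs in the stated criteria.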
2.5. Analysis of Cluster of Orthologous Group Functions
Once the heavy metal tolerance genes had been identified within R. sphaeroides, their functional annotation was analyzed. The protein record of each identified heavy metal tolerance gene was accessed and its cluster of orthologous groups (COG) assignment was analyzed. The information was organized into major COG groups and minor COG subgroups, as depicted in Figure 1 and Figure 2.
3. Results and Discussion
Of the 2526 species in the database, 2489 were found to have genes associated with heavy metal transport, reduction, and/or resistance. A total of ~170,000 genes related to heavy metal tolerance were identified, across bacterial species, using the key term searches as shown in Table 1.
Figure 1. Distribution of heavy metal tolerance genes identified across both chromosomes (CI, CII) and plasmids of R. sphaeroides among the major COG groups.
Figure 2. Distribution of heavy metal tolerance genes identified across both chromosomes (CI, CII) and plasmids of R. sphaeroides among the minor COG groups.
A total of 2489 bacterial genomes were analyzed, and the Proteobacteria were found to contain the most genes related to heavy metal resistance. This widely diverse group of bacteria is well suited to the study of heavy metal bioremediation because of its large number of genes associated with heavy metal tolerance, which encode transporters, sensor proteins, transcriptional regulators, and oxidoreductive enzymes. Because of the presence of heavy metal related genes within these genomes, the Proteobacteria group has been extensively studied under metal stress conditions. Analyzing the distribution of heavy metal genes across bacterial species also yields the total number of heavy metal genes. Given the high level of interest in the toxic effects of metal contaminants, it will be useful to set up a directory of heavy metal related genes across bacterial species. Such a directory would provide a reference database against which organismal genomes can be compared. Many bacterial genomes have been completely sequenced and annotated with several heavy metal related genes, but those genes are not currently stored in a separate database available for analysis. If the annotations related to heavy metals were collected and organized into a database, it could be used for further analysis of organisms thought to be capable of bioremediation.
3.1. Distribution of Heavy Metal Related Genes within Bacteria
After extracting all of the .ptt files that contain the key words listed in Table 1, the files were arranged by bacterial taxonomic groups in order to visualize the distribution of heavy metal genes. While the entire bacterial database was analyzed, only the groups that contained the key words were extracted. The groups, which were separated by order, were compiled based upon the total number of heavy metal genes and the total genes within the bacterial group. The distribution of the heavy metal genes is shown in Figure 3.
The two main groups, Proteobacteria and Terrabacteria, contain 46% and 39% of the heavy metal related genes, respectively. Proteobacteria comprises a wide array of bacterial species with diverse metabolic pathways, a large number of which are photosynthetic, making them ideal candidates for bioremediation. The Terrabacteria group is also widely diverse and has shown potential in bioremediation studies of environmental hazards, although fewer studies address heavy metal contamination within this group. The high percentage of heavy metal resistance genes in these two major groups of bacteria suggests that the genes may have evolved multiple times; however, their wide distribution also supports the notion that many other bacterial species acquired them by horizontal gene transfer (HGT), which can be validated by further phylogenetic analysis. As the largest share of the heavy metal genes was found in the Proteobacteria, the group was split into the subgroups α Proteobacteria, β Proteobacteria, γ Proteobacteria, δ Proteobacteria, and ε Proteobacteria to analyze the distribution of heavy metal genes within Proteobacteria. For the analysis of the δ/ε subdivisions, the two groups were combined in order to capture all the related heavy metal genes within the NCBI database, as the two groups are often annotated together in the gene and protein files. The distribution of heavy metal genes within the Proteobacteria is shown in Figure 4.
Figure 3. The distribution of heavy metal genes across different bacterial groups. The highest distribution is found within Proteobacteria and Terrabacteria (46% and 39%, respectively). Other bacterial groups identified to contain heavy metal related genes included Euryarchaeota, FCB group, TACK group, PVC group, Spirochaetes, and Thermotogae group in descending order.
The results reveal that γ Proteobacteria harbors the highest frequency of metal resistance related genes, followed by α Proteobacteria, β Proteobacteria, and δ/ε Proteobacteria. Previous studies have also shown that when a heavy metal contaminated area is sampled, the α Proteobacteria are the most prevalent bacterial group. Although the majority of the bacteria found within heavy metal contaminated areas belong to the α Proteobacteria, it is suggested that the heavy metal resistance genes originated within the γ Proteobacteria and then moved to the other subgroups, particularly α Proteobacteria, through horizontal gene transfer. It has been demonstrated that R. sphaeroides, a member of the α Proteobacteria, has in the past acquired genes from the γ Proteobacteria. The high percentage of heavy metal genes (19%) within the α Proteobacteria group suggests that members of α Proteobacteria and γ Proteobacteria previously shared a common niche, which facilitated the horizontal transfer of genes conferring the heavy metal resistance phenotype.
3.2. Heavy Metal Related Genes in R. sphaeroides
Because its interactions with heavy metals and oxides, such as tellurite and arsenic, have been studied extensively, Rhodobacter sphaeroides is an ideal bacterium for studies of heavy metal bioremediation. Rhodobacter sphaeroides belongs to the α Proteobacteria group. As previously mentioned, the α Proteobacteria group has been extensively studied under heavy metal contamination conditions, and analysis of the genome of this bacterium reveals the presence of heavy metal tolerance genes. Because of the presence of these resistance genes, R. sphaeroides is a good model bacterium for further exploring heavy metal bioremediation.
Figure 4. Distribution of heavy metal related genes among different groups of Proteobacteria. The distribution found within the Proteobacteria included the highest occurrence of heavy metal genes found within the γ Proteobacteria, followed by α Proteobacteria, β Proteobacteria, and δ/ε Proteobacteria.
The genome of R. sphaeroides was found to contain a total of 375 heavy metal resistance genes, which are distributed on both chromosomes (CI and CII) as well as on four of the five plasmids (Plasmids A, B, D, E). Plasmid C lacks any heavy metal related genes. The distribution of the heavy metal tolerance genes across the genome of the organism suggests the importance of the role the genes play to the survival of the bacterium under heavy metal stress growth conditions.
3.3. Cluster of Orthologous Group Functions (COGs) of Heavy Metal Tolerance Related Genes
Functional annotation of the heavy metal tolerance genes found within R. sphaeroides shows that the majority (255 genes, ~63%) are metal dependent enzymes or enzymes that reduce metallic compounds to elemental metals. The second largest group (127 genes, ~34%) represents transporters, which include both metal binding proteins and ATPase translocases. As previously mentioned, the main mechanisms of bacterial tolerance involve sequestration, reduction, and transport. The presence of so many heavy metal transporter genes within the genome of R. sphaeroides makes the bacterium a good model system for studying heavy metal tolerance and bioremediation.
The results of the functional COG analysis show that the highest number of heavy metal resistance genes found within R. sphaeroides belongs to the third major group, cellular metabolism (COG 3). In the presence of heavy metal contamination, the tolerance mechanisms include transport of the metallic ions and enzymatic detoxification. As the concentration of the toxic metals rises to a toxic level, or at least to a level at which the bacterium is affected, these tolerance mechanisms would come into effect to reduce the toxic effects of the metals. The functional annotations of the genes within the major group COG 3 are consistent with these mechanisms of tolerance. As the metal contamination increases, the cell must either enzymatically detoxify the metal ions to reduce their toxicity or transport those toxic ions out of the cell. Both of these functions fall within the major COG 3 group.
The first major group (COG 1) is classified as information storage and processing, which includes minor groups related to RNA processing and modification (A), chromatin structure and dynamics (B), translation, ribosomal structure, and biogenesis (J), transcription (K), and replication, recombination and repair (L). The second major group (COG 2) is classified as cellular processes and signaling, which includes the minor groups related to cell cycle control, cell division, and chromosome partitioning (D), cell wall/membrane/envelope biogenesis (M), cell motility (N), post-translational modification, protein turnover, and chaperones (O), signal transduction mechanisms (T), intracellular trafficking, secretion, and vesicular transport (U), defense mechanisms (V), extracellular structures (W), nuclear structures (Y), and cytoskeleton (Z). The third major group (COG 3) is classified as metabolism, which includes minor groups related to energy production and conversion (C), amino acid transport and metabolism (E), nucleotide transport and metabolism (F), carbohydrate transport and metabolism (G), coenzyme transport and metabolism (H), lipid transport and metabolism (I), inorganic ion transport and metabolism (P), and secondary metabolites biosynthesis, transport and catabolism (Q). The fourth major group (COG 4) is classified as poorly characterized and includes minor groups related to general function prediction only (R) and function unknown (S)  .
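The classification above lends itself to a simple lookup. The sketch below maps each minor COG letter to its major group, exactly as listed in the preceding paragraph, and tallies a list of per-gene COG letters. The input list is hypothetical, and counting a gene once for every letter it carries is an assumption rather than the counting rule used in this study.

```python
from collections import Counter

# Major COG groups and their minor-group letters, as listed in the text.
MAJOR_GROUPS = {
    "COG 1: Information storage and processing": "ABJKL",
    "COG 2: Cellular processes and signaling": "DMNOTUVWYZ",
    "COG 3: Metabolism": "CEFGHIPQ",
    "COG 4: Poorly characterized": "RS",
}

# Invert the table so each letter points to its major group.
LETTER_TO_MAJOR = {
    letter: group for group, letters in MAJOR_GROUPS.items() for letter in letters
}


def tally_major_groups(cog_letters):
    """Count genes per major group; a multi-letter gene counts once per letter."""
    counts = Counter()
    for letters in cog_letters:
        for letter in letters:
            group = LETTER_TO_MAJOR.get(letter)
            if group is not None:  # ignore unknown or missing codes such as "-"
                counts[group] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical COG letters for five genes.
    for group, n in tally_major_groups(["P", "CP", "K", "S", "-"]).items():
        print(group, n)
```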
The results of the COG analysis suggest that a metabolic component (COG 3) may be responsive under heavy metal contaminated conditions, and this response can be metal dependent or independent. As an example of a metal specific metabolic response, Salmonella enterica serovar Typhimurium contains a transcriptional regulator, STM0354, which has been shown to detect gold (Au) ions specifically, with high expression levels in the presence of gold ions, particularly the toxic salts. This transcriptional regulator was renamed golS, for gold-resistance sensor, and is closely related to the copper sensing regulators (MerR and CueR) that are identified in E. coli and Salmonella, respectively. Two closely located genes, STM0353 and STM0355, are annotated as a cation transporter ATPase and a copper chaperone, respectively. These two genes have been renamed golT and golB to reflect their interaction with gold ions as well as with the golS gene. The annotation "gold" is not present within the gene database, and consequently the Salmonella genes were not included in the heavy metal gene database compiled in this study. However, when the Salmonella genes STM0353, STM0354, and STM0355 were analyzed against the genome of R. sphaeroides, homologs of these genes were identified.
The distribution of heavy metal genes favors bacteria within Proteobacteria, particularly γ Proteobacteria and α Proteobacteria. Many studies have previously examined bacteria within Proteobacteria for their response to heavy metals such as arsenic and mercury, but studies with gold have not been as extensive. Since R. sphaeroides belongs to the α Proteobacteria and gold specific genes have been identified within its genome, this bacterium was chosen as the model organism for this study as well as a potential organism for the bioremediation of heavy metals.
In conclusion, the largest share of the heavy metal tolerance related genes is found within Proteobacteria, specifically within the subgroup γ Proteobacteria. This group of bacteria has been extensively studied under metal contamination, and its members are good candidates for the bioremediation of toxic metals. The organism R. sphaeroides contains 375 heavy metal related genes that may be used for further analysis of heavy metal tolerance. The results of this study show the benefit of using bioinformatics approaches to complement biological experiments, as heavy metal gene identification provides further insight into the mechanisms of metal tolerance within organisms such as R. sphaeroides. Future work will include whole genome expression analysis under different growth conditions, with and without heavy metal contamination. High expression of genes under the corresponding heavy metal contamination will allow further identification of specific heavy metal related genes, whose involvement can also be validated by mutant analysis.