S-layers are among the most commonly observed prokaryotic cell surface structures. They are composed of two-dimensional arrays of proteinaceous subunits (S-layer proteins), are present in almost every taxonomic group of walled Bacteria, and are almost universal in Archaea. S-layer proteins are among the most abundant biopolymers on our planet, as they account for approximately ten percent of cellular proteins in Archaea and Bacteria. They are generally composed of a single molecular species that assembles on the cell surface into closed regular arrays. Because of their repetitive, identical physicochemical properties and their pores of identical size and morphology, S-layers can function as protective coats, molecular sieves, molecule and ion traps, promoters of cell adhesion, immunomodulators, and surface recognition structures.
In Gram-positive bacteria, S-layers are attached to the rigid peptidoglycan-containing layer and completely cover the cell surface during all stages of cell growth and division. Chemical and genetic analyses of many S-layers have revealed a similar overall composition. They are generally composed of a single protein or glycoprotein species with a molecular mass ranging from 40 to 170 kDa. Most bacterial S-layers are composed of weakly acidic proteins or glycoproteins, contain 40% - 60% hydrophobic amino acids, and possess few or no sulfur-containing amino acids. The pI values of these proteins typically range from 4 to 6, although S-layer proteins with pI values between 8 and 10 have been described in lactobacilli. Comparative studies on S-layer genes of organisms from different taxonomic affiliations revealed that sequence similarity between unrelated organisms is low, even though their amino acid compositions show no significant differences. Nevertheless, common structural principles must exist in S-layer proteins (e.g. the ability to form inter-subunit bonds and to self-assemble into monomolecular arrays, the formation of hydrophilic pores with low unspecific adsorption, and the interaction with underlying cell envelope components).
Bifidobacteria are generally recognized as safe (GRAS), exert many beneficial health effects on their host, and have attracted strong interest in the health care and food industries. Although a large amount of knowledge has accumulated on the structure, assembly, chemistry, and genetics of S-layers, few data are available about their presence in bifidobacteria. In this study, we surveyed the distribution of S-layers in bifidobacteria and examined their genetics and structures using bioinformatic approaches.
2. Materials and Methods
2.1. Sequence Search
The S-layer protein sequences of bifidobacteria were retrieved from NCBI Identical Protein Groups (IPG) with the key words “S-layer domain protein AND Bifidobacterium” (https://www.ncbi.nlm.nih.gov/). The resulting 49 protein sequences, annotated as either “S-layer (domain) protein” or “putative S-layer (y) domain protein”, were used for primary analysis. A second search was conducted with Domain Enhanced Lookup Time Accelerated BLAST (DELTA-BLAST), using the longest consensus regions as queries and an expect threshold of 4.0. The queries for S-layer (domain) protein (P146, YVNFGKGD, 8 aa) and putative S-layer (y) domain protein (P277, QLVTWVESHDNYAN, 14 aa) were obtained from local ClustalW multiple alignments with the consensus threshold set at 100%.
2.2. Multiple Alignment and Phylogenetic Analysis
Protein sequences of S-layer (domain) proteins and putative S-layer (y) domain proteins were aligned separately with a local installation of ClustalW version 2.0 using the progressive method. Sequences that were too short or differed significantly from the others were removed from the final alignment. A consensus region was defined as a run of at least four consecutive amino acids that are identical in all aligned sequences (100% threshold). Residues upstream of the first or downstream of the last consensus region were deleted. The remaining sequences were used to construct a phylogenetic tree by the neighbor-joining method, using the Protein-Dist program incorporated in BioEdit (version 6.0) with 1000 bootstrap replicates, as described elsewhere.
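The consensus-region rule and the trimming step described above can be illustrated with a minimal Python sketch. It is not the ClustalW or BioEdit procedure itself: the function names and the toy alignment are hypothetical, and gaps are assumed to be written as "-" in equal-length aligned strings.

```python
from typing import List, Tuple


def consensus_regions(aligned: List[str], min_len: int = 4) -> List[Tuple[int, int, str]]:
    """Return (start, end, residues) for runs of fully conserved columns.

    A column counts as conserved only if every sequence carries the same
    residue there and that residue is not a gap (the 100% threshold).
    `end` is exclusive; runs shorter than `min_len` are discarded.
    """
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(s) != width for s in aligned):
        raise ValueError("all aligned sequences must have the same length")

    regions = []
    run_start = None
    for col in range(width + 1):  # the extra step flushes a run at the end
        conserved = (
            col < width
            and aligned[0][col] != "-"
            and all(s[col] == aligned[0][col] for s in aligned)
        )
        if conserved and run_start is None:
            run_start = col
        elif not conserved and run_start is not None:
            if col - run_start >= min_len:
                regions.append((run_start, col, aligned[0][run_start:col]))
            run_start = None
    return regions


def trim_to_consensus(aligned: List[str], min_len: int = 4) -> List[str]:
    """Keep only the span from the first to the last consensus region."""
    regions = consensus_regions(aligned, min_len)
    if not regions:
        return list(aligned)
    first_start, last_end = regions[0][0], regions[-1][1]
    return [s[first_start:last_end] for s in aligned]


# Toy alignment: hypothetical sequences built around the P146 motif YVNFGKGD.
toy = [
    "MKA-YVNFGKGDTTLQWVES",
    "MRAAYVNFGKGDSALQWVES",
    "-KAQYVNFGKGDTSLQWVES",
]
print(consensus_regions(toy))
print(trim_to_consensus(toy))
```

In the toy alignment the single conserved column at position 2 is ignored because it is shorter than four residues, while the two longer runs are reported and define the trimming boundaries.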
2.3. Physicochemical Analysis and Motif Scan
Representative sequences, namely the S-layer proteins of B. thermophilum RBL67 (Accession: AGH41482.1), B. pseudocatenulatum LMG10505 (Accession: KFI75572.1), and B. longum DJO10A (Accession: ACD98337.1), belonging to Clusters 1, 2, and 3, respectively, were analyzed with the ProtParam tool at ExPASy (http://web.expasy.org/protparam/). The subcellular localization of the S-layer proteins was predicted with PSORTb v3.0.2 (http://www.psort.org/psortb/index.html). All motifs in S-layer (domain) proteins were screened with the MOTIFS program (http://www.genome.jp/tools/motif/). The representative sequence of each cluster was used as an example to illustrate conserved and/or unique structural motifs. The search was run against the Pfam library with profile hidden Markov models and an E-value cutoff of 1.0. The motifs were analyzed by comparing their structural characteristics as annotated in PDB (https://www.rcsb.org/).
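For readers who want to see how some of the reported physicochemical parameters are defined, the following Python sketch computes the aliphatic index, GRAVY, and charged-residue counts from a sequence. It follows the standard published definitions (Ikai's aliphatic index and the Kyte-Doolittle hydropathy scale) rather than calling ProtParam, and the function names and test peptide are illustrative assumptions only.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _check(seq: str) -> str:
    """Upper-case the sequence and reject empty or non-standard input."""
    seq = seq.upper()
    if not seq or any(aa not in KYTE_DOOLITTLE for aa in seq):
        raise ValueError("sequence must contain only the 20 standard residues")
    return seq


def gravy(seq: str) -> float:
    """Grand average of hydropathicity: mean Kyte-Doolittle value."""
    seq = _check(seq)
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq: str) -> float:
    """Aliphatic index = X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
    where X is the mole percent of each residue."""
    seq = _check(seq)

    def mole_percent(aa: str) -> float:
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def charged_residues(seq: str) -> tuple:
    """Return (negative, positive) counts: Asp + Glu and Arg + Lys."""
    seq = _check(seq)
    return seq.count("D") + seq.count("E"), seq.count("R") + seq.count("K")


# Hypothetical test peptide, not one of the S-layer proteins studied here.
peptide = "AVILDEKR"
print(gravy(peptide), aliphatic_index(peptide), charged_residues(peptide))
```

A sequence with more Asp and Glu than Arg and Lys residues, as reported for the representative proteins, would return a first count larger than the second.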
2.4. Analysis of Functional and Structural Regions
Potential signal peptide (SP) sequences of all S-layer proteins were analyzed using SignalP version 4.1 and TATFIND. Potential trans-membrane helices were predicted with TMHMM Server v. 2.0 (http://www.cbs.dtu.dk/services/TMHMM/). Conserved domains were identified with the Batch Web CD-Search Tool using default settings.
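TMHMM relies on a hidden Markov model that cannot be reproduced in a few lines. As a much cruder illustration of how hydrophobic stretches can be flagged as trans-membrane candidates, the sketch below uses the classical Kyte-Doolittle sliding-window rule (window of 19 residues, mean hydropathy of at least 1.6). This is a different and simpler technique than the one used in this study, and the function name and test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_segments(seq, window=19, threshold=1.6):
    """Return merged (start, end) spans, end exclusive, covered by windows
    whose mean Kyte-Doolittle hydropathy is at least `threshold`.

    This is a heuristic screen, not a substitute for TMHMM.
    """
    seq = seq.upper()
    segments = []
    for start in range(len(seq) - window + 1):
        values = [KYTE_DOOLITTLE[aa] for aa in seq[start:start + window]]
        if sum(values) / window < threshold:
            continue
        if segments and start <= segments[-1][1]:
            # Overlaps or touches the previous span: extend it.
            segments[-1][1] = start + window
        else:
            segments.append([start, start + window])
    return [tuple(span) for span in segments]


# Hypothetical sequence: a 25-residue leucine stretch flanked by aspartates.
example = "D" * 30 + "L" * 25 + "D" * 30
print(hydrophobic_segments(example))
```

Because every passing window is reported in full, the returned span is wider than the hydrophobic core itself; a stricter boundary definition would be needed before comparing it with TMHMM output.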
2.5. Structural Modeling of S-Layer (Domain) Protein
The sequences of representative S-layer (domain) proteins from each cluster of the phylogenetic tree were searched for their closest homologues in the Protein Data Bank (PDB) using the NCBI BLASTp search program with the DELTA-BLAST algorithm (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp). In addition, each of the three sequences was submitted separately as a query for structure prediction. Five homology models were obtained for each sequence from the RaptorX server (http://raptorx.uchicago.edu/StructurePrediction). The models were evaluated and the best model was chosen based on its P-value and score. The superposed structures were visualized in ribbon style using the web-based 3-D structure viewer iCn3D (https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html). Furthermore, homology models of these three proteins were built separately with SWISS-MODEL (https://swissmodel.expasy.org/). Finally, model quality was estimated with QMEANBrane (https://swissmodel.expasy.org/qmean/).
3. Results and Discussion
3.1. Distribution of S-Layer (Domain) Proteins in Bifidobacterium
The search for S-layer proteins in the NCBI Identical Protein Groups yielded 49 hits belonging to bifidobacteria. The sequences annotated as either S-layer (domain) protein or putative S-layer (y) domain protein were downloaded. Basic information on these proteins is summarized in Table 1 and listed by species. S-layer and putative S-layer proteins were found in only 26 known species and 1 uncharacterized species of the genus Bifidobacterium. With reference to the most recent bacterial classification and sequence records, no (putative) S-layer homologues could be identified in the other 30 species of the genus (https://www.dsmz.de/fileadmin/Bereiche/ChiefEditors/BacterialNomenclature). However, we noticed that many NCBI-IPG hits contain more than one record with an identical sequence but a different name or annotation. Therefore, we performed a secondary search of the non-redundant protein sequence database by DELTA-BLAST, using the two consensus regions P146 and P277 as separate queries. The results suggest that, in nearly all species of Bifidobacterium, homologues of S-layer (domain) proteins are more often annotated as either “hypothetical protein” or “ABC transporter permease”, while homologues of putative S-layer (y) domain proteins are mainly annotated as “hypothetical protein” or “alpha amylase”. Furthermore, we found more than one (putative) S-layer (domain) protein in several species/strains. For example, B. scardovii LMG 21589 has two S-layer (domain) proteins; B. choerinum LMG 10510 has two putative S-layer (y) domain proteins; and B. gallicum LMG 11596 has one S-layer (domain) protein and as many as six putative S-layer (y) domain proteins.
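The observation that identical sequences carry different annotations can be checked programmatically. The Python sketch below groups records by sequence and reports the groups whose members disagree on annotation. It is only an illustration: the record layout, the function names, the placeholder accessions, and the short toy sequences are all hypothetical and are not taken from the IPG results.

```python
from collections import defaultdict


def group_identical(records):
    """Group (accession, annotation, sequence) records by identical sequence."""
    groups = defaultdict(list)
    for accession, annotation, sequence in records:
        groups[sequence.upper()].append((accession, annotation))
    return dict(groups)


def conflicting_annotations(groups):
    """Keep only the groups whose members carry more than one annotation."""
    return {
        seq: members
        for seq, members in groups.items()
        if len({annotation for _, annotation in members}) > 1
    }


# Hypothetical records; accessions and sequences are placeholders.
records = [
    ("ACC_1", "S-layer domain protein", "MKAQYVNFGKGD"),
    ("ACC_2", "hypothetical protein", "MKAQYVNFGKGD"),
    ("ACC_3", "ABC transporter permease", "mkaqyvnfgkgd"),
    ("ACC_4", "S-layer domain protein", "MRTQLVTWVESH"),
]

groups = group_identical(records)
for seq, members in conflicting_annotations(groups).items():
    print(seq, members)
```

Grouping on the upper-cased sequence string is the simplest possible key; for large data sets a hash of the sequence could be used instead.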
3.2. Conservation and Phylogeny of S-Layer (Domain) Protein
Multiple alignments of 24 S-layer (domain) proteins yielded several consensus regions. The longest consensus region, YVNFGKGD, was designated P146. As shown in Figure S1, S-layer proteins are well conserved across 14 different species of Bifidobacterium, with an overall identity above 50% (80 identical residues among the 150 analyzed). Phylogenetic analysis of the 24 S-layer (domain) protein sequences groups them into three distinct clusters, with the majority of species in Cluster-2 (Figure 1). A similar analysis of putative S-layer (y) domain sequences from nine species of bifidobacteria suggests that they are also conserved and can likewise be grouped into three clusters. We next examined the S-layer (domain) sequence clusters according to habitat and found that most of the sequences in Cluster-2 come from species living as endosymbionts in the gastrointestinal tract of humans and other animals, while the sequences in Cluster-3 mainly belong to B. longum, which is frequently isolated from human feces (http://www.bacterio.net/bifidobacterium.html).
Table 1. Bifidobacterial S-layer (domain) proteins in NCBI-identical protein groups.
3.3. Physicochemical Features and Motifs of S-Layer (Domain) Proteins
ProtParam computation for the representative S-layer protein of each cluster indicates that they are stable proteins with high aliphatic indices and similar pI values (see Table 2 for details). All of them have many more negatively charged residues than positively charged residues. PSORTb analysis suggests that the S-layer protein of B. thermophilum RBL67 is localized in the cytoplasmic membrane (localization score 9.87/10). In contrast, the S-layer proteins of B. longum DJO10A and B. pseudocatenulatum LMG 10505 may have multiple localization sites, as their localization scores are 3.33/10 for each of the cytoplasmic membrane, the cell wall, and the extracellular space.
Figure 1. Phylogenetic analysis of S-layer domain proteins in bifidobacteria. The tree of S-layer (domain) proteins was constructed with the BioEdit Protein-Dist-Neighbor Phylogenetic Tree program with 1000 bootstrap replicates.
Table 2. Physicochemical parameters of representative S-layer proteins.
No., number; GRAVY, Grand average of hydropathicity.
A MOTIFS search of the S-layer (domain) sequences extracted from NCBI-IPG recognized numerous motifs in each sequence. For simplicity, motifs in the representative sequences were compared at an E-value of 0.01 (Figure 2). The motifs CARDB and TMEM154 are present in all representative sequences, and each S-layer protein also has unique motifs. At an E-value of 1.0, however, a much larger number of motifs is detected. Comparative analysis reveals both motifs common to all three clusters and motifs unique to each cluster. DUF4381 is the only universal motif among all species of bifidobacteria, though its function is unknown. CARDB is the second most widely distributed motif, found in 23 of 24 sequences. CARDB (PF07705) represents a cell adhesion related domain found in bacteria. Its presence is consistent with the reported adhesion function of S-layer proteins in bifidobacteria. EphA2_TM, the left-handed dimeric trans-membrane domain of Ephrin receptors, is also widely present in bifidobacteria (22 of 24 sequences). Ligand binding by Ephrin receptors leads to contact-dependent bidirectional signaling into neighboring cells, which may contribute to the probiotic effects of bifidobacteria through antagonism. Furthermore, analysis of the structural characteristics of these motifs indicates some important properties of S-layer proteins. For example, the structural motif corresponding to the first α-helix of the S-layer protein is conserved in all clusters.
Figure 2. Simplified representation of motifs in representative S-layer (domain) proteins.
3.4. Functional and Structural Regions of S-Layer (Domain) Proteins
Signal peptides (SPs) direct protein secretion across the cell membrane, and S-layer (domain) proteins need such a structural element to reach their subcellular location. SignalP-TM prediction with the Gram-positive bacteria model indicates that most S-layer (domain) proteins, 23 of the 24 analyzed, have a potential Sec-dependent SP (a representative example is shown in Figure 3(a)). However, no Tat-dependent SP was found in any of the sequences analyzed. TMHMM Server v. 2.0 prediction suggests that all S-layer (domain) proteins have trans-membrane (TM) helices at both the N-terminus and the C-terminus, although the N-terminal TM region probably corresponds to the signal peptide (Figure 3(b)). The conserved domain search suggests that COG1361 (PSSMID 224280) and CARDB (PSSMID 285007) super-family domains are present in most of these sequences, though 7 proteins have no domain hits (Figure S2). COG1361 is described either as an uncharacterized conserved protein domain or as an S-layer domain involved in cell envelope and outer membrane biogenesis. The CARDB domain is a cell adhesion related domain widely found in bacteria.
3.5. Structural Modeling of Representative S-Layer Proteins
BLASTp searches of the PDB database with the three representative sequences yielded the same result. All sequences matched Chain A of nigritoxin from Vibrio nigripulchritudo (PDB ID: 5M41), with 33% identity over 45 aa. However, this match covers only a small part of the structure, as only 141 aa were included in the model. Therefore, we next generated homology models of the representative S-layer (domain) protein sequence from each cluster. Five models were generated for each sequence, and the best predicted model was selected and is shown in Figure 4. Multiple structural alignments of the homology models show that all proteins contain numerous β-meander motifs, which build up β-barrel architectures and are linked together by hairpin loops. Furthermore, motifs with the same secondary structural elements show a similar spatial orientation, and all N-terminal regions (shown in blue) contain α-helices. In contrast, α-helices are absent from the C-terminal region (shown in orange) in Cluster-1 (Figure 4(a), B. thermophilum). The loop regions between the first α-helix and β-sheet (counted from the N-terminus, shown in light blue) and the regions after the last β-sheet or α-helix (shown in red) show obvious structural deviations. The electrostatic surface potential was also analyzed using the predicted structures of the S-layer domain proteins, which revealed a patch of negative potential on both β-sheets and α-helices.
Figure 3. Analysis of functional and structural regions in S-layer proteins: (a) a representative signal peptide of an S-layer protein (CUN56955.1) predicted by SignalP 4.1 using the N-terminal 70 amino acids. The violet line is the default cutoff (score = 0.5). The red line is the C-score (raw cleavage site score). The green line is the S-score (signal peptide score). The blue line is the Y-score, which is a combination of the C-score and the slope of the S-score; (b) TMs of a representative S-layer protein (CUN56955.1) predicted by TMHMM 2.0.
Figure 4. Homology models of the representative S-layer protein of each cluster: (a) a structure model of the S-layer protein of B. thermophilum; (b) a structure model of the S-layer protein of B. pseudocatenulatum; (c) a structure model of the S-layer protein of B. longum. Each model displayed is the one predicted by the RaptorX server with the highest score and the lowest P-value. N-terminal regions are in blue, and C-terminal regions are in red.
In this study, we investigated the distribution of S-layer domain proteins in Bifidobacterium from a phylogenetic and structural perspective. Phylogenetic analysis of all annotated S-layer protein sequences grouped them into three distinct clusters. (Putative) S-layer (y) domain proteins are annotated in fewer than half of the species of bifidobacteria, though they have several conserved regions and their longest consensus sequences, P146 and P277, are common to nearly all species of bifidobacteria. S-layer proteins have different motifs and domains that are either involved in cell envelope and outer membrane biogenesis or related to cell adhesion. Furthermore, nearly all S-layer (domain) proteins (23 of 24) have a typical signal peptide sequence, and all have a C-terminal trans-membrane region. Analysis of homology models of representative sequences revealed cluster-specific structural properties of S-layer proteins.
This study was supported by a cooperation grant (No. 172102410055) from the Henan Agency of Science and Technology and a grant of Key Scientific Research Project (No. 17A180017) from the Henan Province Department of Education. The funders had no role in the study design, data collection and analysis, or decision to publish.
Figure S1. Partial representation of the ClustalW alignment of 24 S-layer (domain) proteins. Conservation is indicated by consensus amino acids at a threshold of 100%.
Figure S2. Schematic representation of conserved domains in S-layer proteins of different species belonging to Bifidobacterium. Regions labeled in yellow are domain COG1361. Regions labeled in green are domain CARDB.