Received 4 February 2016; accepted 28 March 2016; published 31 March 2016
Protein kinase (PK) mediated phosphorylation of proteins occurs specifically on phosphorylated protein segments (PPS), whereas down-regulation of phosphorylated sites is specifically mediated by phosphatases. PPS can markedly influence the reactivity or interactivity of distinctly located functional sites of the same molecule or molecular complex via allosteric effects. In addition, PPS represent functionally important structural elements of regulatory cascades and complex interconnected networks due to their presence in PK and phosphatases or in molecules involved in triggering or modulation of the same enzymes. Though the lengths of PPS mostly do not exceed ten amino acid residues (aa), these segments are specifically recognized by different PK even in cases of small sequence differences between PPS.
A comparably diversified specificity of short chains is also necessary for interactions of the complementarity-determining (hypervariable) regions (CDR) of variable immunoglobulin domains (IgV). In accordance with this fact, it is interesting that N-terminal parts of conserved IgV (called here PPSIg) containing the hypervariable region CDR1 (and also related parts of certain conserved constant immunoglobulin domains) exhibited a sequence relationship to commercially accessible peptidic substrates or inhibitors of protein kinases (cf. PKSI in previous papers). The described sequence relationships were further supported by the successful prediction of PPS within various BLAST-accessible conserved IgV of antigenic receptors (AR). The superior prediction scores were mostly observed in two sites of these IgV. One site was located in PPSIg (the corresponding paper also contains a short recent summary of the corresponding investigation).
Due to their phylogeny, IgV domains of AR are adapted to the effects of somatic sequence changes including hypermutation. In addition, these domains contain at least several types of structures or candidate structures supporting these somatic changes (cf. following paragraph). This raises the question of whether at least some of the two strands of translated NS (NS denotes here nucleotide sequences of the corresponding cDNA) that are closely similar to NS representatives of (conserved) PPSIg, but encode PPS of molecules distinct from AR, can undergo IgV-related somatic changes with functional consequences. To answer this question step by step, a newly focused bilingual approach searching for PPSIg-related NS encoding peptidic PPS was proposed. More precisely, we look here for NS 1) encoding PPS in proteins distinct from AR, 2) forming sufficiently high and dense similarities when compared with representative mRNA segments highly similar to PPSIg-related segments and 3) containing sites necessary for functionally important hypermutation.
The hypermutation mentioned above is in fact an enzyme-mediated process that alters DNA sequences. First of all, the APOBEC family member AID (activation-induced cytidine deaminase) has been extensively investigated in studies of Ig hypermutation. AID exhibits a tumorigenic effect when it is transfected and constitutively expressed, and it can be found in common somatic cells other than lymphocytes in response to certain activating signals. In accordance with its tumorigenic effect, AID also hypermutates genes encoding proteins different from AR. AID-mediated hypermutation occurs selectively at sites of certain primary and secondary DNA structures. First of all, this concerns the well-known hypermutation motifs (HM) of sequence WRCH and specifically distant WRCH-pairs (W-pairs), which were frequently found in NS encoding immunoglobulins. In addition to this restriction of substrate specificity, an initiating selective alignment between AID and DNA is also necessary for AID activity. Hence, this event does not occur at sites with a sole HM, but only in G-loops, and perhaps also in other secondary DNA structures or W-pairs proposed to trigger a bi-bi-random mechanism of deamination. Besides AID effects, mutations mediated by other APOBEC family members were observed in various non-lymphoid tissues currently not expressing AID. This research identified HM with the TCW sequence as an important target of the APOBEC1, APOBEC3A, APOBEC3B, APOBEC3F and APOBEC3H reactions accompanying carcinogenesis.
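The two hypermutation motifs named in this paragraph, WRCH and TCW, are written in IUPAC nucleotide codes (W = A/T, R = A/G, H = A/C/T). As a minimal illustration only, and not the procedure used in this study, the following Python sketch locates both motifs on both DNA strands. The function names and the example sequence are hypothetical.

```python
import re

# IUPAC codes needed for the two motifs: W = A/T, R = A/G, H = A/C/T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "R": "[AG]", "H": "[ACT]"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def motif_regex(motif):
    """Translate an IUPAC motif into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[ch] for ch in motif)
    return re.compile("(?=(" + body + "))")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq, motif):
    """Return (strand, start, matched_text) for motif hits on both strands.

    Starts are 0-based and always refer to the given (forward) sequence.
    """
    seq = seq.upper()
    pattern = motif_regex(motif)
    hits = [("+", m.start(), m.group(1)) for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    n, k = len(seq), len(motif)
    for m in pattern.finditer(rc):
        # Convert the position on the reverse strand to forward coordinates.
        hits.append(("-", n - m.start() - k, m.group(1)))
    return hits


example = "GGAGCTTCAAGG"  # hypothetical sequence, for illustration only
wrch_hits = find_motif(example, "WRCH")
tcw_hits = find_motif(example, "TCW")
```

Reporting the strand with each hit matters here because the text later distinguishes motifs on transcribed and non-transcribed strands.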
In summary, the goals of this paper consisted in 1) description of structurally and functionally interesting PPSIg-related PPS or PPSIg-related a posteriori reselected sequences predicted as PPS, 2) record of substantial common and extreme features of these molecules and 3) statistical evaluation of their group-related interrelationships. The corresponding search for topical PPSIg-related items occurred in five main steps (Figure 1).
About eighty percent of the predicted or existing PPSIg-related PPS found here comprised phosphorylation sites with serine processed by the Ataxia telangiectasia mutated molecule (ATM), a key regulatory kinase of the DNA double-strand-break response. All displayed PPSIg-related oligonucleotides contained the HM WRCH and TCW at positions critical with respect to aa alterations. These critically located HM mostly occurred in non-transcribed DNA strands. The list of displayed segments contained molecules involved in carcinogenesis, cell division or specific regulatory functions.
2.1. Bioinformatic Tools
Programs of the BLAST family, including conserved domain searches, were mostly accessed through the publicly available NIH gate. Multiple-sequence alignment (MSA) was performed with Clustal W 2.1 on the server of the European Bioinformatics Institute (an older version of the MSA record displayed here was published previously). The server with the FFAS03 program was used for scanning based on the shortest possible profile-profile alignments enabling identification of conserved fold regions (FFAS-scan; cf. WP5.1). Hypermutation motifs were searched in the preformed multi-segment constructs using the EMBOSS program Fuzznuc available on the web page of the Pasteur Institute. Only PPS candidates simultaneously reaching the score limit of 0.800 in two different machine learning programs (neural-network-based NetPhos2.0 and support-vector-machine-related KinasePhos2.0) were selected before the final reselection (cf. Figure 1 and sections WP2.4.1-2). In addition, the alternative use of the databases Phospho.ELM and Phosida enabled us to find experimentally confirmed PPS.
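The double-prediction criterion described above (a score of 0.800 reached simultaneously in both predictors) can be sketched as a simple filter. The record layout, field names and example scores below are hypothetical. The sketch assumes the scores were already obtained from the two programs and that the limit is inclusive.

```python
from dataclasses import dataclass

SCORE_LIMIT = 0.800  # limit stated in the text for both predictors


@dataclass
class Candidate:
    name: str          # hypothetical identifier of a candidate site
    netphos: float     # score reported by NetPhos2.0
    kinasephos: float  # score reported by KinasePhos2.0


def select_double_predicted(candidates, limit=SCORE_LIMIT):
    """Keep only candidates reaching the limit in both predictors."""
    return [c for c in candidates
            if c.netphos >= limit and c.kinasephos >= limit]


# Hypothetical scores, for illustration only.
candidates = [
    Candidate("site_A", 0.995, 0.991),
    Candidate("site_B", 0.985, 0.760),
    Candidate("site_C", 0.800, 0.800),
]
selected = [c.name for c in select_double_predicted(candidates)]
```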
2.1.1. Typical Sequences
Three kinds of sequences were all denoted here as typical sequences: 1) conserved domain sequences (CDS), 2) a phylogenetically interesting actual sequence exhibiting superior similarities with CDS (IgW of clone AAB03680; igw; Figure 2) and 3) an actual sequence achieving repeated superior similarities in initial searches for MNSQ units (consensus-like sequence, i.e. cls, of heavy chain with clonal names AF273898.1 and AAK20241.1; Figure 2). Apart from cls, which was derived additionally, the typical sequences assembled the initial MSA record including two important conserved sequence blocks (CSB1 and CSB2; see Figure 2, sections 2.3 and WP2.2.2).
2.1.2. Statistical Enumerations
These enumerations were performed in accordance with the textbook of Zvárová. In accordance with the basic Bayesian approximation, all groups of values with zero observations presented in 2 × 2 tables were increased by one and then the modified odds ratio value was enumerated as OR*. Fisher exact probability of 2 × 2 tables was enumerated using the Active web page. This web page was used to statistically evaluate the validity of 1) currently enumerated odds ratios (OR) and 2) original 2 × 2 tables determining OR*. Current search limits for long sequence similarities, i.e. 40 bits (a bit score limit for middle range BLAST similarities present in all headings of BLAST records) and p < 0.005 (i.e. 5 × 10^−3), restricted the validity of a) conserved segments of the initial MSA record (Figure 2; cf. BS# and E# in Section 2.2.3) and b) searches for NS representatives of AR (Figure 1; sections 2.3 and WP2.4.2-4). Note that if p < 0.05, then p < E and p ≈ E, including E = E# (cf. a formula of Dembo). Specific statistically derived bit score restrictions of searches for short conserved-domain-related and MNSQ-derived similarities (MNSQ denotes multi-nucleotide sequence queries; see below), of lengths comparable with the subsequently searched PPS, are described in Figure 1 and section WP2.3.2.
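As a rough illustration of the 2 × 2 table evaluation described above, the following sketch computes a modified odds ratio and a two-sided Fisher exact probability with the Python standard library only. It is not the web tool used by the authors. It assumes that the zero-cell correction means raising each zero cell to one. The two-sided rule used here (summing all tables no more probable than the observed one) is one common convention and is likewise an assumption.

```python
from math import comb


def modified_odds_ratio(a, b, c, d):
    """OR* for the table [[a, b], [c, d]]; zero cells are raised to one first.

    Assumption: only cells with zero observations are increased by one.
    """
    a, b, c, d = (x if x > 0 else 1 for x in (a, b, c, d))
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact probability for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x observations in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts, for illustration only.
or_star = modified_odds_ratio(5, 0, 2, 4)
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```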
2.1.3. Overall Scheme of Employed Procedures
For the scheme including all main procedures described in this chapter see Figure 1 and SF3.
2.2. Two Types of Consensus Sequences
We distinguish here two types of consensi (see also section WP2.2.1): 1) statistically important σ-consensus containing aa achieving the highest mean column score and 2) LE-consensus composed of aa determining length equivalents (LE) related to individual sequence block column (SBC). Both these types of consensi were differently used in enumerations described below and performed with the minicomputer Casio Algebra 2 PLUS (cf. section WP2.1.4).
Figure 2. Initial multiple sequence alignment, its analysis and sequence derivatives. a) The set of sequences present in this table comprises the variable domain of shark IgW sequence AAB03680 (pre-selected as representative of Ig-ancestor-related sequences of Elasmobranchii origin). The included conserved Ig domains: _AL, _B, _G, _D: TCR-related domains, i.e. alpha-like, beta, gamma and delta, respectively; K_L, L_L, H: kappa or lambda light chain and heavy chain domains, respectively. For further comments see sections 3.1 and 4.1. b) Two main conserved segments of sequence blocks (CSB1 and CSB2) are restricted here using bold characters of the corresponding consensus segments. The positions of the MSEP block segment (related to MNSQ1, MNSQ2) are indicated by gray background. Greek letters in the consensus denote aa differences in LE-/σ-related consensi determined here: α = P/S; β = N/S; γ = Y/E; δ = Y/A; ε = I/T; ζ = S/T; η = I/A; θ = W/S; λ = R/K. It follows that the CSB1-related part of the MSA record determines a unique consensus (consensus 1). c) fin, fip: fuzzy-related intervals in the ranges of negative or positive values, respectively. Numbers 1 - 9 in fip-/fin-related rows: values denoting the hierarchy of fuzzy-related intervals, which represent the degrees of SBC similarities/dissimilarities (cf. Figure 3 and Section 2.2.3); n, p: negative and positive F-values described in Figure 3, respectively; plus and minus present in the fin row under fip value F = 3: presence or absence of the LE-value of at least 1.5 necessary for a candidate CSB edge, respectively (cf. Sections 2.2.4). For significance levels of SBC see sections WP4.1. d) The consensus pm3 is described in Sections 3.1 and WP5.1. Gray and dark gray shading of pm3 segments localizes positions corresponding specifically to CDR1 of light chains (CDR1light) or generally to all CDR1 (CDR1all), respectively. e) For details concerning the displayed structures see Sections 2.3, 3.1 and WP2.2.2.
2.2.1. BLAST Derived Enumeration of LE
Each aa species present in the evaluated SBC determined a non-invasive value, i.e. the i-th aa-related length equivalent candidate (cf. sections WP2.1.1 and WP5.2). This candidate was defined as the integer or non-integer height of an artificial SBC composed only of the i-th aa, provided that the probabilities of this artificial SBC and the actual SBC (evaluated with respect to the selected i-th aa) were the same. In score-related BLAST-derived evaluation of LE, we used conditioned Expect values (to include compared blank values) instead of probabilities (for details see section WP2.1.1). This evaluation finally determined the consistent formula:
where Abs() enumerates absolute value; D is coefficient of column diversity; h is SBC height equal to the number of chains in MSA record; Si represents i-th aa related score in evaluated SBC; is score of i-th aa identity; xi is the number of pointing column and row positions of scores in the employed substitution matrix (PAM30 in our case, cf. limiting SBC heights of ten in Figure 2 and section WP2.1.1); γ denotes column related gap penalty. The following formulas further characterize the Ci, D and γ:
where j denotes positions of the compared aa in SBC; λ = 0.294 is BLAST constant; g is the number of gaps; d is the number of different aa in the same column; DEL and INS are the numbers of deletions and insertions deduced from the number of gaps, respectively; Min selects the minimum enumerated value in agreement with parsimony-related attempts of our approach. For explanation of see and j-values.
2.2.2. Enumeration of Numbers Determining the Degree of Fuzzy-Related Intervals (F or F-Values)
Each SBC-related F value was determined based on the fuzzy-related system (shown in Figure 3, see also WP5.3), implicating a two-step enumeration of F value based on υ value:
if, then, else. (6)
where int denotes an integer value of the enumerated number.
2.2.3. Restriction and Double-Sequence-Related Evaluation of the Conserved Segments of Sequence Blocks (CSB) Proposed for IgV-Related Sequences
The presence of a sequence pattern (briefly pattern determined by SBC containing unique aa species) was required in each proposed CSB including otherwise 16 - 50 neighbor SBC and achieving LE ≥ 1.5 in both its edge SBC. In the proposed approximating fuzzy-related approach, six rules concerned F values: 1) at least 50% of SBC kept F ≥ 4, 2) at least 20% of SBC kept F ≥ 6 and 3) at most 30% of SBC kept F < 0. Among the state-of-the-art conditions, we assume that 4) each list including F-values related to ten neighbor SBC had to contain at least three F ≥ 3, 5) each list including F-values related to five neighbor SBC had to contain at least two F ≥ 1, and 6) at least one of the three following alternative rules holds for the set of SBC composing the evaluated CSB: mean(υi) ≥ 3.5, mean(Fi) ≥ 3.5 or mean(LEi) ≥ 1.75 (cf. Figure 3 or formulas 5 and 6).
Figure 3. Fuzzy-related system based on length equivalents (LE). Nineteen integer F-values from −9 to 9 are displayed here in the middle part of the graph. The following fuzzy-related description holds for the displayed positive F-values: 0: absence of similarity; 1: rampart of randomness; 2: area of “promising” noise; 3: quasi-similar (minimum but weak column similarities); 4: similar (close to “deterministic” double sequence aa identity); 5: quasi-cohesive; 6: cohesive (corresponds to aa identity in three compared chains); 7: quasi-rigid; 8: rigid (aa identity in four compared chains); 9: improved-rigid, i.e. maximum F-value in our single character system. Negative F-values approximate here more than random extent of diversity/variability (cf. the first section of Results and section WP5.3). For MSA-record-compatible two-row entry of LE-derived fuzzy-related intervals indicating extents of SBC similarities see Figure 2.
The selected candidates of CSB were then proved using current BLAST statistics. Hence, the values of double sequence similarity equivalents of bit score (BS#) and Expect value (E#) were enumerated, when assuming a simplified equal validity of all model chains in the evaluated reference MSA record (a more general CSB-chain-related linear combination can also be used, when employing “empirical”, i.e. Monte Carlo-related and database comparison-derived coefficients). The topical corresponding formulas follow:
where ρCSB(S) is score density per single chain; K, λ are BLAST constants (K = 0.11 and λ = 0.294 in given case); BSm is equal to the mean bit score between consensus and chains forming CSB (in contrast to BS#, mentioned above, BSm disregards the negative effects of column diversity, i.e. BSm ≥ BS#; for comparison of BSm and BS# see section 3.1); Dn denotes the coefficient of n-th column diversity; L × h is the length of chains present in MSA record including gaps; N is the number of SBC determining the proposed CSB; -is σ-consensus-related sum of scores concerning n-th SBC; denoted minimized gap penalty with respect to the three considered models (cf. WP2.1.2). In case of our enumeration, we used the simplest alternative formula:
where r is the number of chains with gaps and G denotes the number of all gaps in the evaluated CSB-related segment of the MSA record. The evaluation of gaps is the only procedure in which the evaluations of block-consensus and block-chain comparisons based on this section differ. For limits restricting valid BS# and E# see Section 2.1.2.
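The selection rules that this section states for a candidate CSB (length of 16 - 50 SBC, LE ≥ 1.5 in both edge SBC, and the six F-related rules) can be sketched as a simple checker. This is a minimal sketch under stated assumptions and not the authors' implementation. The F, LE and υ values per column are taken as given inputs, the pattern requirement is passed in as a flag, and a "list of ten or five neighbor SBC" is read as every run of that many consecutive columns. All function names and input values are hypothetical.

```python
from statistics import mean


def windows_ok(values, size, limit, min_count):
    """True if every run of `size` neighbouring columns holds
    at least `min_count` values that are >= `limit`."""
    for start in range(len(values) - size + 1):
        window = values[start:start + size]
        if sum(v >= limit for v in window) < min_count:
            return False
    return True


def share(values, predicate):
    """Fraction of values satisfying the predicate."""
    return sum(1 for v in values if predicate(v)) / len(values)


def is_candidate_csb(f_values, le_values, upsilon_values, has_pattern):
    """Check the stated CSB conditions for one run of neighbouring SBC."""
    n = len(f_values)
    if not has_pattern or not 16 <= n <= 50:
        return False
    if le_values[0] < 1.5 or le_values[-1] < 1.5:   # both edge SBC
        return False
    rules = [
        share(f_values, lambda f: f >= 4) >= 0.50,   # rule 1
        share(f_values, lambda f: f >= 6) >= 0.20,   # rule 2
        share(f_values, lambda f: f < 0) <= 0.30,    # rule 3
        windows_ok(f_values, 10, 3, 3),              # rule 4
        windows_ok(f_values, 5, 1, 2),               # rule 5
        (mean(upsilon_values) >= 3.5                 # rule 6 (alternatives)
         or mean(f_values) >= 3.5
         or mean(le_values) >= 1.75),
    ]
    return all(rules)


# Hypothetical column values, for illustration only.
f_demo = [6, 4, 4, 6, 0, 4, 6, 4, -1, 6, 4, 4, 6, 2, 4, 6, 4, 4]
le_demo = [2.0] * len(f_demo)
accepted = is_candidate_csb(f_demo, le_demo, [float(f) for f in f_demo], True)
rejected = is_candidate_csb([-2] * 18, [2.0] * 18, [-2.0] * 18, True)
```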
2.3. Generation of Two Initial Sequence Supersets Leading Separately to the Construction of Two Different Multi-Sequence Queries (MNSQ1 or MNSQ2) Representing NS of AR
In principle, two types of non-redundant different sequence sets (initial supersets of NS items) were generated, when using two types of TBLASTN records obtained with two different sets of query sequences (QS), each associated with different taxonomical restriction of searches. The first initial superset contained superior similarities with all “double conserved” sequences of CSB1 and consensus 1 (cf. Figure 1(d) in SF3), whereas the second superset comprised only the similarities found with consensus 1, three CSB1 sequences successfully examined in the previous paper  (i.e. cd00099, smart00406 and AAB03680_igv), their consensus-1-related derivatives and cls sequence found in addition during the formation of the second superset (see Sections 2.1.1, WP2.2.2 and Figure 2). The former and latter supersets contained sequences of Elasmobranchii or vertebrate origin, respectively. An iterative process of species- and query-related randomization consisted in 1) maximum number of five distinct species representatives as well as maximum overall numbers of selected items including non-redundant sequence samples in each set separately derived with individual QS and 2) existence of unlimited SIMULT set and hierarchy of QS-records both important for set-related rearrangement of promiscuously similar sequences (for details see WP2.2.3 or SF3). Initial TBLASTN searches employed BLOSUM62 matrix, whereas the following cumulative revision steps used two substitution matrices (BLOSUM62 and PAM30). For restriction of Expect and bit score values see Section 2.1.2.
2.4. Memory Problems in Some Searches with MNSQ1 and MNSQ2
Working memory problems were observed in consumer computers of differing capability when completing too extensive BLAST records obtained with the initial steps of TCA searches and with negatively Entrez-restricted two-step BLASTN searches limited by 35 bits (BNsup35; cf. Figure 1 and WP2.3.1). Consequently, a special compromise processing of the corresponding records was necessary, in which only well accessible but less extended records of items without reports of selected sequence alignments were downloaded. These procedures comprised 1) a simplified total-bit-score-limited approach approximately substituting the inaccessible frequencies of MUSAS (MNSQ-unit derived similarities with almost the same subject sequence positions; Figure 1) in the case of the ternary combined approach (TCA) and 2) partial elimination of item names, immediately recognized as those determining AR, during the first step of BNsup35. For additional information see Altschul and sections WP2.3.1-4, WP5.6 and WP5.7.
2.5. Final Selection and a Posteriori Reselection of the Displayed Items
Both final selection and reselection described in Figure 1 restricted non-redundant terminal list of items displayed in Figure 5 (for details see Sections 2.1, WP2.4.1, WP2.4.2, WP3.2 and WP3.3). Due to an unknown extent of losses following from inefficient assessing of dynamic (allosterically or proteolytically mediated) accessibility of MPL-related peptides, final selection did not include prediction of accessibility.
3.1. Analysis of the Initiating MSA Record and Its Segments
Occurrence of the two conserved block segments (CSB1 and CSB2) was confirmed based on Section 2.2.3, when analyzing the MSA record assembled by Clustal W 2.1 (Figure 2). CSB1 determined the same σ- and LE-consensi (consensus 1) containing the pattern CX(10,13)WXXQXP. CSB1 achieved 1) F ≥ 6 and F < 0 in 51.3% and 12.8% of SBC, respectively, 2) double sequence similarity related bit score BS# = 48.2 bits and consensus 1 related mean score BSm = 54.9 bits and 3) significant Expect value E# = 1.441 × 10^−10 (for the almost equal p value see Sections 2.1.2 or WP4.1). CSB2 was characterized by 54.5% and 18.2% of columns with F ≥ 6 and F < 0, respectively, and contained the pattern LX(8)DX(3)YXC. CSB2 also achieved valid values of BS# = 44.6 bits, E# = 1.461 × 10^−9 and BSm = 50.0 bits. Sixteen of the displayed 120 SBC (Figure 2) were classified as “at most anti-cohesive” (AAC; F ≤ −6; cf. Figure 3). CDR1-, CDR2- and CDR3-related block segments contained one, four and four of the sixteen AAC SBC in Figure 2, respectively. This means that CDR1 substantially differed from CDR2 and CDR3, but not from the surrounding conservative framework regions (CDR1: positions 23 - 35, RCD = 0.89; CDR2: positions 51 - 62, RCD = 3.86; CDR3: positions 97 - 110, RCD = 3.31, where RCD denotes the ratio between column densities of AAC occurrence determined in a) individual hypervariable segments and b) all accessible segments of framework regions).
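The RCD values quoted above can be recomputed from the counts given in this paragraph. The sketch below assumes that "all accessible segments of framework regions" means all of the 120 displayed columns that lie outside the three CDR ranges. Under that assumption the arithmetic gives 0.89, 3.86 and 3.31, which agrees with the reported values. The variable and function names are illustrative.

```python
# Column ranges (inclusive) and AAC counts as stated in the text.
CDR_RANGES = {"CDR1": (23, 35), "CDR2": (51, 62), "CDR3": (97, 110)}
AAC_COUNTS = {"CDR1": 1, "CDR2": 4, "CDR3": 4}
TOTAL_COLUMNS = 120   # displayed SBC in Figure 2
TOTAL_AAC = 16        # SBC classified as AAC (F <= -6)


def rcd_values():
    """Ratio of AAC column density in each CDR to that in the framework.

    Assumption: the framework comprises every column outside the three CDRs.
    """
    cdr_len = {name: hi - lo + 1 for name, (lo, hi) in CDR_RANGES.items()}
    framework_columns = TOTAL_COLUMNS - sum(cdr_len.values())
    framework_aac = TOTAL_AAC - sum(AAC_COUNTS.values())
    framework_density = framework_aac / framework_columns
    return {name: round(AAC_COUNTS[name] / cdr_len[name] / framework_density, 2)
            for name in CDR_RANGES}


rcd = rcd_values()
```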
Paralogues of AR were frequently similar to N-terminal parts of CSB1-related segments of cd00099, cd07706 and cd04980, when using FFAS-scan (Sections 2.1.1 and WP5.1). The similarities between these three selected domain segments then determined the consensus pm3, which was also N-terminally located in CSB1 (Figure 2). Since pm3-derived QS only rarely achieved the required bit score limit of 40 bits (cf. Section 2.2.3) in TBLASTN searches, we did not use pm3 for MNSQ generation. Nevertheless, a pm3-related segment of the MSA record was suitable as an important fold- and sequence-related conserved block segment restricting the extension of PPSIg to the C-terminus (cf. section WP5.5 and PPSIg enveloping segments of MSA record, i.e. MSEP described in Figure 2).
3.2. Selection of MNSQ1 and MNSQ2
Initial selection of sequence items restricted 121 and 169 items forming the starting supersets of MNSQ1 (sequences of Elasmobranchii origin) and MNSQ2 (various vertebrate sequences), respectively. Subsequently, 41 and 108 sequence items passed through reselection with MPQ and anti-redundant procedures, finally assembling MNSQ1 and MNSQ2, respectively. Although Ig items were not given any preference in our searches, sequences of T-cell receptors (TCR) were rare in MNSQ1 (two TCR sequences) and were not present in MNSQ2.
3.3. Paths Selecting MPL
The main paths of MPL selection are described in Table 1. OR-mediated analysis of these paths revealed interesting linkages within BLASTN-derived sets (Figure 4). Each employed OR represented the ratio between the cancer-related-set-derived R ratio (ratio between the numbers of human and mouse items present in the corresponding elements of Table 1) and the similarly derived Q ratios of two species-related pairs of reference sets (BNsup35 and KPO; Figure 4). Eleven OR values indicated strong associations (OR ≥ 2.0) and one OR value indicated a weak association (1.4 ≤ OR < 2.0) of cancer-related NS items with their human origin (Figure 4). These OR values appeared to form a significantly increased set of values (p < 0.05) even when skeptically considering an equal distribution of OR values (as discrete values) above and under the value OR = 2.0 as random and employing the model constant distribution (cf. section WP3.2) or Dirichlet statistics. In accordance with Figure 4, some of the described individual strong associations were also significant (p < 0.05) and/or robust (sample size s ≥ 40 NS). Higher or more valid R and OR values mostly formed the upper MNSQ1-related graph, indicating a certain phylogenic context of this evaluation due to the exclusive Elasmobranchii origin of MNSQ1 units (cf. section b in Figure 4). For additional comments see Section 4.3.
Table 1. Most effective paths of MPL selection.
a) Starting procedures included BLASTN searches or combined BLAST searches (cf. footnote d). NS: nucleotide sequences of the corresponding both strands of cDNA; hu, mu, both: sequences of human, mouse and both species origin, respectively. b) About half of MNSQ-related mRNA segments encoded peptide segments, whereas about one seventh of these mRNA segments encoded existing or predicted PPS. c) All enumerated NS segments with predicted or database-confirmed PPS relationship (denoted here as MPL) fulfilled two conditions: 1) existence of MPL-derived similarities with chains of the initial MSA record overlapping CSB1, 2) presence of at least a single HM* in MPL. For details see Figure 1. d) Independent paths of MPL selection differed in their initial BLAST-derived procedures. Numbers 1 and 2 in strategy names: MNSQ1 and MNSQ2 were used as query sequences, respectively; gray and white background in left column of table elements: paths of prevailing selection of mouse or human sequences, respectively. Selections using sole BLASTN: BN1sup35, BN2sup35: molecules different from antigen receptors achieving top similarities (limit 35 bits) in a set without positive Entrez restriction; C1, C2: cancer-related Entrez restriction (limit 30 bits); KPO1, KPO2: Entrez restriction concerning proteins involved in phosphorylation (limit 30 bits). Combined selections: TCA1TX, TCA2TX: MNSQ1- and MNSQ2-query-derived TBLASTX variants of the ternary combined approach (TCA), respectively. Each of these TCA variants used a) four searches using two different Entrez restrictions in two “subpaths” composed of two cumulative TBLASTX searches differently adjusted with respect to the matrices BLOSUM62 or PAM30 and word sizes three or two, respectively, b) limit for score maxima 25 bits, c) score limit for subject-sequence-related co-localized similarities (denoted as MUSAS) 22 bits when simultaneously requiring occurrence of at least five MUSAS in addition to the segment with maximum score (i.e. limit six for all MUSAS). For additional information see Figure 1, sections 2.4, WP2.3.2-4 and WP5.6-7. e) Due to independent selection paths (forming non-redundant sets), some items sometimes repeat in different sets (cf. also strategy records in Figure 5, sections WP2.3.3 and WP4.2.3).
Figure 4. Odds ratio (OR) mediated linkage analyses. Achievement of the three limits (i.e. sufficient sample size of at least 40 NS items, significant difference p < 0.05 and strong association indicated by OR ≥ 2) is recorded here by different column fillings: black, gray: three or two limits were attained, respectively; gradient of gray: sole sufficient sample size (important in cases of low or no associations); empty columns: other cases of OR; gray in flanks: reference R ratios compared with OR values (see above). (a) (b) Selected results of OR-mediated analysis concern Table 1 (cf. also the corresponding abbreviations) and are commented in section 3.3. Numbers 1 and 2 in abbreviation: NS items found with MNSQ1 and MNSQ2, respectively; suffixes o1, o2 and o3: selection steps 1, 2 and 3 mentioned in Table 1, respectively; B, K: BNsup35 and KPO sets, respectively; C1, C2: sets of cancer-related BLASTN-records containing sequences similar to MNSQ1 (part (a)) or MNSQ2 (part (b)), respectively; R: ratios between numbers of human and mouse NS items present in C1 and C2 sets. (c) Overall (robust) evaluation of strand-related HM* occurrences in MPL displayed in Figure 5. Strong and significant associations of HM* with non-transcribed DNA strands were found. Wp: W-pairs. For details see Abbreviations and Section 3.4.
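The OR evaluation described in this section can be sketched as follows. OR is taken as the ratio of the human-to-mouse ratio R in a cancer-related set to the corresponding ratio Q in a reference set, and is then classified with the stated limits (strong: OR ≥ 2.0; weak: 1.4 ≤ OR < 2.0). The item counts and the label for values below 1.4 are hypothetical and serve only as an illustration. The p value is assumed to come from a separate test.

```python
def species_ratio(human, mouse):
    """Ratio between the numbers of human and mouse items in one set."""
    return human / mouse


def odds_ratio(r_ratio, q_ratio):
    """OR as the ratio between the cancer-set R ratio and the reference Q ratio."""
    return r_ratio / q_ratio


def association(or_value):
    """Classify an OR value with the limits stated in the text."""
    if or_value >= 2.0:
        return "strong"
    if or_value >= 1.4:
        return "weak"
    return "low or none"   # label assumed here; not named in the text


def limits_attained(or_value, p_value, sample_size):
    """Count how many of the three Figure 4 limits are met
    (sample size >= 40 NS items, p < 0.05, OR >= 2)."""
    return sum([sample_size >= 40, p_value < 0.05, or_value >= 2.0])


# Hypothetical counts, for illustration only.
r = species_ratio(human=30, mouse=10)   # cancer-related set
q = species_ratio(human=20, mouse=20)   # reference set
or_demo = odds_ratio(r, q)
label = association(or_demo)
met = limits_attained(or_demo, p_value=0.01, sample_size=80)
```

A set without mouse items would need separate handling (for example the zero-cell correction of Section 2.1.2), since the ratio is undefined there.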
Predicted phosphorylated serines (Ser) achieved scores higher than 0.990, when using both NetPhos 2.0 and KinasePhos2.0. Similarly, NetPhos-related prediction yielded score maxima for threonine (Thr) and tyrosine (Tyr) higher than 0.980, but scores lower than 0.950 were determined during the corresponding KinasePhos-mediated predictions. In accordance with this difference in score maxima, only predicted phosphorylated Ser but not Thr and Tyr were present in the set of double-predicted PPS encoded by MPL. This result led us to complete the MPL set with four new MPL 1) different from the originally found MPL, 2) encoding predicted Thr- and Tyr-related PPS, and 3) achieving a minimum score of 0.950, when using NetPhos 2.0. For details concerning the results of the parallel database search for experimentally confirmed MPL see the first section of Figure 5.
Another type of statistical reevaluation was performed on the subset of MPL-encoded peptides (MEP) immediately containing phosphorylated aa (i.e. uniquely limited by the same score of 0.800 in the two predictions). The constructed histogram and histogram-related distributions based on the product of the resulting two types of scores (s1 × s2) in fact approximated each result of performed double selection as unique value (Figure 6). Unexpectedly, the obtained graphs indicated bimodal-like distribution with a dominant peak located at range of superior score products. This distribution contrasted with the expected random exponential decrease assumed at least in upper half of score product values. Even the comparison with the model constant distribution (stricter evaluation than the assumed random exponential decrease) indicated strong and significant (p < 0.05) association of score products with the restricted region of dominant peak. Based on these facts and considerations concerning Figure 6, we explained the existence of the peak as the consequence of functionally conditioned structural
Figure 5. Existing and predicted antibody-like phosphorylation sites. a) The MPL subsets (subdivided according to types of reselection) were arranged in the separate table segments according to the mean positions of their feedback similarities (m(S)) between MPL and the initial MSA record displayed in Figure 2 (cf. footnote i and Figure 1). BOC: brother of CDON, cell adhesion associated molecule, oncogene regulated; BP: binding protein; Copb1: coatomer protein complex, subunit beta 1; EIF4EBP1: eukaryotic translation initiation factor 4E binding protein 1; Pnliprp1: pancreatic lipase related protein 1; regul.: regulated; Rps6ka4: ribosomal protein S6 kinase, polypeptide 4; Rps6kc1: ribosomal protein S6 kinase, polypeptide 1; SCTR: successful criteria of terminal reselection (for encoding numbers see fifth part of Figure 1(b) in SF3); VEGF: vascular endothelial growth factor. b) FrS: frequencies of limited co-localizing similarities (i.e. MUSAS mentioned in Sections 2.4, WP2.3.3-4 and WP5.7); mBS: maximum bit score of the selected BLAST similarities; NS: nucleotide sequence; PS: protein sequence; *: minimum of two alternative score maxima or MUSAS numbers was recorded in cases of two co-evaluated searches. c) sp: species origin (current items: Hu, human; Mu, Mus musculus; reference NITR-related item: Mm, Miichthys miiuy); str: strategies yielding topical MRNS (sections WP5.6 and WP5.7). Capitals and numbers in abbreviations of Entrez restriction (section WP2.3.1): C, K: cancer- and phosphorylation-related special BLASTN searches; DX: double combined approaches comprising pairs of differently adjusted TBLASTX searches, only if they yield co-localizing similarities limited by 30 or 35 bits; S: global BLASTN searches limited by 35 bits; T: ternary combined approach (TCA, limit 25 bits); X, NX, N after T: TCA-accompanying pairs of BLAST searches (selecting repeatedly co-localizing similarities; cf. WP2.3.3), i.e. 1) two differently adjusted TBLASTX, 2) BLASTN and TBLASTX and 3) two BLASTN differently employing two MNSQ, respectively; 1, 2, 12: items selected by MNSQ1 or MNSQ2 and both MNSQ, respectively. d) Haa + Waa: occurrence of effectively located hypermutation motifs (HM*) including W-pairs (for details see sections 3.4 and 4.4). Upper rows: left/right evaluation using plus or minus: WRCH/TCW are present or are not present at the critical positions, respectively. Bottom rows: #aWi, #pWi: W-pairs located at DNA strands anti-parallel or parallel with respect to the direction of transcription, respectively (i denotes the number of unclassified nucleotides inserted between the observed WRCH; if n substitutes the number i, then WRCWRCH is indicated); #aRW, #pRW: palindromic W-pairs RGYWRCY including a WRCY unit critical with respect to aa alteration in antiparallel or parallel (transcribed) DNA strands, respectively. e) SEM: sequences enveloping (and including) selected MPL-encoded peptides; gray background covering edges of peptide segments: chains extending central MRNS-encoded peptide sequences; underlined: predicted or empirically confirmed phosphorylated aa; CD: conserved domains; cdsBS, cdsE: conserved domain similarity-related bit scores and Expects, respectively. Lower-case letters (LC) in peptide sequences indicate aa which could be altered during hypermutation changes via cytidines of HM (HM*-cytidines): LC and LC with―sole HM*―cytidine form WRCH and TCW, respectively; LC with *: two HM*-cytidines are present in two neighboring but antiparallelly located positions of palindromic WRCH; LC followed by dominant #: WRCH containing HM*-cytidine composes a W-pair (cf. Sections 3.4 and 4.4). f) pos + aa: phosphorylated aa is denoted by a single character, which is accompanied by the number of the given aa position in the displayed peptide; NetP, KinP: scores obtained by NetPhos2.0 and KinasePhos2.0, respectively (maximum is equal to unity); spec: specificity of kinase-mediated phosphorylation (identified by KinasePhos2.0). g) Seven experimentally proved PPS exhibited overlaps with peptides encoded by MPL (MEP), whereas phosphorylation sites of four of these PPS directly formed part of MEP. In addition, two proved PPS with phosphorylation sites located at positions immediately neighboring MEP were encoded by sequences containing HM* occurring within MPL. The residual (seventh) confirmed PPS (with the site outside MEP) then overlapped MEP encoded by MPL containing a double-active W-pair (i.e. two HM* forming a W-pair). F, P: database record of existing PPS was found using Phosida or Phospho.ELM, respectively; *: indicator of empirical confirmation recorded in the given databases. h) Only nine groups of protein kinases (PK) were predicted or empirically proved to phosphorylate the displayed MEP: Aur: Aurora-related kinase; ATM: Ataxia-telangiectasia mutated (kinase); CK1, CK2: casein kinases 1 and 2, respectively; GSK3: glycogen synthetase kinase-3; MAP3K: MAPK kinase kinase (=MEKK1); MAPK: mitogen activated PK; PKB: PK B; PKC: PK C; PLK1: polo-like kinase 1; absence of abbreviation in a database-confirmed case: only the phosphopeptide was observed without knowledge about PK specificity. i) Align versions of BLAST were used in feedback comparison (see WP2.4.1). O, S, m(S): MSA-record-related positions of deduced sequence overlaps, feedback similarities, and mean aa position of these similarities (cf. footnote a), respectively. j) The segments of the MSA record present in Figure 2 formed feedback (BLASTX, TBLASTN or TBLASTX; cf. section WP2.4.1) similarities at defined IgV-related positions of this “initial” MSA record (PIM): FR1C: framework region 1 including CSB1-related overlaps (PIM from 5 to 15 - 22); CDR1L: at least 50% overlap of CDR1light (PIM: 23 - 30); CDR1a: C-terminal part of CDR1 in light chains and co-localizing CDR1 of heavy chains (PIM: 31 - 35); CDR1s: prevailing overlaps of CDR1light but short of CDR1all (PIM: 23 - 35); FR2N: N-terminus of framework region 2 (PIM: 36 - 42; the most conserved part of this framework region in Figure 2). *: different classification follows from NS differences. For additional comments see Figure 2, Section WP4.2.2 and SF2.
maintenance. Based on this statistically supported explanation, we derived a limit of 0.88 for the score products, which was subsequently used in the terminal reselection. For important details see Sections WP2.4.1, WP3.2, WP3.3 and file SF3.
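The reselection rule stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names and the example values are hypothetical, and treating the limit as inclusive (a product of exactly 0.88 passes) is an assumption, since the text does not say whether the bound is strict. The per-predictor minimum of 0.800 is the value quoted in the caption of Figure 6.

```python
from dataclasses import dataclass

MIN_SINGLE_SCORE = 0.800   # per-predictor minimum quoted in the Figure 6 caption
MIN_SCORE_PRODUCT = 0.88   # limit for score products quoted in the text


@dataclass(frozen=True)
class PredictedSite:
    """One predicted phosphorylation site (field names are hypothetical)."""
    peptide_id: str
    position: int
    s1: float  # maximum NetPhos 2.0 score
    s2: float  # maximum KinasePhos 2.0 score

    @property
    def score_product(self) -> float:
        return self.s1 * self.s2


def passes_reselection(site: PredictedSite) -> bool:
    """Apply both limits; the inclusive comparison is an assumption.

    For scores bounded by 1, a product >= 0.88 already implies that each
    single score is >= 0.88, so the first two checks are kept only to
    mirror the two-stage description in the text.
    """
    return (
        site.s1 >= MIN_SINGLE_SCORE
        and site.s2 >= MIN_SINGLE_SCORE
        and site.score_product >= MIN_SCORE_PRODUCT
    )


if __name__ == "__main__":
    # Made-up example values, not data from this study.
    for site in (PredictedSite("pep1", 5, 0.95, 0.95),
                 PredictedSite("pep2", 7, 0.90, 0.90)):
        print(site.peptide_id, round(site.score_product, 4), passes_reselection(site))
```

Because both scores have a maximum of unity, the product limit is the stricter of the two conditions.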
3.4. Final Bioinformatic Analysis of the Selected Pairs of MEP and MPL
About eighty percent of the regularly restricted human and mouse MEP achieved a superior score related to ATM. As with PPSIg, some of the ATM-related sites predicted here also achieved a co-dominant prediction of PPS
Figure 6. Distribution of score products evaluating predicted phosphorylation sites encoded by MPL. The model MPL subset of predicted sequence items was selected before the overlap-, Thr-, Tyr- and database-related completions of data (cf. Figure 1 and WP2.4.1). This subset included thirty-two MPL-encoded peptide sequences containing forty predicted phosphorylation sites, which are represented here by the products of their maximum predicted scores (s1*s2). The score maxima were obtained individually with NetPhos 2.0 (s1) and KinasePhos 2.0 (s2) in searches equally restricted by the limiting minimum value 0.800 (cf. Figure 5). The x axes of the displayed graphs comprise score-product intervals of width 0.01. The upper part shows a histogram of the score-product frequencies. The frequencies of the evaluated score products form a main peak in the range of upper values. The corresponding significant prevalence of high score products (p < 0.05) perhaps reflects the existence of phylogenic pressures maintaining PPS function, and it enabled us to assess the reselection limit empirically. In addition, we simultaneously diminished the effects of possible overestimations (i.e. false negativities) that frequently accompany pre-calculated restriction of combined selection. For other comments see Sections 2.1.2, 3.3 and WP3.2. The middle part contains the widely used gliding mean values of frequencies, each computed from three neighboring (interval-related) frequencies of the upper histogram. Each gliding mean value is displayed at the position of the central (second counted) interval, thus yielding the simplest form of smoothed histogram. The bottom part contains a more adequate/specific form of histogram smoothing based on linear combinations formed from the central and neighboring interval-related frequencies and Gaussian kernel functions, whose evaluation takes into consideration the empirical statistics of the differences between the corresponding score pairs s1 and s2 (for details see Section WP3.3).
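The three panels described in the caption of Figure 6 can be sketched as follows. This is a minimal illustration, not the procedure of Section WP3.3: the function names are hypothetical, the lower histogram bound of 0.64 simply follows from both scores being at least 0.800, and the kernel width `sigma_bins` is a free parameter here, whereas the caption states that the kernel was derived from the empirical differences between s1 and s2.

```python
import math

BIN_WIDTH = 0.01  # interval extent stated in the caption


def histogram(products, lo=0.64, hi=1.0, width=BIN_WIDTH):
    """Count score products in intervals of the given width.

    lo = 0.800 * 0.800 is the smallest product possible when both scores
    are restricted to >= 0.800; hi = 1.0 is the largest possible product.
    """
    n_bins = round((hi - lo) / width)
    counts = [0] * n_bins
    for p in products:
        if not lo <= p <= hi:
            raise ValueError(f"score product {p} outside [{lo}, {hi}]")
        # Rounding guards against floating-point noise at interval borders.
        idx = math.floor(round((p - lo) / width, 9))
        counts[min(idx, n_bins - 1)] += 1  # p == hi goes into the last bin
    return counts


def gliding_mean(counts):
    """Three-point gliding mean placed at the central interval.

    The two edge intervals have no complete triple and are left as None.
    """
    out = [None] * len(counts)
    for i in range(1, len(counts) - 1):
        out[i] = (counts[i - 1] + counts[i] + counts[i + 1]) / 3
    return out


def gaussian_smooth(counts, sigma_bins=1.0):
    """Normalized Gaussian-weighted linear combination of all intervals.

    sigma_bins (kernel width in units of intervals) is an assumed free
    parameter, not the empirically derived value used in the study.
    """
    smoothed = []
    for i in range(len(counts)):
        weights = [math.exp(-0.5 * ((i - j) / sigma_bins) ** 2)
                   for j in range(len(counts))]
        total = sum(weights)
        smoothed.append(sum(w * c for w, c in zip(weights, counts)) / total)
    return smoothed
```

The gliding mean weights the three intervals equally, whereas the Gaussian version lets the weight of a neighboring interval decay with its distance from the central one.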
phosphorylated by Aurora and protein kinase B. Interestingly, MPL were not always translated in the same reading frame or orientation as the compared IgV-related segments of the initial MSA record. Five MEP formed chains with valid conserved domain similarity to Ig domains. Four of them achieved feedback similarity with the segment CDR1light in the initial MSA record. However, feedback similarities overlapping common CDR1all segments were found in only five cases of Ig-unrelated MEP/MPL (cf. Figure 5 and Section WP4.2.2; for explanation of CDR1light and CDR1all see Figure 2). Only the pairs of human and mouse orthologues of HIPK2 contained equally co-localized MPL in Figure 5.
The DNA-strand-related occurrence of well-known hypermutation motifs (HM;     ) was evaluated after selecting the HM subset (HM*) located at MPL sequence positions critical with respect to aa alteration. Interestingly, the number of HM* was significantly higher (p < 0.05) in the non-transcribed DNA strands of MPL than in the transcribed ones when the strand-related occurrences of HM* were compared with those of HM incapable of altering aa (non-HM*). The corresponding significant increase also concerned certain WRCH-related pairs (W-pairs; OR* = 11.6, but in an extremely small set of 22 items) determined by the patterns: RGYWRCY, WRCWRCH, WRCHNWRCH and WRCHN(9)WRCH (see Section 4.4 and previous papers   ). For the graphs of HM*-related OR see Figure 4. The occurrences of HM*-related W-pairs in framework-associated and CDR1-associated regions of MPL (determined by feedback comparison) also differed, giving a 3.85 times higher density of such W-pairs in CDR1-associated regions. In addition, we found three types of MPL relationships interesting with respect to HM* occurrence. These concerned: 1) knocking-out alteration of the predicted phosphorylated serine in twelve human and mouse MPL and one reference MPL of bony fish origin, 2) the distinct relationship of the two HM* (WRCH and TCW) to the transcribed DNA strand in the site encoding phosphorylated serine, which follows from the genetic code, and 3) the maximum number of HM* per single MPL (two different MPL triads distinctly contained four times such WRCH or such TCW). For additional details and comments see Figure 5, Sections 4.4, WP2.4.1 and WP4.2.1-3.
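As an illustration of how strand-related occurrences of the motifs named above can be located, the following minimal Python sketch expands the IUPAC motif strings into regular expressions and scans a sequence together with its reverse complement. The function names, the `given`/`opposite` strand labels and the toy sequence are assumptions made for illustration. The sketch does not decide which strand is transcribed, and it does not reproduce the HM* selection (positions critical for aa alteration) used in this study.

```python
import re

# IUPAC nucleotide codes needed for the motifs named in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "R": "[AG]", "Y": "[CT]", "H": "[ACT]", "N": "[ACGT]",
}

# Single motifs and W-pair patterns; N(9) is written out as nine N.
MOTIFS = {
    "WRCH": "WRCH",
    "TCW": "TCW",
    "RGYWRCY": "RGYWRCY",
    "WRCWRCH": "WRCWRCH",
    "WRCHNWRCH": "WRCHNWRCH",
    "WRCHN(9)WRCH": "WRCH" + "N" * 9 + "WRCH",
}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_motif(seq, motif):
    """Return 0-based start positions, including overlapping occurrences."""
    body = "".join(IUPAC[ch] for ch in motif)
    # A lookahead keeps the regex engine from skipping overlapping hits.
    return [m.start() for m in re.finditer("(?=" + body + ")", seq)]


def scan_both_strands(seq, motifs=MOTIFS):
    """Map motif name -> start positions on the given and opposite strand.

    Positions on the opposite strand are converted back to coordinates of
    the given strand (leftmost covered base).
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    n = len(seq)
    hits = {}
    for name, motif in motifs.items():
        opposite = sorted(n - start - len(motif) for start in find_motif(rc, motif))
        hits[name] = {"given": find_motif(seq, motif), "opposite": opposite}
    return hits


if __name__ == "__main__":
    # Toy sequence, not taken from any MPL of this study.
    for name, found in scan_both_strands("AGCTTCA").items():
        print(name, found)
```

Which of the two strands counts as transcribed depends on the orientation of the gene, so the assignment has to be supplied separately for each MPL.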
4.1. Structures Found in Initial MSA-Record
Both sequence patterns displayed in Figure 2 exhibit a certain relationship to the published data. The first two specified aa of the CSB1-related pattern CX(10,13)WXXQXP, i.e. cysteine and tryptophan, are well-known positional markers delimiting CDR1. Cysteine is the current C-terminal aa of FR1 composing light chains of IgV, whereas tryptophan mostly occurs in the N-terminus of FR2 in both light and heavy chains of IgV   . Trans-cis isomerization of the C-terminal aa of the same pattern, i.e. proline, is probably important for local flexibility of IgV   . Another, CSB2-related pattern identified here, LX(8)DX(3)YXC, contains the sequence DX(3)YXC identical with the pattern found in non-vertebrate molecules closely related to AR  .
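For readers who wish to test these two patterns on their own sequences, the following minimal Python sketch converts the notation used here (X for any residue, X(n) or X(n,m) for repeats) into a regular expression. The function name is hypothetical, and the example fragment is an illustrative kappa light-chain-like sequence chosen by us as an assumption; it is not taken from the MSA record of Figure 2.

```python
import re


def pattern_to_regex(pattern):
    """Translate patterns such as CX(10,13)WXXQXP into a compiled regex.

    X        -> any single residue
    X(n)     -> exactly n residues
    X(n,m)   -> between n and m residues
    Other letters are taken literally as one-letter aa codes.
    """
    def expand(match):
        repeat = match.group(1)
        return "[A-Z]" + ("{" + repeat + "}" if repeat else "")

    return re.compile(re.sub(r"X(?:\((\d+(?:,\d+)?)\))?", expand, pattern))


CSB1_PATTERN = pattern_to_regex("CX(10,13)WXXQXP")
CSB2_PATTERN = pattern_to_regex("LX(8)DX(3)YXC")

if __name__ == "__main__":
    # Illustrative fragment (an assumption, not data from this study).
    fragment = "TITCRASQSISSYLNWYQQKPGK"
    hit = CSB1_PATTERN.search(fragment)
    if hit:
        print(hit.start(), hit.group())
```

In the illustrative fragment the residues between the matched cysteine and tryptophan correspond to the CDR1 position delimited by these two markers.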
The marked differences between the occurrence of at least anti-cohesive columns in CDR1 and other two hypervariable regions (cf. Section 3.1) indicate higher conservativeness of usually effectively hypermutating CDR1. If we also consider location of CDR1 in CSB1 and PKSIg    (cf. also Figure 2), the indicated conservativeness supports as an additional fact the model importance of N-terminally located PPSIg region for studies of antibody-like PPS.
4.2. Methodological Aspects
The bilingual approach employed here was in fact based on simultaneous evaluation of 1) the occurrence of HM present in MPL composing IgV-related mRNA or cDNA segments (the first formal language; Figure 1), 2) existing or predicted phosphorylation of MPL-encoded peptides extended by their immediately neighboring chains (the second formal language) and 3) possible alteration of aa composing the selected PPS via HM (site-specific translation). Due to the various actual reading frames of MPL, the bilingual approach determined a larger set of MPL than would simply follow from searches for protein sequence similarities.
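Point 3 above, the site-specific translation step, can be illustrated by a short Python sketch that asks whether a change of the motif cytidine would alter the encoded aa. Several simplifications are assumed and are not taken from this study: only WRCH and TCW are considered, deamination is modelled as a plain C>T transition on the strand carrying the motif (read as G>A on the non-transcribed strand when the motif lies on the transcribed strand), the standard genetic code is used, and all names and toy sequences are hypothetical.

```python
import re
from itertools import product

# Standard genetic code in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Motif regex and offset of the targeted cytidine within the motif.
MOTIFS = {"WRCH": ("[AT][AG]C[ACT]", 2), "TCW": ("TC[AT]", 1)}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def aa_altering_sites(cds, frame=0):
    """List motif cytidines whose C>T change alters the encoded aa.

    cds is the non-transcribed (mRNA-like) strand; frame is the 0-based
    offset of the first complete codon. Returns tuples of
    (motif, strand, position in cds, codon, old aa, new aa).
    """
    cds = cds.upper()
    n = len(cds)
    strands = (("non-transcribed", cds), ("transcribed", reverse_complement(cds)))
    sites = []
    for strand, seq in strands:
        for name, (regex, c_offset) in MOTIFS.items():
            for m in re.finditer("(?=" + regex + ")", seq):
                pos = m.start() + c_offset
                if strand == "non-transcribed":
                    cds_pos, new_base = pos, "T"          # C>T read directly
                else:
                    cds_pos, new_base = n - 1 - pos, "A"  # C>T seen as G>A
                codon_start = cds_pos - ((cds_pos - frame) % 3)
                if codon_start < 0 or codon_start + 3 > n:
                    continue  # cytidine lies outside a complete codon
                codon = cds[codon_start:codon_start + 3]
                k = cds_pos - codon_start
                mutated = codon[:k] + new_base + codon[k + 1:]
                old_aa, new_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
                if old_aa != new_aa:
                    sites.append((name, strand, cds_pos, codon, old_aa, new_aa))
    return sites


if __name__ == "__main__":
    # Toy input with two serine codons (AGC and TCA); not an MPL of this study.
    for site in aa_altering_sites("AGCTCA"):
        print(site)
```

In the toy input, the serine codon TCA is altered through a TCW on the non-transcribed strand, whereas the serine codon AGC is altered only through a WRCH on the transcribed strand, because the C>T change of its third-position cytidine is silent.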
To unify the selection of MRNS (cf. Figure 1), we supplemented here a simplified BLAST-derived, fuzzy-system-associated restriction of CSB (Section 2.2.3) consistent with ELEMS principles   . The unified usage of BLAST-related evaluation in the three starting main steps (cf. Figure 1) does not mean that we reject possible prediction by means of machine-learning methods such as SVM or neural networks  -  . Rather, we assume that some results obtained in our searches represent important starting information for future machine learning. This concerns, among others, 1) feedback similarities and 2) reevaluation of similarity extents and cooperative effects related to different elementary MNSQ units participating in or associated with the MPL selection process.
4.3. Statistical and Phylogenic Contexts of Employed Search Procedures
In fact, we did not observe here NS mutations in cell clones, but only potential NS mutations. Consequently, we considered here only inherited changes of DNA when looking for an explanation of the unexpected prevailing occurrence of cancer-related NS items in the human set described in Section 3.3. This consideration focused our attention first of all on 1) several events occurring in germ-line cells (GLC) or 2) parameters of these events, i.e.: a) prevention of DNA attacks by oxygen radicals in GLC, b) extents or specificity of meiotic hypermutation (comprising also the described changes in Ig-related HM  ) and its possible effects on genome plasticity, c) percentage of mutated descendants bred by old individuals and d) immune surveillance on GLC. Better understanding of the observed and potential relationships, including the phylogenically interesting (Elasmobranchii-related) MNSQ1 linkage (cf. Sections 2.3, 3.3 and Figure 4), also needs more detailed analysis on a larger set of PPSIg-related molecules and MPL in future.
In accordance with our working hypothesis  , at least some MPL displayed here (encoding most probably ATM-phosphorylated MEP; cf. section 3.4) and PPSIg segments originate in common ancestral oligonucleotide segments of possible increased mutability. This hypothesis appears to be in agreement with the displayed strong and significant association of score products with superior score values (Figure 6; Sections 3.3 and WP3.2) and existence of four molecules containing MPL/MPL-like segments most frequently (i.e. four times) selected in terminal reselection (see the first column of Figure 5 and section WP4.3). The latter possibility concerns: 1) NITR2 of close phylogenic relationship to AR  otherwise only near the limits of PPS prediction (cf. WP4.3), 2) two molecules containing MPL with Ig domain context molecules (KDR of superior PPS-related scores, oncogene regulated adhesion molecule BOC); 3) MPL of glioma tumor suppressor candidate region gene 1 (Gltscr1). In contrast to other molecules, Gltscr1 did not achieve any conserved Ig domain similarity, the corresponding CDART-indicated relationship to Ig superfamily (IgSF) representatives or local PPSIg-related fold similarity in records obtained with FFAS03 program (cf. section 3.1). To explain this anomalous sequence behavior of Gltscr1, we assumed participation of well-known mechanisms such as a) transposition of short repeats, b) gene conversion and/or c) perhaps also functionally conditioned convergence of DNA segments (cf.  ).
4.4. MPL in the Light of Hypermutation Events
Collisions of the transcription apparatus (synthesizing the mRNA copy of the transcribed strand of DNA) with APOBEC enzymes interacting with HM present in the transcribed DNA strand (CTAE) are able to enlarge the hypermutation effects of these deaminases. This occurs due to 1) extended low-fidelity repair of the generated double-strand breaks with error-prone DNA polymerases and 2) accompanying insertion/deletion changes  -  . In accordance with this mechanism, the observed significant differences in the distributions of HM* in transcribed and non-transcribed DNA strands (Section 3.4; Figure 4) can be explained by similarly selective CTAE-triggered events. Since the comparison of DNA strands does not concern altered nucleotide sequences, inherited but not somatic changes have to be considered, which looks like a contradiction when dealing with somatic mutation changes. Nevertheless, HM WRCH also participates in inherited meiotic hypermutation via a similar though less frequent mechanism than in the case of Ig gene hypermutation  . Consequently (and in accordance with the less frequent incidence of meiotic mutations), the statistics of HM* occurrence in non-altered DNA strands (see Figure 4) suggests parallel somatic mutation changes via CTAE events in at least some transcribed DNA strands encoding MEP peptides and containing HM*. In accordance with this parallelism, HM* located in their transcribed DNA appear to be more perilous than others, thus indicating MEP of probably increased hypermutation risk in molecules such as human RhoGEF kinase, MAGEE1, KDR, zinc finger protein 687, rhabdomyosarcoma FBO1A, mouse anaplastic lymphoma kinase and zinc finger protein 619 and both reference AR-related NITR molecules of bony fish origin present in Figure 5 or Table WPT2 (see Section WP4.3).
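The strand comparison referred to above (HM* versus non-HM* in non-transcribed versus transcribed strands) has the form of a 2x2 contingency table. The following Python sketch shows how an odds ratio and a one-sided exact test can be computed for such a table. It is a generic illustration only: whether OR* in Section 3.4 is defined in exactly this way is not stated here, the function names are hypothetical, and the counts are made up rather than taken from Figure 4.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio of the 2x2 table [[a, b], [c, d]].

    Rows: non-transcribed / transcribed strand.
    Columns: HM* / non-HM*.
    """
    if b == 0 or c == 0:
        return float("inf")
    return (a * d) / (b * c)


def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the upper-left cell under fixed table margins.

    Uses the hypergeometric distribution of the one-sided Fisher exact test.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(comb(row1, k) * comb(n - row1, col1 - k)
                     for k in range(a, upper + 1))
    return favourable / comb(n, col1)


if __name__ == "__main__":
    # Made-up counts for illustration only.
    table = (3, 1, 1, 3)
    print(odds_ratio(*table), fisher_one_sided(*table))
```

With very small sets, such as the 22 W-pair items mentioned in Section 3.4, an exact test of this kind is generally preferable to a large-sample approximation.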
In accordance with the theories of ageing, abnormal phosphorylation (or loss of phosphorylation) is assumed to be an alteration possibly important during ageing  . Abnormal phosphorylation can be caused, among others, by mutation or hypermutation changes of PPS generating knocked-out, alternatively reacting or weakly reacting PPS. If we assume that such PPS changes frequently occur in different PPS phosphorylated by the same PK, then their change can imitate the loss or decrease of the involved PK activity. In accordance with this assumption, about eighty percent of MEP appear to be dominant substrates of ATM (Section 3.4), which is the reason to consider possible imitation of functional loss of ATM via mutation of these PPS. As is well known, ATM is involved in regulation of the double-strand-break response (cf. Introduction). Functional loss of mutated ATM causes an autosomal recessive disease (ataxia-telangiectasia; AT) characterized by a median survival of 19 - 25 years (a wide range) and death due to cancer and respiratory failure  . In addition, AT is also accompanied by neurodegeneration and immunodeficiency or a worsened immune response otherwise frequently appearing in elderly people  . Consequently, a substantial question arises: Can changes in sequences of PPSIg/MPL-related PPS influence important regulatory or even ageing-related effects of ATM?
Some MPL are not translated in the same reading frame as PPSIg (cf. Section 3.4). The corresponding peptides (M_out) thus can avoid cross-reactivity with Ig epitopes, which mostly contain only “in frame” insertions and deletions   . This raises the question of possible usage of at least some mutants of M_out as components of future complex/multicomponent anticancer vaccines. On the other hand, it is a question whether at least some MPL-derived peptides encoded in the same reading frame as PPSIg (M_in) or their mutants can cross-react with rheumatoid autoantibodies. A positive answer could be important, among others, for protein engineering improving autoantibody specificities with respect to M_in mutants.
Though the databases of empirically proved phosphorylation sites are considerably incomplete, seven MEP contain such database-confirmed sites. The nonrandom main peak of MPL-derived products of phosphorylation-related prediction scores then suggests the phosphorylation of at least some additional MEP (Section 3.3; Figure 6). The significantly prevailing occurrence of HM* and the corresponding W-pairs in non-transcribed DNA strands of MPL indicates certain mutation events in the corresponding gene segments, including parallel somatic changes (Section 4.4; Figure 4). In accordance with these parallel changes, MPL segments with HM* present in the transcribed DNA strand appear to be most critical (for the list see Section 4.4). The occurrence of HM* in MPL also meets three important functional aspects. First of all, most MPL (about 80%) encode peptides identical with or predicted as phosphorylated by ATM molecules, known for their relationship to aging and regulation of the DNA double-strand-break response (Sections 3.4 and 4.4). Secondly, cancer-related MRNS are significantly more frequent in humans than in mice (Figure 4; Sections 3.3 and 4.3). Thirdly, feedback similarities of MPL containing W-pairs co-localized mainly with (hypervariable) CDR1 segments of IgV chains forming both MNSQ (Table 1, Figure 2 and Figure 5, Section 3.4). These three functional aspects moreover concern several existing groups of MPL with extreme properties, e.g. 1) MPL with HM* directly encoding the predicted phosphorylated amino acid, 2) CDR1-related MPL containing superior numbers of HM* and 3) MPL/MEP pairs frequently selected in the alternative steps of terminal reselection (Figure 5; Sections 4.3, 4.4 and WP4.2.1-3).
Though the actual carcinogenic effect of mutation changes in phosphorylation sites has been known for a long time  -  , the set of the corresponding investigated sequences is not sufficiently large. In accordance with this fact, we have not yet found experimental confirmation of mutation changes in the seven existing PPS segments (displayed in Figure 5) in the literature. Consequently, these PPS, as well as the displayed predicted segments, represent an inspiration and challenge for subsequent experimental research in specialized laboratories and perhaps also for subjects interested in certain bioinformatic trends (cf.   , Sections 4.2 and 4.4; see also below). In our opinion, future experimental research should comprise 1) comparative sequence studies of ATM-related MPL composing not only DNA of cancer patients (for the corresponding methods see also our papers  -  ) but also DNA of old people and 2) usage of phage-displayed libraries or protein engineering in case of the considered immunological relationships (  -  and Section 4.4). Theoretically based bioinformatic investigation of the immunological relationships could moreover efficiently select the subset of possible candidates for vaccine-related promiscuous epitopes specifically recognized by cytotoxic T cells (  -  and Section 4.4), even in the case of a more extended HM*-related set of MEP mutants than that displayed in Figure 5 (for other corresponding bioinformatic aims and trends see Section WP5.8). In addition to the preceding medicinal aspects, an interesting question refers to a possible phylogenic relationship between IgV domains and certain NS encoding peptide segments a) containing several near ATM-related PPS and b) composing for instance immunoglobulin-like archaeal surface layer proteins or related metazoan cell surface proteins   .