Novel Quantitative Approach for Predicting mRNA/Protein Counts in Living Cells

Show more

1. Introduction

Randomness, or noise, in biological systems has long been predicted from basic physical principles [1] - [8] and later on by observations of phenotype heterogeneity [7] . But the confirmation came later with [9] , [10] and [11] who showed that mRNA and protein variability may lead to important a source of noise in biology. Researches in [12] , [13] , [14] and [15] have reported that the number of proteins translated from an mRNA obeys a geometric distribution but the distribution describing the number of protein remaining once mRNA is degraded will no longer be geometric. Various techniques have so far been used to monitor and capture those numbers among which fluorescent probes or green fluorescent protein variants which allow the quantification of protein levels in living cells by flow cytometry or fluorescence microscopy [16] [17] . The first quantitative study collectively examines the noise associated with the principal step of central dogma of molecular biology in replication, gene activation, transcription, translation and the enslaving intracellular environment, and suggested that autorepression of replication and transcriptions suppresses noise. This then leads to examination (by analysis, modelling and simulation) of the role of noise in biology relying on the similarity between biological and engineering systems―see [7] , [10] and [18] . In general, noise may be considered either intrinsic or extrinsic to a specific gene circuit, and within a specific gene circuit there are three different effects of noise: i) noise is negligible with little or no influence over function; ii) noise is detrimental to function and gene circuit; iii) noise is important for circuit function, and by using simple assumptions, it is possible to evaluate these effects. The assumption we use in this paper is dynamic correlation between the noise level of molecules (mRNA/protein) and the change in the probability of having those molecules in given interval of time. Our paper is organised as follows. In Section 2, we introduce our model of the dynamic of the number of mRNA and proteins after a brief review of previous models. In Section 4, we present our method and algorithm for solving the (mRNA and protein) prediction problem. In Section 5, we present the simulation results, followed by a discussion of those results, and end this work in Section 6 with a short conclusion.

2. Somes Examples and Motivations

2.1. Birth-Death Model

To understand noise in biological systems, biochemical circuits and genetic networks are often used as the measured noise properties to elucidate the structure and the function of the underlying gene circuit [6] [8] . Also recent researches [13] and [14] have clearly established the existence of dynamic correlation between genetic network and mRNA/protein variability. In the next section we will present previous models with their strengths and weaknesses. The preliminary model used was a simple birth-death Markov process which captures noise in a biochemical process. This model showed that noise in the population was a consequence of the change in the parameters of the system over time and was used to explore the temporal change of the number of proteins in a biological system. The time course of the number of proteins was modelled consequently by the equation

$\frac{\text{d}n\left(t\right)}{\text{d}t}=\alpha -\gamma n\left(t\right)$ (1)

with parameters $\alpha $ representing the rate of production and $\gamma $ the rate of decay of number of proteins $n\left(t\right)$ . However, such continuous time formulation neglects the discrete nature of proteins and the random timing of molecular transition [17] because the actual time evolution may follow any one of a number of trajectories, and hence sufficiently many trajectories have to be examined to obtain statistics that converge. In the next section a probabilistic approach using the extended versions of Kolmogorov’s equations is used to explore randomness in the system.

2.2. Kolmogorov’s Equations Based Model

In general, the Kolmogorov’s equations are used as master equation to capture the distribution of chemical components of the gene circuit over time. The state of the system is defined by a vector $n\left(t\right)=\left({n}_{1},{n}_{2},\cdots ,{n}_{N}\right)$ , where ${n}_{i}\left(t\right)$ represents the i-th component of molecule n at time t; ${a}_{i}$ and ${v}_{i}$ are internal parameters representing respectively the propensity of the dynamic and the actual change in ${x}_{i}$ , resulting from the change in the previous state The probability, $p\left(n,t\right)$ , that the system evolves into the state $n\left(t\right)=\left(n,t\right)$ at time t is described by the following partial differential equation:

$\frac{\partial p\left(n,t\right)}{\partial t}={\displaystyle \underset{j=1}{\overset{N}{\sum}}{a}_{j}\left(n-{\nu}_{j}\right)p\left(n-{\nu}_{j}\right)}-{a}_{j}\left(n\right)p\left(n\right)$ (2)

This equation makes sense only if we assume that the probability for two or more reactions to occur in the time interval
$dt$ is negligible compared to the case when only one reaction occurs. In addition, (2) can only be solved numerically for relatively simple systems. In a recent work by [15] , a similar mathematical model was used for gene expression and an approximate solution was proposed to the

$\begin{array}{l}\frac{\partial {P}_{m,n}}{\partial t}={\nu}_{0}\left({P}_{m-1,n}-{P}_{m,n}\right)+{\nu}_{1}\left({P}_{m,n-1}-{P}_{m,n}\right)\\ \text{}+{d}_{0}\left[\left(m+1\right){P}_{m+1,n}-m{P}_{m,n}\right]\\ \text{}+{d}_{1}\left[\left(n+1\right){P}_{m,n+1}-n{P}_{m,n}\right]\end{array}$ (3)

The meanings of the rates in (3) are:
${\nu}_{0}$ is the probability per unit time of transcription,
${\nu}_{1}$ the probability per unit of translation,
${d}_{0}$ the probability per unit time of degradation of an mRNA, and
${d}_{1}$ the probability per unit time of degradation of a protein. The authors use a particular generating function and transform (3) into a first order

3. The New Model

Our setup is motivated by the necessity to overcome the limitations from the previous models by increasing the cell numbers and relaxing the restriction on constant parameters. We propose a new, flexible, and more general, model for a population of N cells. This model is an extended version of the previous Kolmogorov’s equation with additional cell-dependent constraints.

$\{\begin{array}{l}\frac{\partial p\left(m,t/{m}_{0},{t}_{0}\right)}{\partial t}={\displaystyle \underset{j=1}{\overset{N}{\sum}}{a}_{j}\left(m-{\nu}_{j}\right)p\left(m-{\nu}_{j},t/{m}_{0},{t}_{0}\right)}-{a}_{j}\left(m\right)p\left(m,t/{m}_{0},{t}_{0}\right)\\ \frac{\partial p\left(n,t/{n}_{0},{t}_{0}\right)}{\partial t}={\displaystyle \underset{j=1}{\overset{N}{\sum}}{a}_{j}\left(n-{d}_{j}\right)p\left(n-{d}_{j},t/{n}_{0},{t}_{0}\right)}-{a}_{j}\left(n\right)p\left(n,t/{n}_{0},{t}_{0}\right)\text{}\end{array}$ (4)

The parameters of the model have an autoregressive form:

$\{\begin{array}{l}{a}_{j}\left(m\right)={\vartheta}_{1}{a}_{j-1}\left(m\right)+{\theta}_{1}\\ {a}_{j}\left(n\right)={\vartheta}_{1}{a}_{j-1}\left(n\right)+{\theta}_{1}\end{array}$ (5)

The transcription, translation and degradation rates are assumed to vary from one cell to another as

${\nu}_{j}={\nu}_{0}{\text{e}}^{-0.005j}$ and ${d}_{j}={d}_{0}{\text{e}}^{-0.001j}$ (6)

We assume for $k\in \left[0,N\right]$ , the first ${\nu}_{1},\cdots ,{\nu}_{k}$ are sequences of transcription rates and the late ${\nu}_{k+1},\cdots ,{\nu}_{N}$ are sequences of translation rates with ${\nu}_{0}$ being the fixed initial rate. Our model, which is composed of the Equations (4)-(6), is well adapted to various real biological promoter change. We shall notice that Equation (4) is a system of 200 equations with 100 by 2 unknowns, which is likely to be only numerically solvable after some good approximations. To efficiently predict the number of mRNAs and proteins over time, we shall rely on the following assumptions.

Proposition 1. Over time the number of mRNAs/Proteins is perfectly correlated with the probability mass functions of mRNAs m(t) and proteins n(t) respectively. That is, $m\left(t\right)=p\left(m,t\right){m}_{0}$ and $n\left(t\right)=p\left(n,t\right){n}_{0}$ , where ${m}_{0}$ and ${n}_{0}$ are initial measurements.

Proof: The proof follows from our algorithm and solution in this paper. ,

Proposition 2. Let $n=n\left(t\right)$ be the number of proteins and $\eta =\eta \left(n,\Delta t\right)$ be the noise generated by $n$ proteins (or m for mRNAs) in the same time interval $\Delta t$ . Then there exists a unique constant C such that $\eta \left(n,\Delta t\right)=Cp\left(n,\Delta t\right)$ which means that noise is cells is proportionally correlated to the probability distribution of protein and mRNA numbers.

Proof:

Let $\eta =\eta \left(n,\Delta t\right)$ , $\Delta n\left(t\right)=n\left(t\right)-n\left(t-1\right)$ be respectively the noise and the number of proteins in a cell. By the simple decomposition of numbers of mRNA/ proteins, $p\left(\Delta n\left(t\right)\right)=p\left(n,\Delta \left(t\right)\right)$ and $p\left(\Delta n\left(t\right)\right)=p\left(n\left(t\right)\right)-p\left(n\left(t-1\right)\right)$ , (by the additivity property of probability distribution. We also have, using the

definition, that $\eta =\frac{\Delta n}{\Delta t}$ , $p\left(\Delta n\right)=\frac{\Delta n}{N}$ and $\sum n}=N$ . This implies that $p\left(\Delta n\left(t\right)\right)=\frac{\Delta n}{N}\Rightarrow Np\left(\Delta n\left(t\right)\right)=\Delta n\left(t\right)=n\left(t\right)-n\left(t-1\right)$ multiplying the right side of above with $\frac{\Delta t}{\Delta t}$ and we obtain $N\times p\left(\Delta n\left(t\right)\right)=\frac{n\left(t\right)-n\left(t-1\right)}{\Delta t}\Delta t$ .

since $\eta \left(n,\Delta t\right)=\frac{n\left(t\right)-n\left(t-1\right)}{\Delta t}$

and $N\times p\left(\Delta n\left(t\right)\right)=\eta \left(n,\Delta t\right)\times \Delta t$

thus $\frac{N}{\Delta t}\times p\left(\Delta n\left(t\right)\right)=\eta \left(n,\Delta t\right)$

leading to $C\times p\left(n,\Delta \left(t\right)\right)=\eta \left(n,\Delta t\right)$

Finally we conclusion that $C=\frac{N}{\Delta t}$ . ,

Here we put $\Delta t=\frac{T}{{N}_{0}}$ where $T$ is the total time, ${N}_{0}$ the total number of

points in the simulation and $N$ is the total number of mRNA or proteins in a single cell. In the next section we introduce our method and algorithm for solving Equations (4)-(6).

4. Method and Justification

We propose a straightforward method of solving the above problem based on numerical approximation via the following algorithm. As the analytical solution to Equation (4) is (at least) hard to obtain, even for a “reasonable” number of cells, a numerical algorithm using an adapted stochastic simulation approach is proposed in this paper. In our algorithm, two random variables $m\left(t\right)$ and $n\left(t\right)$ determine the temporal evolution of the system. The variable ${\tau}_{k}$ is the time for the next event to occurs, the probability density of an event (appearance of m(t) or n(t)) is evaluated based upon our model (4), so as to give a better flexibility and applicability to the approach in comparison with previous ones. The main purpose of creating such an algorithm is to simultaneously simulate the process noise, while predicting the online probability mass function (»probability density) of each event over time. An important assumption here is that the hypothetical probability distribution functions (p.m.fs) of the translation and transcription rates are of the form ${\nu}_{j}~N\left(2,0.05\right)$ and the mRNA and protein degradation rates are ${d}_{j}~N\left(2,0.05\right)$ . This is in line with the existence of a one-to-one relation between the dynamic and distribution for predictable dynamic systems. We will present our algorithm in the next section of our work.

Our Algorithm

Input: Initial data ${m}_{0},{n}_{0}$

Outputs: ${P}_{m},{P}_{n}$

1. Set ${a}_{0}\left(m\right):={m}_{0},{a}_{0}\left(n\right):={n}_{0}$ .

2. For j = 1:k do [k = number of iterations]

a. Let ${\nu}_{j}~N\left(2,0.05\right),{d}_{j}~N\left(2,0.05\right)$ be the changes associated to a single event;

b. Compute ${a}_{j}\left(m\right)=\theta {a}_{j-1}\left(m\right)+\vartheta ,{a}_{j}\left(n\right)=\theta {a}_{j-1}\left(n\right)+\vartheta $ ;

c. Compute ${\alpha}_{m}\left(j\right)=\vartheta {\alpha}_{m}\left(j-1\right)+\theta ,{\alpha}_{n}\left(j\right)=\vartheta {\alpha}_{n}\left(j-1\right)+\theta $ ;

d. Compute ${P}_{m}\left(j\right)=\frac{-{v}_{j}\left({\alpha}_{m}\left(j\right)-{v}_{j}\right)+{\xi}_{1}\left(j\right)}{1+{v}_{j}},{P}_{n}\left(j\right)=\frac{-{d}_{j}\left({\alpha}_{n}\left(j\right)-{d}_{j}\right)+{\xi}_{2}\left(j\right)}{1+{d}_{j}}$ , where ${\xi}_{1}\left(j\right)~Po\left(10\right),{\xi}_{2}\left(j\right)~Po\left(10\right)$ ;

e. Normalize ${P}_{m},{P}_{n}$ .

3. Output ${P}_{m},{P}_{n}$ .

End

5. Simulations and Results

The initial data here is a matrix of randomly generated numbers between one and fifty for mRNAs and between one and forty for proteins. The rows represent the cell numbers and the columns are the number of mRNAs/proteins counted at each time interval. Therefore we have 100 cells (population) and 50 samples taken at a time interval of one unit, and the total time of 50 time units in the entire population; (a unite could be second, minute or hour depending on the experiment). The bar and image pots of the initial data are shown in Figure 1.

Figure 1. These subfigures give plots of the initial randomly generated data with 50 samples for proteins. A sample is the number of protein in cells for a fixed interval of time. This image plot support the presence of various level of noise in the biodynamic system.

5.1. Results

Our results will show various figures related to our solutions. We first plot the variability of the number of protein in cells over time for a sample of 50.

Next, we plot the solutions of (4) over time and explain their relevance for our work. pmf (probability mass function) of the mRNA and proteins in separate graphs for each sample, and further we plot the histograms of the distribution and finally the scatter plot of ${P}_{n}$ against ${P}_{m}$ . Our observations are presented in the caption of each figure.

It can be seen that all probability values are between 0.1 and 0.9 and do not overlap in most of the cases; this is an indication that mRNAs and proteins number may be dynamically dependent, and therefore correlated. Next, we predict the number of mRNAs and proteins ${m}_{j},{n}_{j}$ using a straightforward probabilistic concept which states that “a good value of m (or n) depends on a good guess of p”. The prediction for the number of mRNA and Protein ${m}_{j},{n}_{j}$ (for iteration $j=1,2,3,\cdots ,100$ ) are then given by the following Markov equations.

${m}_{j}=\{\begin{array}{l}{P}_{m}\left(0\right)*{m}_{0};\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{if}\text{\hspace{0.17em}}j=0\\ {P}_{m}\left(j\right)*{m}_{j-1};\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{if}\text{\hspace{0.17em}}j>0\end{array}$ (7)

${n}_{j}=\{\begin{array}{l}{P}_{n}\left(0\right)*{n}_{0};\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{if}\text{\hspace{0.17em}}j=0\\ {P}_{n}\left(j\right)*{n}_{j-1};\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{if}\text{\hspace{0.17em}}j>0\end{array}$ (8)

Leading to the following results for mRNA

5.2. Entropy Distribution

To measure the uncertainty associated with each sample of mRNA or proteins count, we introduce the concept of entropy over a population, which is calculated as follows:

(for mRNAs) $H\left(m\right)=-{\displaystyle \underset{j=1}{\overset{N}{\sum}}p\left({m}_{j}\right)\mathrm{log}\left(p\left({m}_{j}\right)\right)}$ (9)

(for Proteins) $H\left(n\right)=-{\displaystyle \underset{j=1}{\overset{N}{\sum}}p\left({n}_{j}\right)\mathrm{log}\left(p\left({n}_{j}\right)\right)}$ (10)

Computational results are shown in the figures below in the discussion section.

6. Discussion

We have shown (Figure 2) that, one may calculate the distribution of the number of mRNAs and proteins during gene expression, according to our model in Section 3. Based on these distributions (Figure 3 and Figure 4) we were able to predict the number of proteins and mRNAs over time. We use two main assumptions: i) The initial number of mRNA and proteins must be known; and ii) all cells must present similarity (functional, structural, architectural and/or dynamical). Our results show that both the protein and mRNA distributions are typically non-symmetric and may not be unimodal (Figure 5, Figure 6 and Figure 7). Consequently the mean and the mode are significantly different, and

Figure 2. These figures are a recorded solution of the main

Figure 3. Shows scatter plots of the probability distribution of both number of mRNA and proteins over time for four different samples, plotted on the same graph. Each column represents the number of mRNA or proteins in all cells at a specific time period.

Figure 4. Four steps ahead prediction of mRNA numbers in all cells using Equations ((7) and (8)), for j = 1, 2, 3, 4. This result shows a fast decrease of probability values of mRNAs in cells iteration, indicating that mRNAs have a short life time, which is in accordance with biological evidence.

Figure 5. Shows the frequency of proteins related to P_{n} values in four generations. The optimal for proteins is around 0.4 (except generation 3), this indicates that on average 40% of probability level will give a better proteins count over time. The frequency distribution shows in all cases an asymmetric distribution, which indicates protein numbers are not normally distributed.

Figure 6. Shows the online scatter plots of ${P}_{n}$ against ${P}_{m}$ in four different cells. These figures confirm that both processes are strongly correlated over time in each of the cells, indicating that mRNA and protein dynamics (count) are depend over time. This will also hold for m and n since probabilities and mRNA/proteins are correlated.

Figure 7. Shows the plot of entropy distributions over time in a chosen cell. It can be seen that the maximum entropy reached earlier for proteins and that the right tail is also longer in protein compared to that of mRNA. This may suggest that proteins have a longer life time, compared to mRNA. This evidence is in line with biological knowledge. The next section gives some discussion of the results.

the standard deviation is clearly not constant over time. Such distributions are poorly characterized by Gaussian characteristics. This paper was primarily designed to promote a modelling culture among noise biologists, modellers and to cope with the noise source and consequences in cell development.

7. Conclusion

The advantage of counting single molecules (mRNAs or proteins) is that, one obtains the probability distribution of molecules corresponding to each stage of the “central dogma” of molecular biology for each single gene. The mathematical model developed here differs from those that cellular biologists are accustomed to encountering [3] [5] . Instead of having a continuous and deterministic model of kinetic behavior, the mathematics of gene expression may be described by discrete stochastic models that take into account the numbers of molecules involved at both the mRNA and protein levels variability. Figure 7 shows the plot of entropy distributions over time in a chosen cell. We have found that the maximum entropy reached earlier for proteins in comparison to mRNAs, the right tail density is also longer in protein in comparison of that of mRNA. This result clearly suggests that proteins have a longer life time, compared to mRNA. This evidence is in line with biological principles.

Acknowledgements

The authors would like to thank Dr. Sakumura for dedicating his precious time, giving many insightful comments and suggestions. This work was supported by the GCOE International Senior Research Fellowship, NAIST and the Grant of Aid from the Ministry of Education, Culture, Science and Technology (MEXT) Japan. We also thank the Universities of Sorbonne (France) and AUAF for their generous support and collaboration.

References

[1] Elowitz, E., Levine, A., Siggla, E. and Swain, P. (200) Stochastic Gene Expression in a Single Cell. Science, 297, 1183-1186. https://doi.org/10.1126/science.1070919

[2] Fraser, H., Hirsh, A., Glaever, G., Kumm, J. and Eiseen, M. (2004) Noise Minimization in Eukaryotic Gene Expression. PLOS Biology, 2, 834-838.

https://doi.org/10.1371/journal.pbio.0020137

[3] Gillepsie, D.T. (1997) Exact Stochastic Simulation of Coupled Chemical Reactions. The Journal of Physical Chemistry, 81, 2340-2361.

[4] Jimbo, H.C. and Craven, M.C. (2011) Unconstrained Optimization in a Stochastic Cellular Automata System. Journal of Nonlinear Analysis and Optimization, 1, 103- 110.

[5] Mc Adams, H.H. and Arkin, K. (1997) Stochastic Mechanisms in Gene Expression. Proceedings of the National Academy of Sciences of the United States, 94, 814-819.

https://doi.org/10.1073/pnas.94.3.814

[6] Newman, J.R., Ghaemmaghami, S., Ihmels, J., Breslow, D., Noble, M., DeRisi, J. and Weissman, J. (2006) Single-Cell Proteomic Analysis of S. Cerevisiae Reveals the Architecture of Biological Noise. Nature, 441, 840-846.
https://doi.org/10.1038/nature04785

[7] Novick, A. and Weiner, M. (1957) Enzyme Induction as an All-or-None Phenomenon. Proceedings of the National Academy of Sciences of the United States, 43, 533- 566.

https://doi.org/10.1073/pnas.43.7.553

[8] Ozbudak, E., Thattai, M., Grossman, A. and Van Oudenaarden, A. (2002) Regulation of Noise in the Expression of a Single Gene. Nature Genetics, 31, 69-73.
https://doi.org/10.1038/ng869

[9] Raser, J.M. and O’Shea, E.K. (2004) Control of Stochasticity in Eukaryotic Gene Expression. Science, 304, 1811-1814. https://doi.org/10.1126/science.1098641

[10] Thattai, M. and Van Oudenaarden, A. (2001) Intrinsic Noise in Gene Regulatory Networks. Proceedings of the National Academy of Sciences of the United States, 98, 8614-8619. https://doi.org/10.1073/pnas.151588598

[11] Thattai, M. and Van Oudenaarden, A. (2004) Stochastic Gene Expression in Fluctuating Environments. Genetics, 167, 523. https://doi.org/10.1534/genetics.167.1.523

[12] Peccoud, J. and Ycart, B. (1995) Markovian Modelling of Gene Products Synthesis. Theoretical Population Biology, 48, 222-234. https://doi.org/10.1006/tpbi.1995.1027

[13] Pedraza, J.M. and Paulsson, J. (2008) Effects of Molecular Memory and Bursting on Fluctuations in Gene Expression. Science, 319, 339-343.
https://doi.org/10.1126/science.1144331

[14] Rosenfeld, N., Young, J., Alon, U., Swain, P. and Elowitz, M. (2005) Gene Regulation at the Single-Cell Level. Science, 307, 1962-1965.
https://doi.org/10.1126/science.1106914

[15] Shahrezaei, et al. (2008) Colored Extrinsic Noise Fluctuations and Stochastic Gene Expression. Molecular Systems Biology, 4, 1-19.
https://doi.org/10.1038/msb.2008.31

[16] Bengtsson, M., Hemberg, M., Rorsmsn, P. and Stahlberg, A. (2008) Quantification of mRNA in Single Cells and Modelling of RT-qPCR Induced Noise. BMC Molecular Biology, 9, 63. https://doi.org/10.1186/1471-2199-9-63

[17] Schrodinger, E. (1994) What is Life? Cambridge University Press, Cambridge.

[18] Pedraza, J.M. and Van Oudenaarden (2005) Noise Propagation in Gene Networks. Science, 307, 1965-1969. https://doi.org/10.1126/science.1109090