In the standard way (see for definitions), let M be a discrete-time Markov control process with infinite horizon (also called a Markov decision process) and let M̃ be its approximation. We will use the performance criterion (objective function) called the expected total discounted reward. Suppose that the optimal control problem for M̃ has a solution, that is, we can find an optimal policy f̃ for the approximate process M̃. Now, if for some reason (some of these causes are discussed later) it is not possible to find an optimal policy for the original process M, we could use the policy f̃ to control the original process M. The use of such an approximation causes a reduction in the total discounted reward; this reduction is measured by the stability index Δ (see for its definition). The importance of this stability index is that it allows us to calibrate the use of f̃ to control the original process M.
Clearly, if this stability index is very high, it is not advisable to use the approximate optimal policy f̃ to control the process M; on the other hand, if this stability index is low, then the use of this approximation is valid.
In the available literature, both the study and the calculation of the stability index have been carried out from a theoretical standpoint in different ways: through the application of contractive operators, see for example ; through the use of certain ergodicity conditions, see ; and through the use of different probabilistic metrics, see for definitions of the various kinds of probabilistic metrics. For example, in the total variation metric is used, in the Kantorovich metric is used, and in and the Prokhorov metric is used.
The results obtained in all the papers mentioned above are upper bounds for the stability index, expressed as a function of certain parameters and some probabilistic metric, that is,

Δ ≤ C μ, (1)

where C is an explicit constant and μ is a certain probability metric.
Clearly, the discount factor α involved in the optimization criterion also appears in the explicit constant C of inequality (1). Our goal is to determine the behavior of the stability index as a function of (1 − α) when the discount factor tends to 1 (α → 1).
Unlike the theoretical study of the stability index as presented in inequality (1), in this work, the stability index will be studied with a more applied perspective.
In this work, a consumption-investment Markov control process is presented (with expected total discounted reward), for which the stability index is explicitly obtained; we then study its asymptotic behavior when the discount factor tends to 1. These asymptotic evaluations of the stability index are carried out using statistical techniques; as mentioned above, our goal is to measure the sensitivity of the stability index as a function of (1 − α) when α → 1.
To achieve the above, instead of using inequality (1), we will use statistical techniques to estimate the following model:
Δ ≈ C(1 − α)^(−κ), C > 0, α ∈ (0, 1), (2)

where C and κ are the (unknown) parameters of the model, estimable from simulated data of the discount factor α using the simple linear regression technique. From Equation (2), we will say that the stability index is of order κ with respect to (1 − α) and we will express this as Δ = O((1 − α)^(−κ)).
Clearly, for high values of κ the stability index given in Equation (2) tends to increase rapidly, which indicates that it is not optimal to use the approximate policy to control the original process M.
The numerical experiments carried out in this work have the goal of estimating the sensitivity κ of the stability index given in Equation (2) when α → 1. These asymptotic evaluations give us the information needed to answer the question posed above. In the rest of this document, we will refer to this sensitivity κ as the order of Δ, interchangeably.
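The log-linear estimation idea behind Equation (2) can be sketched numerically. The following minimal illustration (not the paper's data) generates a synthetic stability index from the model Δ = C(1 − α)^(−κ) with a known order, then recovers κ by regressing ln Δ on ln(1 − α); the constants C_TRUE and KAPPA_TRUE are illustrative assumptions.

```python
import numpy as np

# Synthetic sketch: recover the order kappa of Delta(alpha) = C*(1-alpha)^(-kappa)
# from the log-transformed model ln Delta = ln C - kappa * ln(1 - alpha).
# C_TRUE and KAPPA_TRUE below are illustrative assumptions, not paper values.
C_TRUE, KAPPA_TRUE = 2.0, 1.75

# 100 discount factors approaching 1 (0.5, 0.505, ..., 0.995).
alphas = 0.5 + 0.005 * np.arange(100)
delta = C_TRUE * (1.0 - alphas) ** (-KAPPA_TRUE)

# Degree-1 least-squares fit of ln(delta) on ln(1 - alpha); the slope is -kappa.
slope, intercept = np.polyfit(np.log(1.0 - alphas), np.log(delta), 1)
kappa_hat, c_hat = -slope, np.exp(intercept)

print(kappa_hat)  # ~1.75 on this noiseless synthetic data
print(c_hat)      # ~2.0
```

With real (noisy) data the fit would not be exact, but the slope of the log-log regression still estimates the order κ.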
As far as our literature review goes, we found no numerical or simulation studies that use statistical techniques to evaluate the order of the stability index with respect to the discount factor.
The results obtained in this work using the simple linear regression technique depend on the value of a parameter involved in the discounted reward function used. Nevertheless, the results show that when α → 1, the stability index as a function of (1 − α) tends to increase rapidly, so it is not recommended to use an approximate optimal policy to control the original model M. The results also suggest that the choice of the value of the parameter used in the reward function, as well as the value of the discount factor, is very important in validating the use of the optimal policy to control M. Among the estimates of κ obtained, the largest value was −1.75, i.e., , although clearly, from Equation (2), it would seem natural that the best possible order should be at most .
Finally, we would like to comment on the reasons why we propose the model given in (2) for the asymptotic study of the stability index.
In and , the stability index is studied under the expected total discounted cost criterion, and the results found are stability inequalities such as the one given in (1). Furthermore, in all cases the constant involved in inequality (1) is an explicit function, inversely proportional to the term (1 − α); for example, in it is found that using the Kantorovich metric, while in it is obtained that with the total variation metric, and shows a result in which using the Prokhorov metric. So, given that in this work a control process is studied under the expected total discounted reward criterion, and based on the aforementioned results, it seems natural to propose the model given in Equation (2) for the asymptotic evaluation of the stability index. In and there are also stability inequalities like the one given in inequality (1), but under the average-cost criterion; however, in those papers the stability index presented an order of , where δ is the ergodicity parameter and .
This work is organized as follows. In Section 2, a brief description of Markov control models (also called Markov decision processes) is presented, as well as some well-known results for the discounted optimal control problem with bounded reward; in Section 2.1, we present the problem of estimating the stability index, together with the assumptions that guarantee the existence of the optimal solution for the original process M and the approximate process, respectively. In Section 3, the control process we work with (consumption-investment) is presented, while in Section 3.1 its stability index is explicitly obtained; in Section 3.2, the results of the asymptotic evaluation of the stability index are presented. Finally, in Section 4, the conclusions of this work are presented as well as some proposed lines of future research.
2. The Discounted Reward Criterion
For a topological space , denotes the Borel σ-algebra generated by the topology τ and measurability will always mean Borel measurability. Moreover, is the class of measurable functions on whereas is the subspace of bounded measurable functions endowed with the supremum norm given as , . The subspace of bounded continuous functions is denoted by . For a subset , stands for the indicator function of , i.e., for and for . A Borel space is a measurable subset of a complete separable metric space endowed with inherited metric.
be the standard Markov control model (see , for definitions). It is thought of as a model of a controlled stochastic process , where the state process takes values in the Borel space and the control process takes values in the Borel space . The controlled process evolves as follows: at each time , the controller observes the system in some state and chooses a control from the admissible control subset , which is assumed to be a Borel subset of . It is also assumed that the set of admissible pairs belongs to . Then, the controller receives a reward , where is a real-valued Borel measurable function defined on . Moreover, the controlled system moves to a new state according to the distribution , where is a stochastic kernel on given , that is, is a probability measure on for each pair , and is a Borel measurable function on for each Borel subset of . Then, the controller chooses a new control, receives a reward, and so on.
Let for and . Observe that a generic element of has the form where for and . A control policy is a sequence where is a stochastic kernel on given satisfying the constraint for all , . Now, let be the class of all measurable functions such that for each . A control policy is said to be (deterministic) stationary if there exists such that the measure is concentrated at for each and . Following a standard convention, the stationary policy π is identified with the selector f. The class of all policies is denoted by Π and the class of all stationary policies is identified with the class .
Let be the canonical sample space and the product σ-algebra. For each policy and “initial” state there exists a probability measure on the measurable space that governs the evolution of the controlled process .
The expected total discounted reward criterion is given as
where the discount factor is fixed and denotes the expectation operator with respect to the probability measure .
The optimal control problem is to find a control policy (if exists) such that
for all .
The policy is called the discounted optimal policy, while is called the discounted optimal value function. Later, we will impose conditions that guarantee the finiteness of the value function and the existence of an optimal policy .
2.1. The Stability Index and the Problem of Its Estimation
The problem of (quantitative) stability estimation (“continuity” or “robustness”) arises when there is uncertainty about the stochastic kernel defined in the standard Markov control model M (see model (3)). The “original” task of the controller consists in the search for the optimal policy (see Equation (5)). In many applications this task cannot be accomplished directly due to any of the following causes:
1) Frequently, or some of its parameters are unknown to the controller, and this stochastic kernel is estimated using statistical procedures. With the results of these estimates, another stochastic kernel is generated, which is interpreted as an accessible approximation to the unknown .
2) There are situations where is known but too complicated to have any hope of solving the control policy optimization problem. In such cases, is sometimes replaced by a “theoretical approximation” , which results in a controllable process with a simpler structure.
We assume that is not available to the controller and it is substituted by a given approximating stochastic kernel , , and . The “approximating” Markov process governed by will be denoted by , i.e., let
be the “approximate” for the Markov control model given in model (3).
Changing for in Equation (4), we get the discounted reward criterion for the approximate process . Now, suppose that it is possible (at least theoretically) to find an optimal policy for process , i.e.,
The control policy defined in Equation (7) is used as the approximation to the optimal non-accessible policy (assuming it exists). In other words, policy is used to control the original process M instead of policy .
The reduction in reward caused by such an approximation is estimated by the following stability index (see ):
Δ(x) := V*(x) − V(x, f̃*), x ∈ X, (8)

where V*(x) is the optimal value function of the original process M and V(x, f̃*) denotes the discounted reward obtained when M is controlled by the approximate optimal policy f̃*.
The stability estimation problem consists of searching for inequalities of the following type:
Δ(x) ≤ C ψ(μ), x ∈ X, (9)

where C is a function with explicitly calculable values; ψ is a real continuous function such that ψ(s) → 0 as s → 0; and μ is a probabilistic metric on the space of probability measures.
The results obtained in  -  provide inequalities as given in inequality (9).
In this paper, we consider a particular example of a Markov control process for which the optimal stationary policies can be explicitly calculated. The explicit form of these stationary policies (for the “original” process M and for the “approximate” process M̃) makes it possible to explicitly calculate the stability index Δ. The goal of this work is to study the asymptotic behavior of Δ when α → 1. Using direct calculations and numerical approximations, we will show that the stability index (see Equation (8)) can be expressed as a function of (1 − α) of order κ, i.e.,
Δ ≈ C(1 − α)^(−κ), C > 0, α ∈ (0, 1), (10)
where the (unknown) parameters and κ will be estimated using statistical techniques, see the analogy with Equation (2).
To finish this section, the assumptions that guarantee the existence of the stationary optimal control policies ( and ) for the optimal control problems given in Equations (5) and (7), respectively, are shown below:
Assumption 2.1. (Existence)
1) The function is bounded by a constant b > 0;
2) is a non-empty compact subset of for each and the mapping is continuous;
3) is a continuous function on ;
4) is weakly continuous on , that is, the mapping
is continuous for each function .
The second set of assumptions guarantees that the discounted reward criterion is both well defined and finite.
Assumption 2.2. (Finiteness)
The following holds for each :
1) The function is bounded by a constant b > 0;
2) is a non-empty compact subset of ;
3) is a continuous function on ;
4) is strongly continuous on , that is, the mapping
is continuous for each function .
For more information see . Now, let denote either or , depending on whether Assumption 2.1 or 2.2 is being used, respectively; then, under either one of Assumptions 2.1 or 2.2, the dynamic programming operator
is a contraction operator from the Banach space into itself with contraction factor α (see ).
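As a concrete illustration of this contraction property, the sketch below runs value iteration on a toy two-state, two-action model (an assumption for the example, not the paper's process): the Bellman operator T v = max_a [r + α P v] contracts with factor α, so the iterates converge to the unique fixed point, the optimal value function.

```python
import numpy as np

# Toy finite MDP (illustrative values, not from the paper).
alpha = 0.9                        # discount factor / contraction factor
r = np.array([[1.0, 0.0],          # r[x, a]: reward in state x under action a
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2],         # P[x, a, y]: transition probabilities
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])

def T(v):
    # Dynamic programming (Bellman) operator:
    # (T v)(x) = max_a [ r(x, a) + alpha * sum_y P(y | x, a) v(y) ].
    return np.max(r + alpha * P @ v, axis=1)

# Value iteration: since ||T u - T w|| <= alpha * ||u - w||, the sequence
# v_{n+1} = T v_n converges geometrically to the unique fixed point V*.
v = np.zeros(2)
for _ in range(500):
    v = T(v)

print(np.max(np.abs(T(v) - v)))    # near 0: v is (numerically) the fixed point
```

The same contraction argument is what guarantees, in the bounded-reward setting of Assumptions 2.1 and 2.2, that the discounted value function exists and is unique.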
Remark 2.3. Under Assumptions 2.1 and 2.2, there is a solution to the optimal control problem given in Equation (5); which is unique and the value function does not depend on the initial state of the process. For a proof, see  or .
3. A Markov Control Consumption-Investment Process and Its Approximation
This example is presented in . Consider the following Markov control process:
Let ; ; , . The dynamics of the “original” process (M) is given by:
, for ; (14)
and for the “approximate” process (M̃)
, for ; (15)
where and are two sequences of independent and identically distributed (i.i.d.) non-negative random variables, with distributions and , respectively. Clearly, and belong to the space of all distributions on .
In this model, is interpreted as the current capital. The amount represents what is invested in assets (such as stocks, bonds, etc.), which generate a profit/loss given by . The rest of the capital is dedicated to consumption, and the satisfaction (or benefit) of this consumption is estimated by the utility function given by , where is a given parameter.
The reward function per unit of time is given by
for ; . (16)
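To make the model concrete, here is a hedged simulation sketch. The dynamics x_{t+1} = a_t(1 + ξ_t) is one common specification of a consumption-investment process and is assumed here because the displayed formulas are not reproduced above; the linear policy a = γx and all numeric values are likewise illustrative, not the optimal selector of Equation (20).

```python
import random

# Hypothetical consumption-investment simulation, ASSUMING the dynamics
# x_{t+1} = a_t * (1 + xi_t): the invested amount a_t earns the random
# return xi_t, while the consumed part x_t - a_t yields the utility
# r(x, a) = (x - a)**p, as in Equation (16).
def discounted_reward(x0, gamma, p, alpha, theta, horizon=2000, seed=1):
    rng = random.Random(seed)
    x, total, disc = x0, 0.0, 1.0
    for _ in range(horizon):
        a = gamma * x                    # amount invested (illustrative policy)
        total += disc * (x - a) ** p     # utility of consumption x - a
        xi = rng.expovariate(theta)      # exponential shock (cf. Assumption 3.2)
        x = a * (1.0 + xi)               # next period's capital
        disc *= alpha                    # accumulate the discount factor
    return total

reward = discounted_reward(x0=10.0, gamma=0.5, p=0.5, alpha=0.9, theta=2.0)
print(reward)  # a single-trajectory estimate of the discounted reward
```

Averaging such trajectories over many seeds would give a Monte Carlo estimate of the expected total discounted reward for the chosen policy.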
Assumption 3.1. (Only for this example)
The i.i.d. random variables , given in Equations (14) and (15) respectively, satisfy the following (for details, see ):
; . (17)
Now, for an “initial” state the optimal control problem (see Equation (5)) for this Markov control consumption-investment process is
analogously for the “approximate” process, we have
where is an “initial” state for the “approximate” process.
Under these conditions, in it is shown that the processes given in Equations (14) and (15) satisfy both Assumptions 2.1 and 2.2 and that the following hold:
1) The optimal stationary policy for Equation (18) is the following selector
, . (20)
2) The value function given in Equation (18) is
, . (21)
3) The optimal stationary policy for Equation (19) is the following selector
, . (22)
Next, we explicitly calculate the stability index for this control process, which we will use to perform the asymptotic evaluations. The following section shows how this calculation is obtained.
3.1. Explicit Calculation of the Stability Index for the Markov Control Consumption-Investment Process
In this section, the stability index ( ) is explicitly calculated for the control consumption-investment process which was presented in the previous section. As was mentioned in the introduction section, the expression that we find for the stability index is a function of the parameters p and , where is the measure of the approximation between the probability distributions and (see Equations (14) and (15)), while p is the parameter involved in the reward function (see Equation (16)).
In economics, this parameter p is associated with elasticity, that is, elasticity measures the percentage change in the consumer’s utility in response to percentage changes in the consumer’s money supply (for more details, see  or  ). For this reason, it is important to measure its effect on the asymptotic behavior of the stability index.
From Equation (16), the possible values for the parameter p lie in the interval (0, 1).
Our goal is to calculate asymptotic evaluations of the stability index when (which would imply that is closer to ) and for extreme values of the range of p; that is, we are interested in values of p close to 0, near the middle of the interval, and close to 1.
Now, we will proceed to calculate the stability index and for this, we will take an “initial” state as well as the following distribution functions to measure the effect of the shock on the processes:
Assumption 3.2. (Only for this example)
We consider that the random variables given in processes (14) and (15), respectively, have exponential distributions with parameters and , respectively, i.e., and with , where the value of measures the approximation between both distributions, .
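A quick Monte Carlo check of this setup can be useful. Under illustrative parameter values (rate θ for the original shock and θ + ε for the perturbed one; the specific numbers below are assumptions for the sketch), one can verify numerically that a moment condition of the type in inequality (17), α · E[(1 + ξ)^p] < 1, holds, and that the two shock distributions are close when ε is small.

```python
import random

# Hedged numerical check (illustrative parameters, not the paper's values):
# for an exponential shock xi with rate theta, estimate E[(1 + xi)^p] by
# Monte Carlo and verify alpha * E[(1 + xi)^p] < 1, which is the kind of
# condition needed for the geometric series in the discounted reward to converge.
def moment_estimate(theta, p, n=200_000, seed=7):
    rng = random.Random(seed)
    return sum((1.0 + rng.expovariate(theta)) ** p for _ in range(n)) / n

theta, eps, p, alpha = 4.0, 0.1, 0.5, 0.85

m_orig = moment_estimate(theta, p)         # E[(1 + xi)^p],  xi  ~ Exp(theta)
m_apprx = moment_estimate(theta + eps, p)  # E[(1 + eta)^p], eta ~ Exp(theta + eps)

print(alpha * m_orig < 1.0, alpha * m_apprx < 1.0)  # both conditions hold here
print(abs(m_orig - m_apprx))  # small: the two shock distributions are close
```

Shrinking ε drives the two moment estimates together, which mirrors the role of ε as the approximation measure between the distributions.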
Under Assumption 3.1 and 3.2, we have
and after some direct calculations,
Similarly, for the perturbed random vector, we have
and since , then from the above equality it follows that
Next, the stability index is calculated.
From Equation (8) we have
The first term on the right-hand side of Equation (25) is given in Equation (21) with . Next, we calculate the second term on the right-hand side of Equation (25): to do this, we substitute the approximate optimal control policy , given in Equation (22), into the reward function of the “original” model given in Equation (18), and we have
The above equation represents the discounted reward obtained when the trajectory of the “original” process given in Equation (14) is controlled by the optimal policy obtained from the “approximate” process given in Equation (15) and the “initial” state is .
Now, since (see  for details), we have
finally, we have
Now, the evolution of the approximate process (see Equation (15)) is represented as follows
If we raise the last equality to the power p, we have
Now if we take the expected value on both sides of the above equality and since the random elements are i.i.d.,
Now, by inequality (17),
Substituting Equation (27) in Equation (26) and after performing some direct calculations, we have
Inequalities (17) guarantee that ; furthermore, since , it is guaranteed that . These two facts together guarantee that .
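The geometric-series step used next is the elementary identity below, written with q standing for the per-period factor derived in Equation (27) (an assumption about the missing notation); the condition αq < 1 is exactly what the preceding paragraph guarantees:

```latex
\sum_{t=0}^{\infty} \alpha^{t} q^{t}
  \;=\; \sum_{t=0}^{\infty} (\alpha q)^{t}
  \;=\; \frac{1}{1-\alpha q},
  \qquad 0 \le \alpha q < 1 .
```

This is what produces the closed-form expression for the discounted reward under the approximate policy.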
Therefore, computing the sum of the geometric series involved in Equation (28), it can be expressed as
Then, to obtain the stability index, Equation (21) with and Equation (29) are substituted in Equation (25) and we obtain
Now, substituting Equation (24) in Equation (30), we have
For each fixed p, a θ value in Equation (23) can be selected such that , so Equation (31) can be written as
The stability index given in Equation (32) is thus a function of the discount factor α, the parameter p of the reward function (see Equation (16)), and the level of approximation between the distributions and (see Assumption 3.2).
3.2. Study of the Asymptotic Evaluations of the Stability Index
The goal of this work is to perform asymptotic numerical estimations of the stability index as a function of ( ), that is, find its order (κ) when , see Equation (10). For this, we will use the result obtained in the previous section of the explicit calculation of the stability index, see Equation (32).
Equation (32) shows that the stability index is a function of p; as mentioned in the previous section, this parameter of the utility function is important in economics since it is related to elasticity. So, to estimate the effect that this parameter has on the stability index, we will select arbitrary values of this parameter such that: 1) values close to zero (which would imply consumers insensitive to monetary change); 2) values near the middle of the interval (average consumers); and 3) values close to 1 (sensitive consumers). For our goal, these values of p give us information about the conditions under which the approximate policy can be used to control the original process M; that is, we want to study whether values of p close to zero (to the middle of the interval, and to 1) in the reward function allow us to use this approach.
Methodology and results obtained. For a fixed value of p in Equation (32) and a given value of , we generate 100 values of α, starting at with increments of 0.005. Then, for each of the 100 generated values of α, the value of (1 − α) is substituted into Equation (32), giving 100 values of the stability index (as a function of (1 − α)). With these 100 values of (1 − α) and the stability index, a simple linear regression is performed to estimate the κ parameter involved in Equation (10); this value is the estimate of the order of the stability index with respect to (1 − α). We are interested in the behavior of the κ estimate when α → 1 and ε → 0.
For example, if , then from Equation (32) we have that the stability index is expressed as
Now, recalling that the values of represent the measure of the approximation between the distributions and (see Assumption 3.2), let us assume and substitute it into Equation (33); we have
Now, we generate 100 values of and then substitute (1 − α) into Equation (34); 100 values of the stability index are generated, as shown in Figure 1.
Remark 3.3. In Figure 1, the stability index given in Equation (34) is represented as delta, that is, , and the measure is called epsilon.
From Figure 1, we can see that the stability index grows rapidly when α → 1; that is, it is very costly to use the optimal policy of the approximate process given in Equation (22) to control the original process given in Equation (14).
Figure 1. Scatterplot generated by 100 data points of stability index obtained from Equation (34).
On the other hand, to obtain the asymptotic evaluations of the stability index when , that is, the estimation of the κ parameter that appears in Equation (10):
Δ ≈ C(1 − α)^(−κ), C > 0, α ∈ (0, 1),
we will proceed to estimate the following simple linear regression model:
ln Δ = ln C − κ ln(1 − α) + ε, (35)
where is white noise (see for the definition), and and κ are the parameters to be estimated from the 100 generated data points represented in Figure 1. The results of the regression estimation given in Equation (35) are shown below:
Regression Analysis: ln(delta) versus ln(1-alpha)
Therefore, from the above results we have , and from Equation (10) it can be concluded that the asymptotic estimate of the stability index when is ; that is, the sensitivity of the stability index with respect to (1 − α) is
On the other hand, the estimation of this asymptotic evaluation of κ will be better when the approximation of the distribution is closer to the distribution (see Assumption 3.2), that is
if , then (and so ). (37)
To see the above, given the fixed value of , we proceeded to replicate the estimates of κ given in Equation (35) for .
For and , from Equation (33) we have the following stability index
For the same 100 values of (1 − α) and using the above equation, another 100 values of the stability index were generated, which are presented in Figure 2.
From Figure 2, we observe that for and (when α → 1) it remains very costly to use the optimal policy of the approximate process given in Equation (22) to control the original process given in Equation (14); however, the stability index is reduced, owing to the greater precision of in the approximation of the distribution to the distribution .
Figure 2. Scatterplot generated by 100 data points of stability index obtained from Equation (38).
Now, with this new 100 data from Figure 2, the κ parameter is re-estimated in the simple linear regression model given in Equation (35). The results obtained are the following:
Regression Analysis: ln(delta) versus ln(1-alpha)
The results show that , and we obtain that the stability index has order −2.156 with respect to (1 − α), that is, .
Now, to investigate the asymptotic behavior of this sensitivity κ, we make the approximation between the probability distributions better and better, i.e., .
So, analogously to what has already been explained, the results for and are presented in Figures 3-5.
The five figures above show that for fixed α, when tends to zero (which implies that approaches ), the stability index tends to zero (see the y-axis labels).
In the last two figures, observing the y-axis labels, it is clear that when epsilon tends to zero, the stability index also tends to zero. This implies that the better the approximation between the distribution functions, the more safely the approximate optimal policy can be used to control the original process.
Now, for each group of 100 data generated in each of the five graphs, the κ parameter involved in the simple linear regression model given in Equation (35) was estimated. The results obtained from these estimates are presented in Table 1,
Figure 3. Scatterplot generated by 100 data points of stability index obtained from Equation (33) with .
Figure 4. Scatterplot generated by 100 data points of stability index obtained from Equation (33) with .
Figure 5. Scatterplot generated by 100 data points of stability index obtained from Equation (33) with .
Figure 6. Results of the association of the stability index and the approximation measure in the probability distributions (epsilon).
Figure 7. Magnification of Figure 6, when alpha approaches to 1.
Table 1. Asymptotic evaluation of the stability index ( ).
note that the first two cases correspond to the results that have been explained in previous pages.
In Table 1, the green cell shows the best approximation used between the distribution functions (see Assumption 3.2) with which the numerical estimate for the asymptotic evaluation of the stability index was found, which is shown in the skyblue cell.
Based on the results of Table 1, we can conclude that for , when and , the asymptotic evaluation of the stability index is , i.e., the stability index has an order .
To study the sensitivity of the stability index ( ), numerical experiments were carried out for other values of p. Each of these p values was substituted into Equation (32), and the stability indices ( ) were obtained as functions of α, as shown in Table 2.
Then, for each fixed value of p given in Table 2, we use ; subsequently, for each pair of fixed p and , 100 values of (1 − α) were generated and substituted into the formulas of Table 2, obtaining 100 values of the stability index as a function of (1 − α); finally, these 100 pairs of (1 − α) and the stability index were used for the asymptotic evaluation of the stability index through the estimation of the κ parameter involved in the simple linear regression model given in Equation (35). The results obtained from these numerical estimates are presented in Table 3.
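The tabulation procedure just described can be sketched as a nested loop over (p, ε) pairs. Since the closed form of Equation (32) is not reproduced here, the sketch uses a placeholder index Δ(α) = ε(1 − α)^(−(1+p)), purely an assumption to make the code run; only the looping and estimation logic mirror the methodology.

```python
import numpy as np

# PLACEHOLDER stability index, an assumption for this sketch only:
# Delta(alpha) = eps * (1 - alpha)**(-(1 + p)).  The paper's Equation (32)
# would be substituted here.  For each (p, eps) pair we generate 100 alphas,
# compute Delta, and estimate the order kappa by log-log regression.
def estimate_kappa(delta_fn, alphas):
    x = np.log(1.0 - alphas)
    y = np.log(delta_fn(alphas))
    slope, _ = np.polyfit(x, y, 1)   # slope of ln Delta vs ln(1 - alpha)
    return -slope                    # kappa is minus the slope

alphas = 0.5 + 0.005 * np.arange(100)        # 100 discount factors
rows = []
for p in (0.1, 0.5, 0.9):
    for eps in (0.1, 0.01, 0.001):
        delta_fn = lambda a, p=p, eps=eps: eps * (1.0 - a) ** (-(1.0 + p))
        rows.append((p, eps, estimate_kappa(delta_fn, alphas)))

for p, eps, kappa in rows:
    print(f"p={p:.1f}  eps={eps:.3f}  kappa_hat={kappa:.3f}")
```

For the placeholder formula the estimated order is 1 + p regardless of ε, which illustrates why, in the paper's tables, κ is reported as a function of p once the approximation level is fixed.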
Table 2. Explicit expression of the stability index for different values of the reward function parameter (p).
Table 3. Asymptotic evaluation of the stability index for different values of the reward function parameter (p).
Remark 3.4. In Table 3, for values of , the speed with which the stability index tends to infinity is greater, so it is not possible to obtain values of . Thus, for example, the results shown in Table 3 for were obtained with data, while for , they were obtained with data. The results presented for the rest of the p values in Table 3 were obtained with 100 data points.
Discussion of results. The motivation for studying discounted reward (cost) problems is primarily economic. Capital accumulation processes of an economy, inventory problems, inventory management, and portfolio management are applications of this type of optimization criterion. The reward function used in this work (see Equation (16)) is widely used in economics; it belongs to the family of consumer utility functions, specifically the so-called Cobb-Douglas utility functions (see for definitions), so the selection of the parameter p in Equation (16) must be made very carefully. The results obtained in this work on the asymptotic evaluations of the stability index (presented in Table 3) are interpreted as follows:
1) If . That is, if the parameter p of the reward function approaches 1, then the sensitivity of the stability index grows indefinitely. Therefore, for values of , the use of an approximate policy to control the original process is not recommended; this is because the results show (see Table 3) that for we have , which is why the stability index can be as large as .
2) If . In this case, the results obtained (see Table 3) suggest that if values of are selected in the reward function given in Equation (16), then it would seem reasonable to use the approximate policy to control the original process M.
3) If . In this case, the results obtained in this work using statistical techniques are the same as those found in articles and , which use upper bounds such as the one given in Equation (1).
Remember that by definition we have (see Equation (16)). Now, from the three previous points, the results obtained show that for extreme values of p (close to zero or one) it is not recommended to use an approximate policy to control the original process. The results suggest selecting a p value close to the middle of the interval ( ) in the reward function in order to use such an approximation.
Despite the extensive literature that exists on the subject of Markov control processes, there are few works on estimating the stability index. The study of stability for control processes represents a challenge, both from a theoretical and an applied point of view. In this applied work, we intend to contribute to the study of stability using statistical techniques instead of probabilistic metrics. The limitations of this work are the use of a simple Markov control process as well as the use of an exponential distribution function to measure the shock effect of the process. However, the numerical estimates found are consistent and show the impact on the sensitivity of the stability index of changes in both the discount factor and the parameter in the reward function; the results obtained respond favorably to the original question posed in the introduction, so we can conclude that the objective of this work was achieved. Finally, we recommend strengthening the results found in this work through some of the following future investigations: 1) using more complex Markov control processes; 2) validating the robustness of the results using other types of distribution functions to measure the shock effect of the process; 3) using other types of reward functions; and 4) using other statistical techniques for the asymptotic estimation of the stability index.
The author wishes to thank referees for valuable suggestions on improvement of the previous version of the paper.
 Gordienko, E.I. (1992) An Estimate of the Stability of Optimal Control of Certain Stochastic and Deterministic Systems. Journal of Soviet Mathematics, 59, 891-899.
 Gordienko, E.I. and Salem, F.S. (1998) Robustness Inequalities for Markov Control Processes with Unbounded Cost. Systems & Control Letters, 33, 125-130.
 Gordienko, E.I. and Yushkevich, A.A. (2003) Stability Estimates in the Problem of Average Optimal Switching of a Markov Chain. Mathematical Methods of Operations Research, 57, 345-365.
 Gordienko, E.I., Lemus-Rodriguez, E. and Montes-de-Oca, R. (2008) Discounted Cost Optimality Problem: Stability with Respect to Weak Metrics. Mathematical Methods of Operations Research, 68, 77-96.
 Gordienko, E., Martínez, J. and Ruiz de Chávez, J. (2015) Stability Estimation of Transient Markov Decision Processes. In: Mena, R.H., Pardo, J.C., Rivero, V. and Bravo, G.U., Eds., XI Symposium on Probability and Stochastic Processes, Mexico, 18-22 November 2013, 157-176.
 Montes-de-Oca, R. and Salem-Silva, F. (2005) Estimates for Perturbations of Average Markov Decision Process with a Minimal State and Upper Bounded by Stochastically Ordered Markov Chains. Kybernetika, 41, 757-772.