Early successful applications of the Markov decision process include the Postal Service Company's catalogue information system and its inventory and supply-chain management problems. Later, many real-life problems, such as sequential assignment, machine maintenance, and secretary problems, were modeled as dynamic Markov decision processes (MDPs) and solved effectively within this framework.
This paper considers the variance optimization problem for the average reward in continuous-time Markov decision processes. The state space is assumed to be countable and the action space a Borel measurable space. The main purpose of this paper is to find the policy with minimal variance within the class of deterministic stationary policies, which differs from the mean-variance criterion problem. Studies of the mean-variance criterion are generally based on the discounted or the average criterion. In the MDP literature, many studies focus on expected-reward optimization over a finite horizon, the discounted MDP over an infinite horizon, and the average-reward problem over an infinite horizon. The optimality equation is established, the existence of an optimal policy is then proved, and finally a policy-iteration-type algorithm is used to solve the MDP. However, in real life the optimal policy for such an unconstrained optimization problem is often not unique, for example in queueing systems and network problems, so we introduce the variance to select among optimal policies.
Variance is an important performance metric of stochastic systems. In financial engineering, the mean measures the expected return and the variance measures the risk. The mean-variance portfolio problem can be traced back to Markowitz. In subsequent studies of Markowitz's mean-variance portfolio problem, the decision maker's expected reward is often assumed to be a constant: the investor chooses a policy with a given expected return that minimizes the risk, so the Markowitz mean-variance portfolio model simultaneously maximizes return and minimizes risk. However, since the given expected return may not be maximal, a policy that is optimal in Markowitz's sense may not be optimal in the usual sense of variance minimization problems for MDPs. Moreover, more and more real-life situations, such as queueing systems and networks, are described by MDPs rather than stochastic differential equations, so Markowitz's mean-variance portfolio problem should be extended to MDPs. For the mean-variance problem of MDPs, one typically seeks a variance-optimal policy over the set of policies whose average or discounted reward is optimal, so the variance criterion can be transformed into an equivalent average or discounted criterion. When the mean criterion is not optimal, however, it is not clear how to develop a policy iteration algorithm to solve the problem. For discrete time, the discounted and long-run average variance criteria have been studied; these works consider the variance optimization problem without constraining the mean. For continuous time, the variance of the average expected reward has been defined for deterministic stationary policies. The finite-horizon expected reward is defined as below.
The variance of f is given by:
However, the variance function of the average expected reward for the continuous-time case considered in this paper is given by
The long-run expected average reward is defined as below.
The main work of this paper is to develop a policy iteration algorithm for the optimal policy under the variance (minimum-variance) criterion on a countable state space with a Borel measurable action space. On a countable state space the reward function may be unbounded, so the expected average reward may not be finite. To guarantee its finiteness, we impose Assumption 1 below. Next, we use the unique invariant probability measure of the Markov chain to represent the average expected reward and the variance; to this end, we impose Assumptions 2, 3, and 4, and suppose throughout that Assumptions 1, 2, 3, and 4 are satisfied. Under the variance criterion, we define the cost function , where is the system reward at the current stage with state i and action a, and is the expected average reward. Obviously, the cost is affected by future actions, and so is the cost function. Traditional MDPs differ from this: there, the cost function and the state transition probabilities depend only on the current state and the action selected at the current stage. Therefore, the conclusions in do not apply to this model. In this paper, we define a pseudo-variance , where is a given constant. Obviously, the value of the pseudo-variance at the current stage is not affected by future actions; it depends only on the current state and current action, so the pseudo-variance minimization problem is a standard MDP. We prove the relation between the variance and the pseudo-variance. Unlike the literature , we define the deviation (bias) of a deterministic stationary policy f for the continuous-time MDP, prove that the deviation function and the objective function satisfy a Poisson equation, and prove the uniqueness of its solution. Based on this, we develop a policy iteration algorithm for the continuous-time MDP to obtain the optimal policy, and we prove its convergence.
2. Model and Optimization Criteria
The control model associated with the continuous-time MDP that we are concerned with is the five-tuple
1) A denumerable set S, called the state space, which is the set of all the states of the system under observation.
2) A Borel space A, called the action space. Let
be the set of all feasible state-action pairs.
3) The transition rates which satisfy for all and . Moreover, we assume that the transition rates are conservative, i.e.,
and stable, which means that
where for all . In addition, is measurable in for each fixed .
4) A measurable real-valued function on K, called the reward function, which is assumed to be measurable in for each fixed .
The above model is a classical continuous-time MDP model . In MDPs, the policy classes include randomized Markov policies, randomized stationary policies, and deterministic stationary policies. This paper considers only minimizing the variance over the class of deterministic stationary policies, so we introduce only the definition of a deterministic stationary policy.
Definition 1. A deterministic stationary policy is a function such that is in for all . A deterministic stationary policy is simply referred to as a stationary policy.
Let F be the set of all deterministic stationary policies.
For each , the associated transition rates are defined as
the reward function is given by
Under Assumption 1, the transition function is regular  .
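For concreteness, the induced quantities of a stationary policy can be computed on a finite truncation of the model. The sketch below is illustrative only: `q[a]` and `r` are hypothetical finite arrays standing in for the transition rates q(j|i, a) and the reward function r(i, a) on K.

```python
import numpy as np

def induced_model(q, r, f):
    """Build the generator Q_f and reward vector r_f induced by a
    deterministic stationary policy f, where f[i] is the action chosen
    in state i, q[a] is the (conservative) rate matrix of action a,
    and r[i, a] is the reward of the state-action pair (i, a)."""
    n = len(f)
    Qf = np.array([q[f[i]][i, :] for i in range(n)])
    rf = np.array([r[i, f[i]] for i in range(n)])
    return Qf, rf
```

Each row of Q_f sums to zero, reflecting the conservativeness of the transition rates.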
Assumption 1.
a) There exist a nondecreasing function , on S, and constants , and , such that
for all , where .
b) for all , with and as in (2.4).
c) , for all , with some .
d) The action set is compact for each , and the functions , and are all continuous in for each fixed .
e) There exists a nonnegative function on S and constants , and such that
for all , with K and as in (2.2) and (2.4), respectively.
For each and an arbitrary initial state , there is a unique probability space , where the probability measure is determined by f and ; the corresponding expectation operator is denoted . We define the expected average reward and the variance, respectively, as follows.
Remark 1. From (2.9), we can see that the definition of differs from that of the average-reward variance criterion for continuous-time MDPs in ( , Chapter 10): here the value of the cost function is affected by future actions, so this is not a standard MDP optimization problem.
We first introduce some notation. For a measurable function on S, we define the -weighted supremum norm of a real-valued measurable function on S by
and the Banach space . Similarly, we can define .
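On a finite truncation of S, the ω-weighted supremum norm is straightforward to evaluate; the following sketch assumes `omega` is a weight function with ω ≥ 1, represented as an array.

```python
import numpy as np

def weighted_sup_norm(u, omega):
    """||u||_omega = sup_i |u(i)| / omega(i).  A function u belongs to
    the Banach space B_omega(S) exactly when this value is finite."""
    return float(np.max(np.abs(u) / omega))
```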
We will use the Markov chain invariant measure to represent Equation (2.8) and Equation (2.9). To this end, we impose the following three Assumptions (see  ).
Assumption 2. For each , the corresponding Markov process with transition function is irreducible, which means that, for any two states , there exists a set of distinct states such that
Under Assumptions 1(a) and 2, for each , Propositions C.11 and C.12 yield that the Markov chain has a unique invariant probability measure, denoted by , which satisfies that (independent of ) for all . Thus, by  , we have
which shows that the -expectation of (i.e., ) is finite. Therefore, for all and , the inequality for all gives that the expectation
exists and is finite.
Assumption 3. With as in Assumption 1, assume that the following conditions hold:
a) There exist constants , such that
for all .
b) There exists a nonnegative function on S and constants and , such that
for all , with K and as in (2.2) and (2.4).
Assumption 4.
a) The control model (2.1) is uniformly ω-exponentially ergodic, which means the following: there exist constants , and such that
for all , , and .
b) The control model (2.1) is uniformly -exponentially ergodic, with the definition as in (a).
Remark 2. Under the above assumptions, it follows from the literature that, for a given , the average reward and the variance defined by Equations (2.8) and (2.9) are both constants, independent of the initial state. They can be represented as expectations with respect to the invariant measure , such that
We denote by the S-dimensional column vector with elements and by the S-dimensional column vector with elements ,
where . (2.19)
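In the finite-state case, the representations of Remark 2 can be evaluated directly. The sketch below assumes the representations η(f) = π_f(r_f) and σ²(f) = π_f((r_f − η(f))²) with π_f the invariant probability measure, consistent with the cost function described in the introduction.

```python
import numpy as np

def average_reward_and_variance(Qf, rf):
    """Compute eta(f) and sigma^2(f) from the invariant probability
    measure pi_f, obtained by solving pi Qf = 0 with pi summing to 1
    (a finite-state stand-in for the unique i.p.m. of Assumption 2)."""
    n = Qf.shape[0]
    A = np.vstack([Qf.T, np.ones(n)])      # stack pi Qf = 0 and sum-to-1
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    eta = pi @ rf                          # average reward
    var = pi @ (rf - eta) ** 2             # variance of the reward
    return eta, var
```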
Our optimization goal is to select a policy that satisfies the following condition:
By (2.20), the variance minimization problem of Markov chains can be defined as below.
Remark 3. From (2.21), we see that the value is affected by future actions, so problem (2.21) differs from a standard MDP.
Even if we regard as a cost function, we cannot directly apply existing conclusions to obtain the optimal policy.
3. Analysis and Optimization
In this section, we define a pseudo-variance minimization problem. By proving the relation between the pseudo-variance and the variance, the optimization problem (2.21) is transformed into the pseudo-variance optimization problem. Furthermore, the optimal policy for the variance optimization problem can be derived via the policy iteration algorithm for the pseudo-variance optimization problem, and we give a sufficient condition for a variance-optimal policy.
3.1. Pseudo-Variance Minimization
We define a new cost function as below.
where is a given constant. We denote by the S-dimensional column vector with elements , and we have
where denotes the S-dimensional column vector with all elements equal to 1. We define the pseudo-variance function as below.
Obviously, we have
, when .
The pseudo-variance minimization problem of Markov chains can be defined as below.
From (3.4), we can see that is an instantaneous cost that does not depend on future actions. Below, we study the relation between the two problems (2.21) and (3.4). First, we have the following lemma on the relation between and .
Lemma 1. For all , the corresponding variance and pseudo-variance have the following relation
Proof: From (3.1) and (3.3)
The lemma is proved.
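The relation of Lemma 1 can be checked numerically on a finite example. The sketch below assumes the squared-deviation reading of the cost, c_λ(i, a) = (r(i, a) − λ)², so the pseudo-variance is the invariant-measure expectation of this instantaneous cost; at λ = η(f) it coincides with the variance.

```python
import numpy as np

def pseudo_variance(Qf, rf, lam):
    """eta_lambda(f) = pi_f((r_f - lam)^2): the invariant-measure
    expectation of the instantaneous cost c_lambda(i, f(i))."""
    n = Qf.shape[0]
    A = np.vstack([Qf.T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(pi @ (rf - lam) ** 2)
```

For λ ≠ η(f), the pseudo-variance is an ordinary average-cost criterion, unaffected by future actions.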
Below we discuss how to solve the pseudo-variance minimization problem. Because (3.4) is a traditional MDP optimization problem, it can be solved with the policy iteration algorithm. Before applying the algorithm to problem (3.4), we need to prove the existence of a pseudo-variance optimal policy. Supposing that Assumptions 1, 2, 3, and 4 are all satisfied, we give the following theorems and lemmas.
Theorem 1. A pair is said to be a solution to the pseudo-variance of average-reward optimality equation if
Lemma 2. Suppose that Assumptions 1, 2, 3, and 4 are satisfied. Consider an arbitrary fixed state . Then, for all and discount factors , the relative differences of the discounted-reward function , namely,
are uniformly ω-bounded in and . More precisely, we have
where . (3.9)
Proof: Lemma 2 follows from the literature .
Theorem 2. Suppose that Assumptions 1, 2, 3, and 4 hold. Then:
There exists a solution to pseudo variance of average-reward optimality equation. Moreover, the constant coincides with the optimal average reward function , i.e.
Proof: Our assumptions ensure the existence of a policy attaining the minimum in the pseudo-variance average-reward optimality equation, that is,
Therefore, Proposition 7.3 of the literature  gives .
As a consequence, for every ; moreover, is an optimal policy for the pseudo-variance.
With the existence of the pseudo-variance optimal policy guaranteed, we use the policy iteration algorithm to obtain it.
Supposing that Assumptions 1, 2, 3, and 4 hold, we give the following concepts.
Definition 2. We define the bias of f as
Assumption 4 gives
So, is finite and in . Moreover, the bias is uniformly bounded in the -norm.
Next we introduce the Poisson equation, which is one of the main results of this paper.
Theorem 3. Let . We say that a pair is a solution to the Poisson equation for if
Proof: Our assumptions (in particular, Assumptions 1 and 4) allow us to interchange the sums and integrals in the following equations:
The theorem is proved.
Finally, we prove the uniqueness of the solution of the Poisson equation.
Theorem 4. For every , the solutions to the Poisson equation for f are of the form with z any real number. Moreover, is the unique solution to the Poisson equation
Proof: Suppose that and are two solutions to the Poisson equation. Transforming both sides of Equation (3.14) simultaneously,
Because is also the solution of the Poisson equation, therefore
Subtracting (3.16) from (3.15) gives
Letting , it follows from Assumption 4 that , showing that the functions and differ by the constant .
It remains to show that .
Remark 4. Given , we can determine the gain and the bias of f by solving the following system of linear equations. First, determine the invariant probability measure (i.p.m.) as the unique nonnegative solution (by Proposition C.12) to
Then, as a consequence of the literature , the gain
and the bias of f form the unique solution to the system of linear equations
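For a finite state space, the system of Remark 4 reduces to ordinary linear algebra. The sketch below assumes the continuous-time Poisson equation in the form Q_f h = g·1 − r_f, normalized by requiring the invariant-measure expectation of h to vanish.

```python
import numpy as np

def gain_and_bias(Qf, rf):
    """Solve for the gain g and bias h of a policy with generator Qf
    and reward rf: first the i.p.m. pi, then Qf h = g*1 - rf with the
    normalization pi @ h = 0 picking out the unique bias."""
    n = Qf.shape[0]
    A = np.vstack([Qf.T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)    # invariant measure
    g = pi @ rf                                   # gain
    B = np.vstack([Qf, pi])
    c = np.append(g - rf, 0.0)
    h, *_ = np.linalg.lstsq(B, c, rcond=None)     # normalized bias
    return g, h
```

By Theorem 4, any other solution of the Poisson equation differs from this h by a constant.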
Proposition 1: Policy iteration algorithm.
Step 1. Choose an arbitrary initial policy from .
Step 2. (Policy evaluation) Determine the pseudo-variance and the bias of the stationary policy as in Remark 4.
Step 3. (Policy improvement) Choose an improved policy such that
Step 4. If , the iteration stops and is the pseudo-variance optimal policy; otherwise, replace f with and return to Step 2.
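The four steps of Proposition 1 can be sketched for a finite-state, finite-action model. This is illustrative only: `cost[i, a]` plays the role of the instantaneous pseudo-variance cost c_λ(i, a), `q[a]` is a hypothetical finite rate matrix per action, and the improvement step minimizes c(i, a) + Σ_j q(j|i, a) h(j) statewise, the standard improvement rule for average-cost continuous-time MDPs.

```python
import numpy as np

def policy_iteration(q, cost, max_iter=100):
    """Policy iteration for a finite-state, finite-action average-cost
    continuous-time MDP.  q[a] is the rate matrix of action a and
    cost[i, a] the instantaneous cost of the pair (i, a)."""
    n_states, n_actions = cost.shape
    f = np.zeros(n_states, dtype=int)              # Step 1: arbitrary f
    for _ in range(max_iter):
        # Step 2 (policy evaluation): gain g and bias h of f
        Qf = np.array([q[f[i]][i, :] for i in range(n_states)])
        cf = cost[np.arange(n_states), f]
        A = np.vstack([Qf.T, np.ones(n_states)])
        b = np.zeros(n_states + 1); b[-1] = 1.0
        pi, *_ = np.linalg.lstsq(A, b, rcond=None)
        g = pi @ cf
        B = np.vstack([Qf, pi])
        h, *_ = np.linalg.lstsq(B, np.append(g - cf, 0.0), rcond=None)
        # Step 3 (policy improvement): minimize c(i,a) + sum_j q(j|i,a) h(j)
        scores = np.array([[cost[i, a] + q[a][i, :] @ h
                            for a in range(n_actions)]
                           for i in range(n_states)])
        f_new = scores.argmin(axis=1)
        # Step 4: stop when the policy no longer changes
        if np.array_equal(f_new, f):
            return f, g
        f = f_new
    return f, g
```

On finite models the loop terminates after finitely many improvements, matching case 1) of Proposition 2.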
Proposition 2: Convergence of the policy iteration algorithm.
Suppose Assumptions 1, 2, 3, and 4 hold. Let be an arbitrary initial policy and let be the sequence of policies obtained from the policy iteration algorithm. Then one of the following results holds:
1) After a finite number of policy iterations, the algorithm converges to a pseudo-variance average-reward optimal policy.
2) As , the sequence converges to the optimal average-reward value .
3.2. Variance Minimization
The minimum pseudo-variance problem has been solved. The following theorem shows that when the pseudo-variance attains its minimum, the variance is also minimized.
Theorem 5. For any policy , we compute by (2.8) and set . If we obtain an improved policy such that , then we have ; if , then .
Proof: By Lemma 1, we have
Subtracting (3.21) from (3.22) and substituting , we obtain
Obviously, if , then ; if , then .
By Theorem 5, when the pseudo-variance attains its minimum, the variance also attains its minimum. This yields a sufficient condition for the variance minimization problem.
4. An Example
In this section, we give an example to illustrate the conclusions of this paper.
Example 1 (Control of queue systems)
The system state indicates the number of customers waiting at the moment (including the one being served); the arrival rate is fixed, while the service rates can be controlled. When the system state is , the decision-maker takes an action a from the admissible action set . When the system is empty, we may impose that . For each , let with constants , , so that the action may increase or decrease the service rate. This action incurs a cost . In addition, suppose that each arriving customer brings a benefit ; then the net income of the system is
This is a continuous-time MDP model; the corresponding transition rates are given as follows.
For each ,
For all ,
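The birth-death structure of the transition rates (4.2)-(4.3) can be illustrated on a finite truncation. Since the exact rate expressions are not reproduced here, the sketch uses an illustrative stand-in in which the action shifts the base service rate μ; `n` is a hypothetical truncation level.

```python
import numpy as np

def queue_generator(n, lam, mu, f):
    """Truncated birth-death generator for the controlled queue.
    Arrivals occur at the fixed rate lam; in state i > 0 service
    occurs at rate mu + f[i], with the action f[i] increasing or
    decreasing the base rate mu (an illustrative stand-in for the
    elided rate form of (4.2)-(4.3))."""
    Q = np.zeros((n, n))
    for i in range(n):
        if i + 1 < n:
            Q[i, i + 1] = lam              # one more customer arrives
        if i > 0:
            Q[i, i - 1] = mu + f[i]        # one customer is served
        Q[i, i] = -Q[i].sum()              # conservative: rows sum to 0
    return Q
```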
Our goal is to establish the existence of a variance optimal policy. To this end, we consider the following assumptions:
D2. The function is continuous in for each fixed , and for all , for some constant .
Proposition 3: Under conditions D1 and D2, the above controlled queueing system satisfies Assumptions 1, 2, 3, and 4. Therefore, there exists a variance optimal stationary policy.
Proof: Let , , , for all . Then, from (4.2) and (4.3), we have
Moreover, for all
which verifies Assumption 1(a).
On the other hand, by (4.2)-(4.3), we have
So Assumption 1(b) follows.
By (4.1) and D2, we have , for all , which implies Assumption 1(c).
By (4.2)-(4.3) and D2, we see that Assumption 1(d) holds.
To verify Assumption 1(e), let
, for each (4.7)
Then by (4.2)-(4.3) we have
which imply Assumption 1(e) with
Obviously, Assumption 2 follows from the description of the model.
We verify Assumption 3: by D1 and (4.2)-(4.3), for all , , we have
(there exists , whose existence is guaranteed by (D1))
(there exists , whose existence is guaranteed by (D1))
Let , as above, , and ; by (4.10)-(4.14), we see that Assumption 3 holds.
Finally, we verify Assumption 4: by (4.2)-(4.3) we have, for each fixed ,
which, together with Proposition C.16 of , implies that the corresponding Markov process is stochastically ordered. Thus, Assumption 4(a) follows from Proposition 7.6. Similarly, Assumption 4(b) is established.
Assumptions 1, 2, 3, and 4 all hold, so the proposition is proved.
5. Discussion and Conclusion
This article defines the variance optimization problem for continuous-time Markov decision processes, which differs from the mean-variance optimization problems studied previously. By defining the pseudo-variance, the deviation (bias) of a deterministic stationary policy f, and the Poisson equation, we prove the existence of a variance optimal policy in the space of deterministic stationary policies and give a policy iteration algorithm to compute the optimal policy. Finally, we prove the convergence of the policy iteration algorithm.