OJS  Vol.9 No.2 , April 2019
Variance Optimization for Continuous-Time Markov Decision Processes
Author(s) Yaqing Fu
ABSTRACT
This paper considers the variance optimization problem of average reward in continuous-time Markov decision process (MDP). It is assumed that the state space is countable and the action space is Borel measurable space. The main purpose of this paper is to find the policy with the minimal variance in the deterministic stationary policy space. Unlike the traditional Markov decision process, the cost function in the variance criterion will be affected by future actions. To this end, we convert the variance minimization problem into a standard (MDP) by introducing a concept called pseudo-variance. Further, by giving the policy iterative algorithm of pseudo-variance optimization problem, the optimal policy of the original variance optimization problem is derived, and a sufficient condition for the variance optimal policy is given. Finally, we use an example to illustrate the conclusion of this paper.

1. Introduction

The Postal Service Company’s catalogue information system, inventory issues, and supply chain management issues are all early successful applications of the Markov decision process. Later, many real-life problems, such as sequential assignments, machine maintenance issues, and secretarial issues, can be described as dynamic Markov Decision Processes (MDP) model. They are finally solved very well by MDP [1] [2] .

This paper considers the variance optimization problem of average reward in continuous-time Markov decision process. It is assumed that the state space is countable and the action space is Borel measurable space. The main purpose of this paper is to find the policy with the minimal variance in the deterministic stationary policy class, which is different from the mean-variance criterion problem. The study of the mean-variance criterion problem is generally based on the discount criterion or the average criterion. In the literature of MDPs, many studies focus on the problem of expected reward optimization in finite stage, the discounted MDP in infinite stage and the average reward problem in infinite stage [2] [3] . By establishing the optimal equation, then the existence of optimal policy is proved, and finally the policy iteration type algorithm is used to solve the MDP problem. However, in real-life, the optimal criteria of this unconstrained optimization problem are often not unique, such as queuing system and network problems. So we introduce variance to choose the optimal strategy.

Variance is an important performance metric of stochastic systems. In financial engineering, we use the mean to measure the expected return, and the variance to measure the risk. The mean-variance problem of the portfolio can be traced back to Markowitz [4] . Then the Markowitz’s mean-variance portfolio problem has been studied [5] - [13] , the decision maker’s expected reward is often assumed to be a constant, and then the investor chooses a policy with a given expected return to minimize this risk, we can see that the Markowitz mean-variance portfolio model is a model of maximization of return and minimization of risk. However, given expected return which may not be maximal, an optimal policy in Markowitz’ mean-variance portfolio may not be optimal in the usual sense of variance minimization problems for MDPs. Moreover, more and more real-life situations such as queuing systems and networks can be described as MDPs rather than stochastic differential equations, so Markowitz’s mean-variance portfolio problem should be extended to MDPs. For mean-variance problem of the MDPs, as in [14] [15] [16] , we aim to obtain a variance optimal policy over a set of policies where the average reward or discounted reward is optimal, so the variance criterion can be transformed into an equivalent average or discount criterion. However, when the mean criterion is not optimal, it is not clear how to develop a policy iteration algorithm to solve the problem. For discrete-time, discount and long-run average variance criterion problem has been studied in [17] [18] . They mainly consider the variance optimization problem, and do not constrain the mean. For continuous-time, the variance of the average expected return has been defined in deterministic stationary policy. The finite-horizon expected reward is defined as below.

V T ( i , f ) = E i f { 0 T r ( X ( t ) , f ) d t } , (1.1)

The variance of f, σ 2 ( i , f ) , is given by:

σ 2 ( i , f ) = lim T 1 T E i f { 0 T r ( X ( t ) , f ) d t V T ( i , f ) } 2 (1.2)

However, the variance function of average expected return of the continuous-time in this paper is given by

η σ f = lim T 1 T E i f { 0 T ( r ( X ( t ) , f ) η f ) 2 d t } , (1.3)

The long-run expected average reward is defined as below.

η f = lim T 1 T E i f { 0 T r ( X ( t ) , f ) d t } . (1.4)

The main work of this paper is to find the iterative algorithm of the optimal policy under the variance criterion (minimum variance) in the countable state space and the Borel measurable action space. For countable state space, the reward function r ( i , a ) may be unbounded, the expected average reward η f , may not be infinite. To guarantee the finiteness of η f , we will impose the following Assumption 1. Next, we use a unique invariant probability measure of Markov chain to denote the average expected return and variance. To this end, we will impose the following Assumption 2, 3, 4. Suppose that Assumptions 1, 2, 3, and 4 are satisfied. We have established a variance criterion. Under the variance criterion, we define the cost function m ( i , a ) = ( r ( i , a ) η f ) 2 , where r ( i , a ) is the system reward at the current stage with state i and action a, and η f is the expected average reward. Obviously, the cost will be affected by future actions, so, η f is also affected by future actions. The traditional MDPs differs from this. The cost function and state transition probability depend only on the current state and the action selected on this stage. Therefore, the conclusions in [14] [15] [16] do not apply to this model. In this paper, we define a pseudo-variance m λ ( i , a ) = ( r ( i , a ) λ ) 2 , where λ is a given constant [17] . Obviously, the value of the pseudo-variance at current stage will not be affected by future actions. It is only related to the current state and current actions, so the pseudo-variance minimization problem is a standard MDP. In this paper, we prove the relation between variance and pseudo-variance. Unlike the literature [17] , we define the deviation of the deterministic stationary policy f for continuous-time MDP. It is proved that the deviation function and the objective function satisfy the Poisson equation, and the uniqueness of the Poisson equation is proved. Based on this, we develop a continuous time MDP policy iterative algorithm to get the optimal strategy, and we prove the convergence of the policy iterative algorithm.

2. Model and Optimization Criteria

The control model associated with the continuous-time MDP that we are concerned with is the five-tuple

{ S , ( A ( i ) A , i S ) , q ( . | i , a ) , r ( i , a ) } (2.1)

1) A denumerable set S, called the stated space, which is the set of all the states of the system under observation.

2) A Borel space A, called the action space. Let

K : = { ( i , a ) | i S , a A ( i ) } . (2.2)

be the set of all feasible state-action pairs.

3) The transition rates q ( j | i , a ) which satisfy q ( j | i , a ) 0 for all ( i , a ) K and j i . Moreover, we assume that the transition rates q ( j | i , a ) are conservative, i.e.,

j S q ( j | i , a ) = 0 ( i , a ) K (2.3)

and stable, which means that

q ( i ) : = sup a A ( i ) q i ( a ) < i S (2.4)

where q i ( a ) : = q ( i | i , a ) 0 for all ( i , a ) K . In addition, j S q ( j | i , a ) is measurable in a A ( i ) for each fixed i , j S .

4) A measurable real-valued function r ( i , a ) on K. called the reward function, which is assumed to be measurable in a A ( i ) for each fixed i S .

The above model is a classical continuous-time MDP model [3] . In MDP, the policies have stochastic Markov policy, stochastic stationary policy and deterministic stationary policy. This paper only considers finding the minimal variance in the deterministic stationary policy class. So we only introduce the definition of deterministic stationary policy.

Definition 1. A deterministic stationary policy is a function f : S A ( i ) such that f ( i ) is in A ( i ) for all i S . A deterministic stationary policy is simply referred to as a stationary policy.

Let F be the set of all deterministic stationary policies.

For each f F , the associated transition rates are defined as

q ( j | i , f ) : = q ( j | i , f ( i ) ) i , j S , (2.5)

the reward function is given by

r ( i , f ) : = r ( i , f ( i ) ) i S . (2.6)

Under Assumption 1, the transition function p f ( i , t , j ) is regular [3] .

Assumption 1:

a) There exist a nondecreasing function ω 1 , on S, and constants c 1 > 0 , and b 1 > 0 , such that

j S ω ( j ) q ( j | i , a ) c 1 ω ( i ) + b 1 δ i 0 ,

for all ( i , a ) K , where δ 00 = 1 , δ i 0 = 0 , i 0 .

b) q ( i ) L 0 ω ( i ) for all i S , with L 0 > 0 and q ( i ) as in (2.4).

c) | r ( i , a ) | M ω ( i ) , for all ( i , a ) K , with some M > 0 .

d) The action set A ( i ) is compact for each i S , the functions r ( i , a ) , q ( j | i , a ) and k S ω ( k ) q ( k | i , a ) are all continuous in a A ( i ) for each fixed i , j S .

e) There exists a nonnegative function ω on S and constants c > 0 , b > 0 , and L 1 > 0 such that

q ( i ) ω ( i ) L 1 ω ( i ) and j S ω ( j ) q ( j | i , a ) c ω ( i ) + b (2.7)

For all ( i , a ) K , with K and q ( i ) as in (2.2) and (2.4), respectively.

For all f F and an arbitrary initial state i S , there is a unique probability space ( Ω , B ( Ω ) , P i f ) , where the probability measure P i f is determined by f and p f ( i , t , j ) . E i f is denoted as the expectation operator P i f . Define the expected average reward and variance respectively.

η f ( i ) = lim T 1 T E i f { 0 T r ( X ( t ) , f ) d t } , (2.8)

η σ f ( i ) = lim T 1 T E i f { 0 T ( r ( X ( t ) , f ) η f ) 2 d t } . (2.9)

Remark 1. From (2.9), we can see that the definition of η σ f is different from the definition of the continuous-time MDP average reward variance criterion in ( [3] , chapter 10), where the value of cost function will be affected by future actions, so this is not a standard MDP optimization problem.

Let’s give some marks first. For any measurable function ω 1 on S, we define the ω -weighted supremum norm . ω of a real-valued measurable function ω on S by

u ω : = sup i S | u ( i ) | ω ( i ) , (2.10)

and the Banach space B ω ( S ) : = { u : u ω < } . Similarly, we can define B ω 2 ( S ) .

We will use the Markov chain invariant measure to represent Equation (2.8) and Equation (2.9). To this end, we impose the following three Assumptions (see [3] ).

Assumption 2:

For each f F , the corresponding Markov process { X ( t ) } with transition function p f ( i , t , j ) is irreducible, which means that, for any two states i j , there exists a set of distinct states i = i 1 , , i m such that

q ( i 2 | i 1 , f ) q ( j | i m , f ) > 0 . (2.11)

Under Assumptions 1(a) and 2, for each f F , Propositions C.11 and C.12 yield that the Markov chain { X ( t ) } has a unique invariant probability measure, denoted by μ f , which satisfies that μ f ( j ) = lim t p f ( i , t , j ) (independent of i S ) for all j S . Thus, by [3] , we have

μ f ( ω ) : = j S ω ( j ) μ f ( j ) b 1 c 1 , (2.12)

which shows that the μ f -expectation of ω (i.e., μ f ( ω ) ) is finite. Therefore, for all f F and u B ω ( S ) , the inequality | u ( i ) | u ω ω ( i ) for all i S gives that the expectation

μ f ( u ) : = i S u ( i ) μ f ( i ) (2.13)

exists and is finite.

Assumption 3:

With ω as in Assumption 1, assume the following conditions are true:

a) There exists constants c > 0 , b > 0 , such that

j S ω 2 ( j ) q ( j | i , a ) c 2 ω 2 ( i ) + b 2 (2.14)

For all ( i , a ) K .

b) There exists a nonnegative function ω on S and constants c > 0 , b > 0 and L 2 0 , such that

q ( i ) ω 2 ( i ) L 2 ω ( i ) , j S ω ( j ) q ( j | i , a ) c ω ( i ) + b (2.15)

for all ( i , a ) K , with K and q ( i ) as in (2.2) and (2.4).

Assumption 4:

a) The control model (2.1) is uniformly ω-exponentially ergodic, which means the following: there exist constants β > 0 , and L 3 > 0 δ > 0 such that

sup f F | E i f u ( X ( t ) ) μ f ( u ) | L 3 e β t u ω ω ( i ) (2.16)

for all i S , u B ω ( S ) , and t 0 .

b) The control model (2.1) is uniformly ω 2 -exponentially ergodic, the definition as in (a)

Remark 2. Under the premise of the above assumptions, it can be known from the literature [3] that for the given f F , the average reward and variance defined by Equation (2.8) and Equation (2.9) are both a number, independent of the initial state. They can represent the expectation form of invariant measures μ f , such that

η f = μ f r : = i S r ( i , f ) μ f ( i ) , (2.17)

η σ f = μ f m : = i S m ( i , f ) μ f ( i ) . (2.18)

We denote r as an S-dimensional column vector composed by element r ( i , f ) and m as an S-dimensional column vector composed by element m ( i , f )

where m ( i , f ) : = ( r ( i , f ) η f ) 2 i S , f F . (2.19)

Our optimization goal is to select f F that satisfies the following condition

η σ f = min f F η σ f (2.20)

By (2.20), the variance minimization problem of Markov chains can be defined as below.

f = arg min f F { η σ f } = arg min f F { μ f m } (2.21)

Remark 3. From (2.21), we see that the value η f will be affected by future actions. There the problem (2.21) is different from standard MDP.

Even if we consider m as a cost function, we can’t directly use the existing conclusions to get the optimal policy.

3. Analysis and Optimization

In this section, we will define a pseudo-variance minimization problem. By proving the relation between the pseudo-variance and the variance, the optimization problem of (2.21) is transformed into the pseudo-variance optimization problem. Further, the optimal policy for variance optimization problem can be derived by the policy iterative algorithm for the pseudo-variance optimization problem, and we can give a sufficient condition for the variance optimal policy.

3.1. Pseudo-Variance Minimization

We define a new cost function as below.

m λ ( i , f ) = ( r ( i , f ) λ ) 2 i S , f F . (3.1)

where λ is a given constant. We denote m λ as an S-dimensional column vector composed by element m λ ( i , f ) and we have

m λ : = ( r λ Ι ) 2 (3.2)

where Ι denote an S-dimensional column vector composed by element 1. We define pseudo-variance function as below

η σ λ f = μ f m λ . (3.3)

Obviously, we have

η σ λ f = η σ f , when λ = η f .

the pseudo-variance minimization problem of Markov chains can be defined as below.

f λ * = arg min f F { η σ λ f } = arg min f F { μ f m λ } = arg min f F i S μ f ( i ) ( r ( i , f ) λ ) 2 (3.4)

From (3.4), we can see that m λ is an instant cost and it has no relation to future actions, Below, we study the relation between these two problems (2.21) and (3.4). First, we have the following lemma about the relation between η σ λ f and η σ f .

Lemma 1. For all f F , the corresponding variance and the pseudo-variance has the following relation

η σ λ f = η σ f + ( η f λ ) 2 . (3.5)

Proof: From (3.1) and (3.3)

η σ λ f = i S μ f ( i ) ( r ( i , f ) λ ) 2 = i S μ f ( i ) ( r ( i , f ) η f + η f λ ) 2 = i S μ f ( i ) [ ( r ( i , f ) η f ) 2 + ( η f λ ) 2 + 2 ( r ( i , f ) η f ) ( η f λ ) ] = η σ f + ( η f ) 2 + λ 2 2 η f λ + 2 ( η f ) 2 2 η f λ 2 ( η f ) 2 + 2 η f λ = η σ f + ( η f λ ) 2

The lemma is proved.

Below we discuss how to solve the pseudo-variance minimum problem. Because (3.4) is a traditional MDP optimization problem, we can solve the problem with the policy iterative algorithm (3.4). Before using the policy iterative algorithm to solve the problem (3.4), we need to prove the existence of the pseudo-variance optimal policy. We suppose that Assumption 1, 2, 3, and 4 are all satisfied, we give the following theorems and lemmas.

Theorem 1. A pair ( g , u ) × B ω 2 ( S ) is said to be a solution to the pseudo-variance of average-reward optimality equation if

g = inf a A ( i ) { ( r ( i , a ) λ ) 2 + j S u ( j ) q ( j | i , a ) } i S (3.6)

Lemma 2. Suppose that Assumptions 1, 2, 3, and 4 are satisfied. Consider an arbitrary fixed state i 0 S . Then, for all f F and discount factors α > 0 , the relative differences of the discounted-reward function η α f , namely,

u α f ( i ) = η α f ( i ) η α f ( i 0 ) i S (3.7)

are uniformly ω-bounded in α > 0 and f F . More precisely, we have

u α f ω 2 L 3 ( M + | λ | ) 2 δ [ 1 + ω 2 ( i 0 ) ] α > 0 , f F . (3.8)

where η α f ( i ) = E i f [ 0 e α t ( r ( X ( t ) , f ) λ ) 2 d t ] . (3.9)

Prove: According to the literature [3] , Lemma 2 can be known.

where m λ ( i , f ) = ( r ( i , f ) λ ) 2 i S , f F .

Theorem 2. Suppose that Assumptions 1, 2, 3, and 4 hold. Then:

There exists a solution ( g , u ) × B ω 2 ( S ) to pseudo variance of average-reward optimality equation. Moreover, the constant g coincides with the optimal average reward function η σ λ , i.e.

g = η σ λ ( i ) i S (3.10)

Prove: Our assumptions ensure the existence of a policy attaining the minimization in the pseudo-variance of average-reward optimality equation, that is,

g = ( r ( i , f ) λ ) 2 + j S u ¯ ( j ) q ( j | i , f ) i S , t 0 , (3.11)

Therefore, Proposition 7.3 of the literature [3] gives g = η σ λ f ( i ) i S .

As a consequence, g = η σ λ ( i ) for every i S , and, moreover, f is optimal policy of pseudo-variance.

In the case where the existence of the pseudo-variance optimal policy is guaranteed, we use the policy iterative algorithm to get the optimal policy.

Suppose that Assumptions 1, 2, 3, and 4 hold, we gave the following concepts.

Definition 2. We define the bias of f as

h σ λ f ( i ) = 0 [ E i f ( r ( X ( t ) , f ) λ ) 2 η σ λ f ] d t i S . (3.12)

Assumption 4: Gives

| E i f ( r ( X ( t ) , f ) λ ) 2 η σ λ f | L 3 ( M + | λ | ) 2 e δ t ω 2 ( i )

| h σ λ f ( i ) | 0 L 3 ( M + | λ | ) 2 e δ t ω 2 ( i ) d t = L 3 ( M + | λ | ) 2 ω 2 ( i ) δ

sup f F h σ λ f ω 2 L 3 ( M + | λ | ) 2 δ .

So, h σ λ f is finite and in B ω 2 ( S ) . Moreover, the bias is uniformly bounded in the ω 2 -norm.

Next we introduce the Poisson equation, which is one of the main results of this paper.

Theorem 3. Let f F . We say that a pair ( η σ λ f , h σ λ f ) × B ω 2 ( S ) is a solution to the Poisson equation for f F if

η σ λ f = ( r ( i , f ) λ ) 2 + j S h σ λ f ( j ) q ( j | i , f ) f F , i S . (3.13)

Proof: Our assumptions (in particular, Assumptions 1 and 4) allow us to interchange the sums and integrals in the following equations:

j S h σ λ f ( j ) q ( j | i , f ) = j S [ 0 ( E j f ( r ( X ( t ) , f ) λ ) 2 η σ λ f ) d t ] q ( j | i , f ) = 0 j S E j f ( r ( X ( t ) , f ) λ ) 2 q ( j | i , f ) d t 0 = 0 j S k S ( r ( k , f ) λ ) 2 p ( j , t , k ) q ( j | i , f ) d t = 0 k S ( r ( k , f ) λ ) 2 j S q ( j | i , f ) p ( j , t , k ) d t

= 0 k S ( r ( k , f ) λ ) 2 d d t p f ( i , t , k ) d t = k S ( r ( k , f ) λ ) 2 p f ( i , t , k ) | 0 = k S ( r ( k , f ) λ ) 2 μ f ( k ) ( r ( i , f ) λ ) 2 = η σ λ f ( r ( i , f ) λ ) 2

We have

η σ λ f = ( r ( i , f ) λ ) 2 + j S h σ λ f ( j ) q ( j | i , f ) .

The theorem is proved.

Finally, we should prove the uniqueness of the solution of Poisson’s equation.

Theorem 4. For every f F , the solutions to the Poisson equation for f are of the form ( η σ λ f , h σ λ f + z ) with z any real number. Moreover, is the unique solution to the Poisson equation

η σ λ f = ( r ( i , f ) λ ) 2 + j S h σ λ f ( j ) q ( j | i , f ) i S (3.14)

for which μ f ( h σ λ f ) = 0

Prove: Suppose now that ( η σ λ f , h σ λ f ) and ( η σ λ f , h σ λ f ) are two solutions to the Poisson equation, simultaneous transformation of both sides of Equation (3.14)

η σ λ f T = 0 T k S ( r ( k , f ) λ ) 2 p ( 0 , i , t , k ) d t + j S 0 T h σ λ f ( j ) k S q ( j | k , f ) p ( 0 , i , t , k ) d t = 0 T E i f ( r ( X ( t ) , f ) λ ) 2 d t + j S 0 T h σ λ f ( j ) k S p ( 0 , i , t , k ) q ( j | k , f ) d t = 0 T E i f ( r ( X ( t ) , f ) λ ) 2 d t + j S 0 T h σ λ f ( j ) d p ( 0 , t , t , j )

= 0 T E i f ( r ( X ( t ) , f ) λ ) 2 d t + j S h σ λ f ( j ) p ( 0 , i , t , j ) | 0 T = 0 T E i f ( r ( X ( t ) , f ) λ ) 2 d t + j S h σ λ f ( j ) p ( 0 , i , T , j ) h σ λ f ( i ) = 0 T E i f ( r ( X ( t ) , f ) λ ) 2 d t + E i f h σ λ f ( X ( T ) ) h σ λ f ( i ) (3.15)

Because ( η σ λ f , h σ λ f ) is also the solution of the Poisson equation, therefore

η σ λ f T = 0 T E i f ( r ( X ( t ) , f ) λ ) 2 d t + E i f h σ λ f ( X ( T ) ) h σ λ f ( i ) (3.16)

(3.15) subtract (3.16)

E i f [ h σ λ f ( X ( t ) ) h σ λ f ( X ( t ) ) ] = h σ λ f ( i ) h σ λ f ( i ) i S . (3.17)

letting t , it follows from Assumption 4 that μ f ( h σ λ f h σ λ f ) = h σ λ f ( i ) h σ λ f ( i ) , showing that the functions h σ λ f and h σ λ f differ by the constant μ f ( h σ λ f h σ λ f ) .

It remains to show that μ f ( h σ λ f ) = 0 .

μ f ( h σ λ f ) = i S h σ λ f ( i ) μ f ( i ) = i S 0 [ E i f ( r ( X ( t ) ) λ ) 2 η σ λ f ] d t μ f ( i ) = i S 0 [ E i f ( r ( X ( t ) ) λ ) 2 ( r ( i , f ) λ ) 2 j S h σ λ f ( j ) q ( j | i , f ) ] d t μ f ( i ) = i S 0 [ E i f ( r ( X ( t ) ) λ ) 2 ] d t μ f ( i ) 0 i S ( r ( i , f ) λ ) 2 μ f ( i ) d t 0

= i S 0 j S ( r ( j , f ) λ ) 2 p ( i , t , j ) μ f ( i ) d t 0 i S ( r ( i , f ) λ ) 2 μ f ( i ) d t = 0 j S ( r ( j , f ) λ ) 2 i S μ f ( i ) p ( i , t , j ) d t 0 i S ( r ( i , f ) λ ) 2 μ f ( i ) d t = 0 j S ( r ( j , f ) λ ) 2 μ f ( j ) d t 0 i S ( r ( i , f ) λ ) 2 μ f ( i ) d t = 0

Remark 4. Given f F , we can determine the gain and the bias of f by solving the following system of linear equations. First, determine the i.p.m. (invariant probability measure) vas the unique nonnegative solution (by Proposition C.12) to

{ j S q ( j | i , f ) μ f ( j ) = 0 j S μ f ( j ) = 1 (3.18)

Then, as a consequence of lecture [2] , the gain

η σ λ f = j S ( r ( j , f ) λ ) 2 μ f ( j )

and the bias h σ λ f B ω 2 ( S ) of f form the unique solution to the system of linear equations

{ η σ λ f = ( r ( i , f ) λ ) 2 + j S h σ λ f ( j ) q ( j | i , f ) i S h σ λ f ( i ) μ f ( i ) = 0 (3.19)

Proposition 1: Policy iterative algorithm.

Step 1. From f F , we can choose a arbitrary.

Step 2. (Strategy evaluation process) Determine the pseudo-variance and deviation of the stationary policy as in Remark 4.

Step 3. (Policy Improvement Process) Choose f as an improvement policy such that

f arg min f F { ( r ( i , f ) λ ) 2 + j S h σ λ f ( j ) q ( j | i , f ) } . (3.20)

Step 4. If f = f , the iteration stops, it is the optimal strategy to minimize the pseudo variance, otherwise, replace f with f and return to step 2.

Proposition 2: Convergence of the strategy iterative algorithm.

When the assumptions 1, 2, 3, 4 are established, let f 1 F be an arbitrary initial policy, let f n F be the sequence of policies obtained from the policy iterative algorithm. The one of the following results is hold.

1) After a finite number of policy iterations, the algorithm converges to the pseudo variance of average-reward optimal strategy.

2) as n , the sequence η σ λ f n converges to the optimal AR function value η σ λ f .

3.2. Variance Minimization

The minimum pseudo-variance problem has been solved. The following theorem gives that when the pseudo-variance reaches a minimum, the variance is also minimized.

Theorem 5. For any policy f F , we compute η f with (2.8), and set λ = η f . If we obtain an improved policy f such that η σ λ f η σ λ f , then we have η σ f η σ f . If η σ λ f < η σ λ f , such that η σ f < η σ f .

Prove: With lemma 1, we have

η σ λ f = η σ f + ( η f λ ) 2 . (3.21)

η σ λ f = η σ f + ( η f λ ) 2 . (3.22)

Let (3.22) subtract (3.21) and substituting λ = η f , we obtain

η σ f η σ f = η σ λ f η σ λ f ( η f η f ) 2 (3.23)

Obviously, if η σ λ f η σ λ f , then η σ f η σ f ; if η σ λ f < η σ λ f , then η σ f < η σ f .

With Theorem 3: when the pseudo-variance reaches a minimum, the variance also reaches a minimum. A sufficient condition for the variance minimization problem is obtained.

4. Examples

This section, we give an example to illustrate the conclusions of this paper.

Example 1 (Control M / M / of queue systems)

The system state X ( t ) indicates the number of customers waiting at the moment (including being served), the arrival rate λ is fixed, and the service rates μ can be controlled. When the system status is i S = { 0 , 1 , } , the decision-maker takes an action a from the allowed action set A ( i ) . When the system is empty, we may impose that A ( 0 ) : = 0 . For each i 1 , let A ( i ) : = [ μ 1 , μ 2 ] with constants μ 2 > μ 1 > 0 , μ [ μ 1 , μ 2 ] , which may increase or decrease the service rate. This action incurs a cost c ( i , a ) . In addition, suppose that there is a benefit represented by p > 0 for each arriving customer, and then the net income of the system is

r ( i , a ) = p i c ( i , a ) (4.1)

This is a continuous time MDP model, the corresponding transition rate are given as follows.

For each a [ μ 1 , μ 2 ] ,

q ( 0 | 0 , 0 ) = q ( 1 | 0 , 0 ) : = λ , q ( j | 0 , 0 ) = 0 , j 2 , (4.2)

For all i 1 , a A ( i ) = [ μ 1 , μ 2 ] ,

q ( j | i , a ) : = { λ i if j = i + 1 , ( λ + u ) i + a if j = i , μ i a if j = i 1 , 0 otherwise , (4.3)

Our goal is to find the existence of a variance optimal policy. To this end, we consider the following assumptions:

D1. μ λ > 0 .

D2. The function c ( i , a ) is continuous in a A ( i ) for each fixed i S , and sup a A ( i ) | c ( i , a ) | < M ˜ ( i + 1 ) for all i S , for some constant M ˜ 0 .

Proposition 3: under conditions D1, D2, the above controlled satisfies Assumptions 1, 2, 3, and 4. Therefore, there exists an variance optimal stationary policy.

Proof: Let c 1 : = 1 3 ( μ λ ) , b 1 : = μ 2 + λ , ω ( i ) : = i + 1 , for all i S . Then, from (4.2) and (4.3), we have

j S ω ( j ) q ( j | 0 , a ) = λ c 1 ω ( 0 ) + μ 2 + λ a A ( i ) (4.4)

Moreover, for all i 1 , a [ μ 1 , μ 2 ]

j S ω ( j ) q ( j | i , a ) = ( μ λ ) i + a c 1 ω ( i ) + μ 2 + λ . (4.5)

which verifies Assumption 1(a).

On the other hand, by (4.2)-(4.3), we have

q ( i ) ( μ + λ ) ( i + 1 ) = ( μ + λ ) ω ( i ) , (4.6)

So Assumption 1(b) follows.

By (4.1) and D2, we have | r ( i , a ) | ( p + M ˜ ) ω ( i ) , for all i S , which implies Assumption 1(c).

By (4.2)-(4.3) and D2, we see that Assumption 1(d) holds.

To verity Assumption 1(e), let

ω ( i ) : = ( i + 1 ) ( i + 2 ) , for each i S (4.7)

Then by (4.2)-(4.3) we have

q ( i ) ω ( i ) ( μ + λ ) ω ( i ) i S (4.8)

j S ω ( j ) q ( j | i , a ) ( 4 λ + μ 2 ) ω ( i ) a [ μ 1 , μ 2 ] , i S (4.9)

which imply Assumption 1(e) with

L 1 : = ( μ + λ ) , c : = 4 λ + μ 2 , b : = 0.

Obviously, Assumption 2 follows from the description of the model.

We verity Assumption 3, by D1 and (4.2)-(4.3), for all i S , i A ( i ) , we have

j S ω 2 ( j ) q ( j | 0 , a ) = 3 λ 1 2 ( μ λ ) ω 2 ( 0 ) + b 2 , (4.10)

(there is b 2 > 0 , (presence is guaranteed by (D1))

j S ω 3 ( j ) q ( j | 0 , a ) = 7 λ ( μ λ ) ω 3 ( 0 ) + b 3 . (4.11)

(there is b 3 > 0 , (presence is guaranteed by (D1))

For i 1 , a A ( i ) = [ μ 1 , μ 2 ] ,

j s ω 2 ( j ) q ( j | i , a ) = 2 ( μ λ ) i 2 ( μ 2 a 3 λ ) i + a 2 ( μ λ ) ( i + 1 ) 2 + 4 ( μ λ ) i + 2 ( μ λ ) + ( 3 λ + 2 a ) i + a 1 2 ( μ λ ) ω 2 ( i ) + b 2 (4.12)

j s ω 3 ( j ) q ( j | i , a ) = 3 ( μ λ ) i 3 + 3 ( 3 λ μ + a ) i 2 ( μ 7 λ 3 a ) i + a = 3 ( μ λ ) ( i + 1 ) 3 + 9 ( μ λ ) i 2 + 9 ( μ λ ) i + 3 ( 3 λ μ + a ) i 2 ( μ 7 λ 3 a ) i + a ( μ λ ) ω 3 ( i ) + b 3 (4.13)

q ω 2 ( i ) ( μ + λ ) ω ( i ) i S (4.14)

Let ω ( i ) : = ω 3 ( i ) = ( i + 1 ) 3 , c 2 : = 1 2 ( μ λ ) , b 2 as above, c : = ( μ λ ) , b = b 3 and L 2 = μ + λ , by (4.10)-(4.14), we see that Assumption 1(d) holds.

Finally, we verify Assumption4, by (4.2)-(4.3) we have, for each fixed f F ,

j k q ( j | i , f ( i ) ) j k q ( j | i + 1 , f ( i + 1 ) ) i , k S , k i + 1

which, together with Proposition C.16 of [3] , implies that the corresponding Markov process X ( t ) is stochastically ordered. Thus, Assumption 4(a) follows from Proposition 7.6. Similarly, the assumption 4(b) is established.

Assumption 1, 2, 3 and 4 holds. So, the Proposition is proved.

5. Discussion and Conclusion

In this article, it defines the variance optimization problem of continuous-time Markov decision processes, which is different from the mean-variance optimization problem previously studied. By defining pseudo-variance, the deviation of the deterministic stationary policy f and the Poisson equation, a series of concepts and theorems, we prove the existence of the variance optimal strategy in the deterministic stationary policy space, and give the policy iterative algorithm to calculate optimal policy. Finally we prove the convergence of the policy iterative algorithm.

Cite this paper
Fu, Y. (2019) Variance Optimization for Continuous-Time Markov Decision Processes. Open Journal of Statistics, 9, 181-195. doi: 10.4236/ojs.2019.92014.
References
[1]   Puterman, M.L. (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons Ltd., New York.
https://doi.org/10.1002/9780470316887

[2]   Liu, K. and Cao, P. (2015) Theory and Application of Markov Decision Processes. Science Press, Beijing.

[3]   Guo, X.P. and Hernandez-lerma, O. (2009) Continuous-Time Markov Decision Processes. Springer, Berlin, Heidelberg.

[4]   Markowitz, H. (1952) Portfolio Selection. The Journal of Finance, 7, 77-91.

[5]   Bielecki, T.R., Jin, H.Q., Pliska, S.R. and Zhou, X.Y. (2005) Continuous-Time Mean-Variance Portfolio Selection with Bankruptcy Prohibition. Mathematical Finance, 15, 213-244.
https://doi.org/10.1111/j.0960-1627.2005.00218.x

[6]   Costa, O.L.V., Maiali, A.C. and de Pinto, A.D.C. (2010) Sampled Control for Mean-Variance Hedging in a Jump Diffusion Financial Market. IEEE Transactions on Automatic Control, 55, 1704-1709.
https://doi.org/10.1109/TAC.2010.2046923

[7]   Fu, C.P., Lari-Lavassani, A. and Li, X. (2010) Dynamic Mean-Variance Portfolio Selection with Borrowing Constraint. European Journal of Operations Research, 200, 312-319.
https://doi.org/10.1016/j.ejor.2009.01.005

[8]   Markowitz, H.M. (1987) Mean-Variance Analysis in Portfolio Choice and Capital Markets. Basil Blackwell, Oxford, UK.

[9]   Xiong, J. and Zhou, X.Y. (2007) Man-Variance Portfolio Selection under Partial Information. SIAM Journal on Control and Optimization, 46, 156-175.
https://doi.org/10.1137/050641132

[10]   Yin, G. and Zhou, X.Y. (2004) Markowitz’s Mean-Variance Portfolio Selection with Regime Switching: From Discrete-Time Models to Their Continuous-Time Limits. IEEE Transactions on Automatic Control, 49, 349-360.
https://doi.org/10.1109/TAC.2004.824479

[11]   Zhou, X.Y. and Li, D. (2000) Continuous-Time Mean-Variance Portfolio Selection: A Stochastic LQ Framework. Applied Mathematics and Optimization, 42, 19-33.
https://doi.org/10.1007/s002450010003

[12]   Zhou, X.Y. and Yin, G. (2003) Markowitz’s Mean-Variance Portfolio Selection with Regime Switching: A Continuous-Time Model. SIAM Journal on Control Optimization, 42, 1466-1482.
https://doi.org/10.1137/S0363012902405583

[13]   Mannor, S. and Tsitsiklis, J.N. (2013) Algorithmic Aspects of Mean-Variance Optimization in Markov Decision Processes. European Journal of Operational Research, 231, 645-653.
https://doi.org/10.1016/j.ejor.2013.06.019

[14]   Guo, X.P. and Song, X.Y. (2009) Mean-Variance Criteria for Finite Continuous-Time Markov Decision Processes. IEEE Transactions on Automatic Control, 54, 2151-2157.
https://doi.org/10.1109/TAC.2009.2023833

[15]   Guo, X.P., Ye, L.E. and George, Y. (2012) A Mean-Variance Optimization Problem for Discounted Markov Decision Processes. European Journal of Operational Research, 220, 423-429.
https://doi.org/10.1016/j.ejor.2012.01.051

[16]   Ye, L.E. and Huang, X.X. (2014) Variance Optimization for Continuous-Time Markov Decision Processes. China Science Journal, 44, 883-898.

[17]   Xia, L. (2016) Optimization of Markov Decision Processes under the Variance Criterion. Automatica, 73, 269-278.
https://doi.org/10.1016/j.automatica.2016.06.018

[18]   Xia, L. (2018) Mean-Variance Optimization of Discrete Time Discounted Markov Decision Processes. Automatica, 88, 76-82.
https://doi.org/10.1016/j.automatica.2017.11.012

 
 
Top