
Policy Iteration

Prologue To Policy Iteration

The prior post reveals that value iteration eventually leads us to the optimal policy, providing the action for the current state to take so as to obtain the maximum value when transiting to the next state. Departing from the multiple states in one MDP model, this article guides you through value iteration to find the greedy policy in each state's transition, and finally obtain the optimal policy for each state. The whole process is called policy iteration.

Policy Iteration Versus Value Iteration

[1] The beginning in MDP

Suppose we are departing from a beginning state in one MDP model. Each time we make a decision, we choose the best action that transits us to the next state with the maximum value (or reward), and we do this over and over again, until we believe that we have built the optimal policy for each state and it brings the whole MDP problem to convergence.

This means that we are not facing the problem of choosing an action under uncertainty in one single state just once; the same state will be re-visited, since we are repeating this over and over again.

There is no doubt that different or similar policies in one same state often interleave with one another; that's the policy iteration process.

[2] The concept of policy iteration

We actually go through the 3 phases below on the way to solving an MDP problem:
➀initially, $\forall S$, $Q_0(S)=0$...[A]
➁proceed with the policy improvement:
$\forall S$, $\pi_t(S)=\underset{A}{\operatorname{argmax}}\,Q_t(S,A)$, $t\geq 0$...[B]
➂do the policy evaluation task:
$\forall S$, $Q_{t+1}(S)=Q^{\pi_t}(S)$...[C]

We start by picking an arbitrary Q function; call this the initialization step, that's [A].

After that, we just iterate: take the t-th Q function, $Q_t$, compute its greedy policy, $\pi_t(S)$, that's [B], then use that greedy policy to get a new Q function, say $Q_{t+1}$, that's [C].

And we repeat and iterate this over and over again. Each time we go around this loop, we take our previous Q function, find its greedy policy, and take that policy to find its next value function. Such repetition actually reaches convergence in finite time.
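To make the [A]/[B]/[C] loop concrete, below is a minimal Python sketch of it over a small tabular MDP of my own assumption; the transition tensor `P` (indexed by action, state, next state), the reward matrix `R` (indexed by state, action), and the discount `gamma` are assumed inputs, and the policy evaluation step here is done by repeated backups for simplicity.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_sweeps=200, tol=1e-8):
    """P[a][s][s']: transition probabilities, R[s][a]: immediate rewards."""
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))            # [A] initialization
    while True:
        pi = Q.argmax(axis=1)                      # [B] greedy policy improvement
        # [C] policy evaluation: approach Q^pi by repeated backups
        for _ in range(eval_sweeps):
            V = Q[np.arange(n_states), pi]         # V^pi(s) = Q^pi(s, pi(s))
            Q_new = R + gamma * np.einsum('ast,t->sa', P, V)
            if np.max(np.abs(Q_new - Q)) < tol:
                break
            Q = Q_new
        if np.array_equal(Q.argmax(axis=1), pi):   # policy is stable -> done
            return pi, Q
```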

[3] Policy iteration is a better way

The sequence of Q functions converges to $Q^{\ast}$ after we run a series of value iterations in finite time; in particular, my prior post illustrates how policy iteration works implicitly.

The convergence of policy iteration is at least as fast as that of value iteration, in the sense that if at any point we sync up the Q functions, that is, we start value iteration and policy iteration from the same Q function, then each step policy iteration takes moves us toward the optimal Q function no more slowly than value iteration does.

So, policy iteration is a better way.

[4] Concerns pertaining to policy iteration::mjtsai1974

One thing needs to be taken into consideration when we do the policy improvement task: faster convergence comes at the cost of greater computational expense.

In particular, when transiting from $S_i$ to $S_{i+1}$ in stage $t$, we'd like to do policy evaluation, $\forall S$, $Q_{t+1}(S)=Q^{\pi_t}(S)$, by taking the policy and working out the $Q_{t+1}$ function to be used by $S_i$ as the value obtained after transiting to $S_{i+1}$.

In stage $t+1$, it does the policy improvement again; with the newly updated (or even the same) policy coming out, the $Q_{t+2}$ function is then generated, and this cycle repeats until it converges.

Yes, policy evaluation is solving a system of linear equations.
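To see the linear-system view, here is a minimal numpy sketch of my own that solves $V^{\pi}=R^{\pi}+\gamma P^{\pi}V^{\pi}$ exactly, that is $V^{\pi}=(I-\gamma P^{\pi})^{-1}R^{\pi}$, for a fixed policy; the array layout follows the earlier sketch and is an assumption of mine.

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma=0.9):
    """Solve V^pi = R^pi + gamma * P^pi V^pi exactly, instead of iterating."""
    n_states = R.shape[0]
    P_pi = P[pi, np.arange(n_states), :]        # P^pi[s, s'] = P(s' | s, pi(s))
    R_pi = R[np.arange(n_states), pi]           # R^pi[s]     = R(s, pi(s))
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # Recover Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V^pi(s')
    Q_pi = R + gamma * np.einsum('ast,t->sa', P, V_pi)
    return V_pi, Q_pi
```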

The scenario can be constructed by putting value iteration in the inner loop, wrapped by the outer loop of policy iteration, since the value iteration operates in accordance with the policy, exactly, the action mapped from the current state by the current greedy policy at this moment. So, the value iteration should sit in the inner loop of policy iteration.

Trivially, policy iteration does more work per iteration than value iteration; still, if the greedy policy seldom changes on its way to convergence, it should be at least as fast as value iteration.
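For comparison, here is a plain value-iteration sketch in the same assumed `P`/`R` layout; counting its sweeps against the outer-loop iterations of the `policy_iteration` sketch above on the same toy MDP is one informal way to see the trade-off between per-iteration cost and iteration count.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Plain value iteration over Q; returns the greedy policy and sweep count."""
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    sweeps = 0
    while True:
        V = Q.max(axis=1)                               # V(s) = max_a Q(s, a)
        Q_new = R + gamma * np.einsum('ast,t->sa', P, V)
        sweeps += 1
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new.argmax(axis=1), sweeps
        Q = Q_new
```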

Why Does Policy Iteration Work?

[1] What is the convergence time?

➀we know there exist some MDPs for which the number of iterations policy iteration takes is linear; it is at least as large as the number of states in the MDP.
➁the worst case is that it can't be larger than the number of actions raised to the number of states.

linear in $\vert S\vert$ $\leq$ convergence time $\leq$ $\vert A\vert^{\vert S\vert}$

[2] Domination

➀the domination:
$\pi_1\geq\pi_2$ iff $\forall s\in S$, $V^{\pi_1}(s)\geq V^{\pi_2}(s)$
➁the strict domination:
$\pi_1>\pi_2$ iff $\forall s\in S$, $V^{\pi_1}(s)\geq V^{\pi_2}(s)$, and
there exists some $s\in S$, $V^{\pi_1}(s)>V^{\pi_2}(s)$
$\pi$ is $\epsilon$-optimal iff $\forall s\in S$, $\vert V^{\pi}(s)-V^{\ast}(s)\vert\leq\epsilon$

To expand further, a policy is $\epsilon$-optimal if the value function for that (greedy) policy is $\epsilon$-close to the value function for the optimal policy at all states. It gives us a concept of bounding the loss, or bounding the regret.

We can treat a policy as nearly optimal if the per-step loss in each state transition is very small.

From the above we can tell that if policy 1 dominates policy 2, but does not strictly dominate it, then they must have the same value everywhere. Since the value is greater than or equal to everywhere, yet strictly greater nowhere, it is always equal.
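As a small aid, the three definitions above can be written as predicates over tabular value functions; this is a sketch of my own, with `V1`, `V2`, `V_star` assumed to be numpy arrays indexed by state.

```python
import numpy as np

def dominates(V1, V2):
    """pi_1 >= pi_2: V^pi1(s) >= V^pi2(s) for every state s."""
    return np.all(V1 >= V2)

def strictly_dominates(V1, V2):
    """pi_1 > pi_2: domination plus strict improvement in at least one state."""
    return np.all(V1 >= V2) and np.any(V1 > V2)

def is_eps_optimal(V_pi, V_star, eps):
    """pi is eps-optimal: |V^pi(s) - V*(s)| <= eps for every state s."""
    return np.all(np.abs(V_pi - V_star) <= eps)
```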

[3] Bellman operator is monotonic

We define 2 Bellman operators on 2 policies; they are $B_1$, $B_2$ on $\pi_1$, $\pi_2$:
$[B_1 V](S)=R(S,A)+\gamma\sum_{S'}P(S'\vert S,\pi_1(S))V(S')$
$[B_2 V](S)=R(S,A)+\gamma\sum_{S'}P(S'\vert S,\pi_2(S))V(S')$

So, if we apply $B_1$ to a value function, it just returns the immediate reward plus the discounted expected value of following $\pi_1$; the same idea applies to $B_2$, which uses $\pi_2$.

We already know a bunch of things about such updates. If we apply the same Bellman operator to two distinct value functions, then they do not move further apart; the distance between them shrinks by at least a factor of gamma, unless these 2 value functions are already the same.

If they are already perfectly together, sitting at the same fixed point, then they can't move any closer together.
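Below is a minimal numpy sketch, under my own assumed layout, of such a fixed-policy Bellman operator (with the immediate reward taken at the policy's action) together with an informal check of the gamma-contraction claim on two arbitrary value functions.

```python
import numpy as np

def bellman_op(V, P, R, pi, gamma=0.9):
    """[B_pi V](s) = R(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) V(s')."""
    idx = np.arange(len(V))
    return R[idx, pi] + gamma * P[pi, idx, :] @ V

def check_contraction(V1, V2, P, R, pi, gamma=0.9):
    """Distance after one backup is at most gamma times the distance before."""
    before = np.max(np.abs(V1 - V2))
    after = np.max(np.abs(bellman_op(V1, P, R, pi, gamma)
                          - bellman_op(V2, P, R, pi, gamma)))
    return after <= gamma * before + 1e-12
```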

[4] Policy iteration behaves itself

If $V_1\geq V_2$, then $B_2 V_1\geq B_2 V_2$.

proof::mjtsai1974

$[B_2 V_1-B_2 V_2](S)$
$=\gamma\sum_{S'}P(S'\vert S,\pi_2(S))(V_1(S')-V_2(S'))$
, where the term $(V_1(S')-V_2(S'))\geq 0$ by the given.

Then, we take a convex combination of a bunch of non-negative values by summing them up; there is no way it can be negative.

We have thus proved that $V_1$ dominating $V_2$ leads to $B_2 V_1$ dominating $B_2 V_2$.
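Reusing the `bellman_op` sketch above, the monotonicity claim can also be spot-checked numerically on any pair of value vectors with $V_1\geq V_2$ elementwise; this is an informal check, not a proof.

```python
def check_monotonicity(V1, V2, P, R, pi, gamma=0.9):
    """If V1 >= V2 everywhere, then B_pi V1 >= B_pi V2 everywhere."""
    assert (V1 >= V2).all(), "precondition of the claim"
    return (bellman_op(V1, P, R, pi, gamma)
            >= bellman_op(V2, P, R, pi, gamma) - 1e-12).all()
```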

[5] Cautions::mjtsai1974

Applying the same Bellman operator with the same greedy policy does not guarantee that the values of two different states, say $S_1$ and $S_2$, get closer after this transition.

Because the same policy acting on different states might not come out with the same action, and even the same action might lead to different one-step transitions.

In the above proof, we are applying the operator to the same state, but different value functions.

Value Improvement In Policy Iteration

[1] The background

Suppose in the beginning we are given the Bellman operator $B_1$ applied on some policy $\pi_1$ with respect to the value function $Q_1$, such that $Q_1=B_1 Q_1$ holds.

Still more, there exists another Bellman operator $B_2$ applied on a greedy policy $\pi_2$ with respect to the value function $Q_1$, such that $Q_1\leq B_2 Q_1$ holds.

The above construction holds because $\pi_2$ is greedy with respect to $Q_1$. We have this inequality:
$Q_1(S,A)$
$=R(S,A)+\gamma\sum_{S'}P(S'\vert S,\pi_1(S))Q_1(S',\pi_1(S'))$
$\leq R(S,A)+\gamma\sum_{S'}P(S'\vert S,\pi_2(S))Q_1(S',\pi_2(S'))$

Imagine we are beginning from state $S$, using $B_1$ on $\pi_1$ to reach $Q_1$, as $Q_1$ is the fixed point of $B_1$; that is to say, you start off with the policy, and you get the value function of that policy.

Then, we might come back to the same state $S$, since we are repeating the value iteration over and over again, thus returning to the original departure point, in the case where the MDP model is already known; or we are just evaluating over the sampled data, trying to complete the contour of this MDP model. Even so, we still have the chance to re-visit a state we have arrived at before, say the initial state $S$.

This time, intuitively, we take $B_2$ following $\pi_2$; this greedy policy turns our value function $Q_1$ no worse, possibly better, that is, $Q_1\leq B_2 Q_1$.

[2] The policy iteration moves in the right direction

We have put all the things together to prove that policy iteration moves in the right direction:
➀the Bellman operator monotonicity
➁the value improvement property
➂the definitions of $\pi_1$, $\pi_2$, $B_1$, $B_2$, $Q_1$, $Q_2$

[3] The policy iteration in a summary

$B_2 Q_1\geq Q_1$...value improvement
$\forall k\geq 0$, $B_2^{k+1}Q_1\geq B_2^{k}Q_1$...monotonicity
$\lim_{k\rightarrow\infty}B_2^{k}Q_1\geq Q_1$...transitivity
$Q_2\geq Q_1$...fixed point
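To tie the chain together, here is a small sketch, reusing `evaluate_policy_exact` from above and my own assumed `P`/`R` layout, that evaluates $\pi_1$, improves it greedily into $\pi_2$, evaluates again, and asserts that the resulting $Q_2$ dominates $Q_1$ on a randomly generated toy MDP.

```python
import numpy as np

def one_improvement_step(P, R, gamma=0.9):
    n_actions, n_states, _ = P.shape
    pi_1 = np.zeros(n_states, dtype=int)            # arbitrary starting policy
    # Q_1 = Q^{pi_1}, Q_2 = Q^{pi_2} with pi_2 greedy with respect to Q_1
    _, Q_1 = evaluate_policy_exact(P, R, pi_1, gamma)
    pi_2 = Q_1.argmax(axis=1)
    _, Q_2 = evaluate_policy_exact(P, R, pi_2, gamma)
    assert (Q_2 >= Q_1 - 1e-10).all()               # Q_2 dominates Q_1
    return pi_2, Q_2

# toy example: 2 actions, 3 states, random transitions and rewards
rng = np.random.default_rng(0)
P = rng.random((2, 3, 3)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((3, 2))
one_improvement_step(P, R)
```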

Policy Iteration Won't Get Stuck In A Local Optimum

[1] The important thing

If there is any way to improve, it will actually improve; it won't get stuck in a local optimum. This is really the very important thing.

[2] Value non-deprovement or value improvement

If it's the case that $\pi_1$ is optimal, then the greedy policy $\pi_2$ just leads the value function to the same fixed point. We are simply stuck at that fixed point, because we have already reached it; that's the whole thing.

If it's the case that $\pi_1$ is not optimal, there is some room in which value non-deprovement becomes value improvement, and this inequality holds:
$Q_1(S,A)$
$=R(S,A)+\gamma\sum_{S'}P(S'\vert S,\pi_1(S))Q_1(S',\pi_1(S'))$
$\leq R(S,A)+\gamma\sum_{S'}P(S'\vert S,\pi_2(S))Q_1(S',\pi_2(S'))$

The important thing is that this might be a strict value improvement for some states, with $\leq$ replaced by $<$.

Illustration: More On Value Non-deprovement And Value Improvement

[Question]

Suppose I have 2 states, and I keep swapping back and forth between $S_1$ and $S_2$ per iteration, to evaluate which one is better than the other. The question is: after swapping, do you still have domination in both cases? The MDP model is exhibited below:
Trivially, in $S_1$, $V_1$ has more value than $V_2$.

[Answer] No, because domination is pointwise, that is, statewise.

If you don't consider value non-deprovement, you can swap between $S_1$ and $S_2$. Suppose you are looping over and over again in this MDP until it is time to converge; at this moment in $S_1$ you have $V_1>V_2$, but after you transit to $S_2$ you have $V_1\leq V_2$, which is a violation. Value non-deprovement and value improvement don't allow a value function to go down; it should at least stay the same!!!

The point is that the statement is over all the states and their value functions: once $V_1$ strictly dominates $V_2$ in $S_1$, it should stay that way after transiting to $S_2$.

So, getting stuck locally doesn't happen if we are already at the unique fixed point with strict domination, in this example $S_1$; however, if we are beginning from $S_2$, it should transit to $S_1$ for no worse, even better, value.

[Notes::mjtsai1974] By policy iteration, we'll always get a better policy.

For, if we are starting from $S_1$ with $\pi_1$, the initial policy of something, and picking $A_1$ to transit to $S_2$, we are at least one step toward the final optimum in $S_2$ at this moment, or it is not worse, might be the same, even better than it is in $S_1$.

Addendum

Advanced algorithmic analysis, Charles Isbell, Michael Littman, Reinforcement Learning by Georgia Tech (CS8803)