
Policy Iteration

Prologue To Policy Iteration

The prior post reveals that value iteration eventually leads us to the optimal policy, which provides, for the current state, the action to take to obtain the maximum value when transiting to the next state. Departing from multiple states in one MDP model, this article will guide you through value iteration to find the greedy policy in each state's transition, and finally obtain the optimal policy for each state. The whole process is called policy iteration.

Policy Iteration Versus Value Iteration

[1] The beginning in an MDP

Suppose we are departing from a beginning state in one MDP model. Each time we make a decision to choose the best action that can transit us to the next state with the maximum value (or reward), and we do this over and over again, until we believe that we have built the optimal policy for each state, and it brings the whole MDP problem to convergence.

This means that we are not facing the problem of choosing an action under uncertainty in just one single state; the same state would then be re-visited, since we are repeating this over and over again.

There is no doubt that different or similar policies in one same state often interleave with one another; that's the policy iteration process.

[2] The concept of policy iteration

We actually go through the 3 phases below on the way to solving an MDP issue:
➀initially, $\forall S, Q_{0}(S)=0$…[A]
➁proceed with policy improvement
$\forall S, \pi_{t}(S)$=$argmax_{A}Q_{t}(S,A), t\geq 0$…[B]
➂do the policy evaluation task
$\forall S, Q_{t+1}(S)$=$Q^{\pi_{t}}(S)$…[C]

We start by picking an arbitrary $Q$ function; call this the initialization step, that's [A].

After that, we just iterate: take the $t$-th $Q$ function, $Q_{t}$, compute its greedy policy, $\pi_{t}(S)$, that's [B], then use that greedy policy to get a new $Q$ function, say $Q_{t+1}$, that's [C].

And we repeat and iterate this over and over again. Each time we go around this loop, we take our previous $Q$ function, find its greedy policy, and take that policy to find its next value function; such repetition actually reaches convergence in finite time.
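To make the [A], [B], [C] loop concrete, here is a minimal numpy sketch of tabular policy iteration. It is an illustrative assumption, not the lecture's reference implementation: the names `policy_iteration`, `P`, `R`, `gamma`, and `tol` are mine, the MDP is given as a transition tensor and reward matrix, and the evaluation step [C] is carried out by repeated Bellman backups under the fixed greedy policy.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular policy iteration on Q functions.
    P: transition tensor of shape (S, A, S), P[s, a, s2] = P(s2 | s, a)
    R: reward matrix of shape (S, A)
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))            # [A] initialization: Q_0 = 0
    while True:
        pi = Q.argmax(axis=1)                      # [B] policy improvement: greedy policy pi_t
        while True:                                # [C] policy evaluation: Q_{t+1} = Q^{pi_t}
            V = Q[np.arange(n_states), pi]         # V(s) = Q(s, pi(s))
            Q_next = R + gamma * (P @ V)           # one Bellman backup for every (s, a)
            done = np.max(np.abs(Q_next - Q)) < tol
            Q = Q_next
            if done:
                break
        if np.array_equal(Q.argmax(axis=1), pi):   # greedy policy stable: converged
            return Q, pi

# Tiny 2-state, 2-action example with made-up numbers, just to exercise the loop.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
Q_star, pi_star = policy_iteration(P, R)
```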

[3] Policy iteration is a better way

The sequence of $Q$ functions converges to $Q^{\ast}$ after a series of value iterations in finite time; in particular, my prior post illustrates how policy iteration works implicitly.

The convergence of policy iteration is at least as fast as that of value iteration, in the sense that if at any point we sync up the $Q$ functions, that is, we start value iteration and policy iteration from the same $Q$ function, then each step policy iteration takes moves us towards the optimal $Q$ function no more slowly than value iteration does.

So, policy iteration is a better way.
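A quick way to see the difference in practice is to count outer steps on the same MDP. The sketch below is only a rough empirical check under an assumed random `P` and `R` (all names and numbers are mine), counting Bellman-optimality backups for value iteration against greedy-improvement steps for policy iteration with exact evaluation per step; it is not a formal argument.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, tol = 5, 3, 0.9, 1e-6

# A random tabular MDP, purely for illustration.
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

# Value iteration: count Bellman-optimality backups until the Q function stops moving.
Q, vi_steps = np.zeros((nS, nA)), 0
while True:
    Q_next = R + gamma * (P @ Q.max(axis=1))
    vi_steps += 1
    if np.max(np.abs(Q_next - Q)) < tol:
        break
    Q = Q_next

# Policy iteration: count policy-improvement steps, evaluating each policy exactly.
pi, pi_steps = np.zeros(nS, dtype=int), 0
while True:
    idx = np.arange(nS)
    V = np.linalg.solve(np.eye(nS) - gamma * P[idx, pi], R[idx, pi])
    pi_steps += 1
    new_pi = (R + gamma * (P @ V)).argmax(axis=1)
    if np.array_equal(new_pi, pi):
        break
    pi = new_pi

print(vi_steps, pi_steps)   # policy iteration typically needs far fewer outer steps
```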

[4] Concerns pertaining to policy iteration::mjtsai1974

One thing needs to be taken into consideration when we are doing the policy improvement task: the faster convergence comes at the cost of greater computational expense.

In particular, when transiting from $S_{i}$ to $S_{i+1}$ in stage $t$, we'd like to make the policy evaluation, $\forall S, Q_{t+1}(S)$=$Q^{\pi_{t}}(S)$, by taking the policy and working out the $Q_{t+1}$ function, to be used by $S_{i}$ as the value obtained after transiting to $S_{i+1}$.

When in stage $t+1$, it does the policy improvement again; with the new, updated (or even the same) policy coming out, the $Q_{t+2}$ function is then generated, and this cycle repeats until it converges.

Yes, it is solving a system of linear equations.
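Concretely, for a fixed policy $\pi$ the evaluation step can be solved exactly as the linear system $V^{\pi}=(I-\gamma P^{\pi})^{-1}R^{\pi}$. Below is a minimal sketch, assuming the same tabular layout of `P` and `R` as in the earlier sketch; the function name and shapes are illustrative assumptions, not from the lecture.

```python
import numpy as np

def evaluate_policy_exactly(P, R, pi, gamma=0.9):
    """Solve V^pi = R^pi + gamma * P^pi V^pi, then lift V^pi back to Q^pi.
    P: (S, A, S) transition tensor, R: (S, A) rewards, pi: (S,) action indices.
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, pi]                                   # row s is P(. | s, pi(s))
    R_pi = R[idx, pi]                                   # R(s, pi(s))
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # Q^pi(s, a) = R(s, a) + gamma * sum_s2 P(s2 | s, a) * V^pi(s2)
    return R + gamma * (P @ V_pi)
```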

The scenario might be constructed by putting value iteration in the inner loop, wrapped by the outer loop of policy iteration, since the value iteration operates in accordance with the policy, exactly, the action mapped from the current state by the current greedy policy at this moment. So, the value iteration should be in the inner loop of policy iteration.

Trivially, policy iteration is doing much more work than value iteration; still, if the greedy policy seldom changes on its way to convergence, it should be at least as fast as value iteration.

Why Does Policy Iteration Work?

[1] What is the convergence time?

➀we know there exist some MDPs for which the number of iterations policy iteration takes is linear, i.e., at least as large as the number of states in the MDP.
➁the worst case is that it can't be larger than the number of actions raised to the power of the number of states.

$\;\;$linear in $\vert S\vert\leq$ convergence time $\leq \vert A\vert^{\vert S \vert}$

[2] Domination

➀the domination:
$\pi_{1}\geq\pi_{2}$ iff $\forall s\in S, V^{\pi_{1}}(s)\geq V^{\pi_{2}}(s)$
➁the strict domination:
$\pi_{1}>\pi_{2}$ iff $\forall s\in S, V^{\pi_{1}}(s)\geq V^{\pi_{2}}(s)$, and
there exists some $s\in S$ such that $V^{\pi_{1}}(s)> V^{\pi_{2}}(s)$
➂$\pi$ is $\varepsilon$-optimal iff $\forall s\in S, \vert V^{\pi}(s)-V^{\pi^{\ast}}(s)\vert\leq\varepsilon$

To expand further, a policy is $\varepsilon$-optimal if the value function for that (greedy) policy is $\varepsilon$-close to the value function of the optimal policy at all states. It gives us a concept of bounding the loss, or bounding the regret.

We can treat a policy as nearly optimal if the per-step loss in each state transition is very, very small.

From the above we can tell that if policy 1 dominates policy 2 but does not strictly dominate it, then the two must have the same value everywhere: since the value is never strictly greater in any state, it is always equal.
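These definitions translate directly into element-wise checks over the state space. Below is a tiny sketch, assuming the value functions are given as numpy arrays indexed by state; the function names are mine, for illustration only.

```python
import numpy as np

def dominates(V1, V2):
    """pi_1 >= pi_2: V^{pi_1}(s) >= V^{pi_2}(s) for every state s."""
    return bool(np.all(V1 >= V2))

def strictly_dominates(V1, V2):
    """pi_1 > pi_2: domination plus strict improvement in at least one state."""
    return bool(np.all(V1 >= V2) and np.any(V1 > V2))

def is_epsilon_optimal(V_pi, V_star, eps):
    """|V^{pi}(s) - V^{*}(s)| <= eps for every state s."""
    return bool(np.max(np.abs(V_pi - V_star)) <= eps)
```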

[3] Bellman operator is monotonic

We define 2 Bellman operators on 2 policies, namely $B_{1}$, $B_{2}$ on $\pi_{1}$, $\pi_{2}$:
➀$\lbrack B_{1}V\rbrack(S)$=$R(S,\pi_{1}(S))$+$\gamma\cdot\sum_{S'}P(S'\vert S,\pi_{1})\cdot V(S')$
➁$\lbrack B_{2}V\rbrack(S)$=$R(S,\pi_{2}(S))$+$\gamma\cdot\sum_{S'}P(S'\vert S,\pi_{2})\cdot V(S')$

So, if we apply $B_{1}$ to a value function, it just returns the immediate reward plus the discounted expected value of following $\pi_{1}$; the same idea holds for $B_{2}$, which uses $\pi_{2}$.

We already know a bunch of things about such updates. If we apply the same Bellman operator to two distinct value functions, they do not move further apart; their distance shrinks by at least a factor of $\gamma$, unless these 2 value functions are already the same.

If they are already exactly together, sitting at the same fixed point, then they can't move any closer together; they just stay there.
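Here is a short numeric check of this contraction claim, assuming a random fixed-policy slice of an MDP (the matrix `P_pi`, the vector `R_pi`, and the numbers are all illustrative assumptions): applying the same operator $B_{\pi}$ to two different value functions shrinks their sup-norm distance by at least a factor of $\gamma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 4, 0.9

# A random fixed-policy slice of an MDP: row-stochastic P_pi and rewards R_pi.
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)
R_pi = rng.random(n_states)

def bellman_backup(V):
    # [B_pi V](s) = R(s, pi(s)) + gamma * sum_s2 P(s2 | s, pi(s)) * V(s2)
    return R_pi + gamma * (P_pi @ V)

V1, V2 = rng.random(n_states), rng.random(n_states)
before = np.max(np.abs(V1 - V2))
after = np.max(np.abs(bellman_backup(V1) - bellman_backup(V2)))
assert after <= gamma * before + 1e-12   # sup-norm distance shrinks by at least a factor gamma
```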

[4] Policy iteration behaves itself

$\;\;$ if $V_{1}\geq V_{2}$, then $B_{2}V_{1}\geq B_{2}V_{2}$.

proof::mjtsai1974

$\lbrack B_{2}V_{1}-B_{2}V_{2}\rbrack(S)$
=$\gamma\cdot\sum_{S’}P(S’\vert S, \pi_{2})\cdot (V_{1}(S’)-V_{2}(S’))$
, where the term $(V_{1}(S’)-V_{2}(S’))\geq 0$ by the given assumption.

Then, we take a convex combination of a bunch of non-negative values, summing them up with the transition probabilities as weights; there is no way it can be negative.

We have thus proved that $V_{1}$ dominating $V_{2}$ implies that $B_{2}V_{1}$ dominates $B_{2}V_{2}$.
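The same kind of construction lets us check the monotonicity statement numerically: take any $V_{2}$, add a non-negative perturbation to get $V_{1}\geq V_{2}$, and the backup preserves the ordering. A self-contained sketch, with all quantities assumed at random for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, gamma = 4, 0.9

# Fixed-policy pieces of a random MDP: row-stochastic P_pi and rewards R_pi.
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)
R_pi = rng.random(n_states)

B = lambda V: R_pi + gamma * (P_pi @ V)   # the Bellman operator B_2 for this fixed policy

V2 = rng.random(n_states)
V1 = V2 + rng.random(n_states)            # V1 >= V2 elementwise, by construction
assert np.all(B(V1) >= B(V2))             # monotonicity: B_2 V1 >= B_2 V2
```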

[5] Cautions::mjtsai1974

The same Bellman operator with the same greedy policy might not guarantee that the values of $S_{1}$ and $S_{2}$ get closer after this transition.

This is because the same policy applied to different states might not produce the same action, and even the same action might lead to different one-step transitions.

In the above proof, we are applying the operator to the same state, but to different value functions.

Value Improvement In Policy Iteration

[1] The background

Suppose that, in the beginning, we are given a Bellman operator $B_{1}$ for some policy $\pi_{1}$ with respect to the value function $Q_{1}$, such that $Q_{1}$=$B_{1}Q_{1}$ holds.

Still more, there exists another Bellman operator $B_{2}$ for the greedy policy $\pi_{2}$ with respect to the value function $Q_{1}$, such that $Q_{1}\leq B_{2}Q_{1}$ holds.

The above construction holds because $\pi_{2}$ is greedy with respect to $Q_{1}$. We have this inequality:
$Q_{1}(S,A)$
=$R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S, A)\cdot Q_{1}(S',\pi_{1}(S'))$
$\;\;\leq R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S, A)\cdot Q_{1}(S',\pi_{2}(S'))$

Imagine we are beginning from state $S$ by using $B_{1}$ on $\pi_{1}$ to reach $Q_{1}$, as it is the fixed point of $B_{1}$; that is to say, you start off with the policy, and you get the value function of that policy.

Then we might come back to the same state $S$, since we are repeating the value iteration over and over again, thus returning to the original departure point, in the case where the MDP model is already known; or we are just evaluating over the sampled data, trying to complete the contour of this MDP model. Even so, we still have a chance to re-visit a state we have arrived at before, say the initial state $S$.

This time, intuitively, we take $B_{2}$, following $\pi_{2}$; this greedy policy would leave our value function $Q_{1}$ no worse, and possibly better, that is, $Q_{1}\leq B_{2}Q_{1}$.

[2] The policy iteration moves in the right direction

We now put all the things together to prove that policy iteration moves in the right direction:
➀the Bellman operator monotonicity
➁the value improvement property
➂the definition of $\pi_{1}$,$\pi_{2}$,$B_{1}$,$B_{2}$,$Q_{1}$,$Q_{2}$

[3] The policy iteration in a summary

➀$B_{2}Q_{1}\geq Q_{1}$…value improvement
➁$\forall k\geq 0, B_{2}^{k+1}Q_{1}\geq B_{2}^{k}Q_{1}$…monotonicity
➂$\lim_{k\rightarrow\infty}B_{2}^{k}Q_{1}\geq Q_{1}$…transitivity
➃$Q_{2}\geq Q_{1}$…fixed point
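Putting ➀ through ➃ together numerically: start from $Q_{1}=Q^{\pi_{1}}$, take the greedy $\pi_{2}$, and repeatedly apply $B_{2}$; each application never decreases any entry, and the limit is $Q_{2}=Q^{\pi_{2}}\geq Q_{1}$. A self-contained sketch on a random tabular MDP, where every name and number is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 3, 2, 0.9

# A random tabular MDP, purely illustrative.
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # P[s, a, s2] = P(s2 | s, a)
R = rng.random((nS, nA))                                          # R(s, a)

def q_of_policy(pi):
    """Exact Q^pi via the linear system for V^pi."""
    idx = np.arange(nS)
    V = np.linalg.solve(np.eye(nS) - gamma * P[idx, pi], R[idx, pi])
    return R + gamma * (P @ V)

pi1 = np.zeros(nS, dtype=int)          # some arbitrary starting policy
Q1 = q_of_policy(pi1)                  # Q_1 = Q^{pi_1}, the fixed point of B_1
pi2 = Q1.argmax(axis=1)                # greedy policy with respect to Q_1

def B2(Q):                             # Bellman operator for the fixed policy pi_2
    return R + gamma * (P @ Q[np.arange(nS), pi2])

Q = Q1
for _ in range(500):                   # B_2^k Q_1 never decreases in k ...
    Q_next = B2(Q)
    assert np.all(Q_next >= Q - 1e-10)
    Q = Q_next

assert np.all(Q >= Q1 - 1e-10)         # ... and its limit Q_2 dominates Q_1
```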

Policy Iteration Won't Get Stuck In A Local Optimum

[1] The important thing

If there is any way to improve, it will actually improve; it won't get stuck in a local optimum. This is really the important thing.

[2] Value non-deprovement or value improvement

If it's the case that $\pi_{1}$ is optimal, then the greedy policy $\pi_{2}$ would just lead the value function to the same fixed point. We are just stuck at the same fixed point, because we have already reached the fixed point; that's the whole thing.

If it's the case that $\pi_{1}$ is not optimal, there is some room for improvement, which turns value non-deprovement into value improvement, and this inequality holds:
$Q_{1}(S,A)$
=$R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S, A)\cdot Q_{1}(S',\pi_{1}(S'))$
$\;\;\leq R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S, A)\cdot Q_{1}(S',\pi_{2}(S'))$

The important thing is that there might be strict value improvement for some states, replacing $\geq$ with $>$.

Illustration: More On Value Non-deprovement And Value Improvement

[Question]

Suppose I have 2 states, and I keep swapping back and forth between $S_{1}$ and $S_{2}$ per iteration, to evaluate which one is better than the other. The question is: after swapping, do you still have domination in both cases? The MDP model is exhibited below:
Trivially, in $S_{1}$, $V_{1}$ has more value than $V_{2}$.

[Answer] No, because domination is point-wise, that is, state-wise.

If you don't consider value non-deprovement, you can swap between $S_{1}$ and $S_{2}$. Suppose you are looping over and over again in this MDP until convergence; at this moment in $S_{1}$ you have $V_{1}> V_{2}$, but after you transit to $S_{2}$ the ordering of the value functions swaps, which is a violation. Value non-deprovement and value improvement don't allow a value function to go down; it should at least stay the same!!!

The point is that the statement is over all the states and their value functions: once $V_{1}$ strictly dominates $V_{2}$ in $S_{1}$, it should keep it that way after transiting to $S_{2}$.

So, getting stuck locally doesn't happen if we are already at the unique fixed point with strict domination, in this example $S_{1}$; however, if we are beginning from $S_{2}$, it should transit to $S_{1}$ to be no worse, possibly even better.

[Notes::mjtsai1974] By policy iteration, we'll always get a better policy.

For if we are starting from $S_{1}$ with $\pi_{1}$, some initial policy, picking $A_{1}$ to transit to $S_{2}$, then we are at least one step toward the final optimum in $S_{2}$ at this moment; it is not worse, might be the same, or even better than it was in $S_{1}$.

Addendum

Advanced algorithmic analysis, Charles Isbell, Michael Littman, Reinforcement Learning by Georgia Tech (CS8803)