More On Value Iteration
21 Apr 2019
Prologue To More On Value Iteration
$\frac {1}{1-\gamma}$ Bits Of Precision
We’d like to dive in a little to relate the horizon length to convergence, given the following:
[1] There is some time before infinity
➀denote the number of states and actions as $\vert S\vert$ and $\vert A\vert$.
➁also with $R_{max}$=$max_{S,A}\vert R(S,A)\vert$.
➂$\frac {1}{1-\gamma}$ as bits of precision.
➃for some $t^{\ast}$ polynomial in $\vert S\vert$ and $\vert A\vert$:
➀$\pi(S)$=$argmax_{A}Q_{t^{\ast}}(S,A)$ is optimal.
➁take $Q_{t^{\ast}}$ to be the $Q$ function after we iterate $t^{\ast}$ times; we know value iteration converges in the limit, so $Q_{t}$ eventually goes to $Q^{\ast}$.
$\Rightarrow$The hint here is that there is some $t^{\ast}$ of finite horizon, less than infinity, that is polynomial in:
➀the number of states
➁the number of actions
➂the magnitude of the rewards in the reward function
➃the number of bits of precision that are used to specify the transition probabilities
➄the number of bits of $\gamma$ in $\frac {1}{1-\gamma}$
$\Rightarrow$So, if we run value iteration for that many steps, the $Q$ function that we get out is $Q_{t^{\ast}}(S,A)$. If we define a policy $\pi(S)$, which is just the greedy policy with respect to that $Q$ function, then that policy is optimal.
$\Rightarrow$We know that in the limit, if we run value iteration for an infinite number of steps, then the greedy policy with respect to the $Q$ function we get at that point is optimal.
$\Rightarrow$This leads to the finding that there is some time before infinity where we get a $Q$ function that's close enough, so that if you take the greedy policy with respect to it(the $Q$ function), that policy really is optimal. Like all the way $100\%$ optimal.
$\Rightarrow$What that really means is that $t^{\ast}$ is polynomial, so you can actually solve this(all the way to $100\%$ optimal) in a reasonable amount of time.
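As a concrete illustration of the claim in [1], here is a minimal sketch of $Q$-value iteration followed by greedy policy extraction; the two-state MDP, its transition matrix `P`, rewards `R`, and the chosen $\gamma$ are all made-up assumptions for illustration, not something from the original lecture:

```python
import numpy as np

# A tiny made-up MDP: 2 states, 2 actions (all numbers are illustrative assumptions).
# P[a][s][s'] is a transition probability, R[s][a] an immediate reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))                      # Q[s][a], arbitrary initialization
for t in range(1000):                     # a fixed number of sweeps, standing in for t*
    V = Q.max(axis=1)                     # V_t(S) = max_A Q_t(S, A)
    Q = R + gamma * np.einsum('asn,n->sa', P, V)   # Bellman update

pi = Q.argmax(axis=1)                     # greedy policy: pi(S) = argmax_A Q(S, A)
print("greedy policy:", pi)
```

Running it prints the greedy policy $\pi(S)$=$argmax_{A}Q(S,A)$ for each state; after enough sweeps this greedy policy no longer changes, which is exactly the point of the $t^{\ast}$ argument above.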
[2] $\frac {1}{1-\gamma}$ is not quite polynomial
The whole convergence might not be quite polynomial, why? The $\gamma$ in $\frac {1}{1-\gamma}$ is the key factor: as $\gamma\rightarrow 1$, this blows up, and that's not polynomially bounded in, say, the number of bits it takes to write down $\gamma$.
$\Rightarrow$So, it’s exponential in terms of the number of bits it takes to write down the whole problem.
$\Rightarrow$But even so, if you take $\gamma$ to be some fixed constant, $\frac {1}{1-\gamma}$ might be really big, but it's still a fixed constant, and we are polynomial in all the rest: $\vert S\vert$, $\vert A\vert$, $R_{max}$=$max_{S,A}\vert R(S,A)\vert$.
$\Rightarrow$This leads to the idea that once you fix an MDP model and the number of bits of precision for $\gamma$, there is going to be some optimal action, and there might be other actions that are tied for optimal, in any given state.
$\Rightarrow$One thing to note is that the second best action is going to be some bounded amount away from optimal. It can't get arbitrarily close to optimal; it's going to be some distance away. This says that when we run value iteration long enough, eventually the error in our $Q$ values becomes smaller than(half of) this gap between the best action and the second best action.
➀once the error is smaller than that, the greedy policy is going to function as the optimal policy.
➁from that moment on, it's going to choose the best action in all states, just as the optimal policy would.
$\Rightarrow$The greedy policy with respect to value iteration converges in a reasonable amount of time(a small numerical sketch of this argument follows).
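Below is a hedged numerical sketch of the action-gap argument above, on the same style of made-up two-state MDP(all numbers and the helper `bellman` are assumptions for illustration): once the error $\vert\vert Q_{t}-Q^{\ast}\vert\vert_{\infty}$ drops below half the gap between the best and the second best action, the greedy policy coincides with the optimal one.

```python
import numpy as np

# Made-up 2-state, 2-action MDP (illustrative assumptions, not from the lecture).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[a][s][s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s][a]
gamma = 0.9

def bellman(Q):
    """One step of Q-value iteration (Bellman optimality operator)."""
    return R + gamma * np.einsum('asn,n->sa', P, Q.max(axis=1))

# Approximate Q* by running value iteration for a long time.
Q_star = np.zeros((2, 2))
for _ in range(10_000):
    Q_star = bellman(Q_star)

# Action gap: smallest distance between the best and second best action over all states.
sorted_Q = np.sort(Q_star, axis=1)
gap = (sorted_Q[:, -1] - sorted_Q[:, -2]).min()

# Once the Q-value error falls below half the gap, the greedy policy is already optimal.
Q = np.zeros((2, 2))
for t in range(200):
    Q = bellman(Q)
    err = np.abs(Q - Q_star).max()
    if err < gap / 2:
        assert (Q.argmax(axis=1) == Q_star.argmax(axis=1)).all()
        print(f"greedy policy optimal from t = {t}: error {err:.4f} < gap/2 = {gap / 2:.4f}")
        break
```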
The Largest Difference Between Optimal Policy And Greedy Policy: $max_{S}\vert V^{\pi_{V_{t}}}(S)-V^{\ast}(S)\vert$<$\frac {2\cdot\varepsilon\cdot\gamma}{1-\gamma}$
[1] The background
if $\vert V_{t}(S)-V_{t+1}(S)\vert<\varepsilon$,
$\;\;\forall S$, then, $max_{S}\vert V^{\pi_{V_{t}}}(S)-V^{\ast}(S)\vert$<$\frac {2\cdot\varepsilon\cdot\gamma}{1-\gamma}$
Here is the assumption that
➀we run value iteration and get a series of value functions, say $V_{t}$.
➁the amount of change in $V_{t}$ from $t$ to $t+1$ is less than $\varepsilon$ for all states $S$ in this MDP model.
Then, the largest difference between the optimal value function $V^{\ast}(S)$ and the value function we get by following the greedy policy with respect to the value function at time $t$ is small, namely less than $\frac {2\cdot\varepsilon\cdot\gamma}{1-\gamma}$.
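A minimal sketch of how this stopping rule might look in practice, assuming a toy MDP and a helper `bellman_v` that are purely illustrative: stop as soon as two consecutive value functions differ by less than $\varepsilon$ in every state, and the bound above then applies to the greedy policy.

```python
import numpy as np

# Toy MDP (illustrative assumptions): P[a][s][s'] transition probabilities, R[s][a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, eps = 0.9, 1e-3

def bellman_v(V):
    """One sweep of value iteration: V'(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]."""
    return (R + gamma * np.einsum('asn,n->sa', P, V)).max(axis=1)

V = np.zeros(2)
while True:
    V_next = bellman_v(V)
    done = np.abs(V_next - V).max() < eps   # |V_t(S) - V_{t+1}(S)| < eps for all S
    V = V_next
    if done:
        break

# Greedy policy with respect to V_t; its value is within 2*eps*gamma/(1-gamma) of V*.
pi = (R + gamma * np.einsum('asn,n->sa', P, V)).argmax(axis=1)
print("greedy policy:", pi, "  bound:", 2 * eps * gamma / (1 - gamma))
```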
[2] Why does it end up being $\frac {2\cdot\varepsilon\cdot\gamma}{1-\gamma}$?
➀because this little bit of $\varepsilon$ that we could be off gets magnified at every single step as we carry it into the future; summing those discounted errors is what gives the $\frac {1}{1-\gamma}$ factor, that's why we get a $1-\gamma$ in the denominator(see the short sum after this list).
➁the error has $2$ sides, positive and negative; that's why we have $2$ times $\varepsilon$.
➂suppose we are at $t$; the change doesn't take effect right now at $t$, but one step further on, at $t+1$, so we need to multiply by $\gamma$.
If you have a good enough approximation of the value function, then you can at least put a bound on ➀how far off the value function of the resulting greedy policy, $V^{\pi_{V_{t}}}(S)$, is from ➁the value function of the optimal policy, $V^{\ast}(S)$.
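One way to see where the whole bound comes from(a sketch of the geometric-series argument, not a full proof): an error of up to $2\cdot\varepsilon$ per step, discounted starting one step into the future, sums to $2\cdot\varepsilon\cdot\gamma$+$2\cdot\varepsilon\cdot\gamma^{2}$+$2\cdot\varepsilon\cdot\gamma^{3}$+$\cdots$=$2\cdot\varepsilon\cdot\gamma\cdot\frac {1}{1-\gamma}$=$\frac {2\cdot\varepsilon\cdot\gamma}{1-\gamma}$.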
[3] Do we need to know how close we are to $V^{\ast}$?
We don't even need to know how close we are to $V^{\ast}$; all we need is $\vert V_{t}(S)-V_{t+1}(S)\vert$<$\varepsilon$ for all $S$, that is, any two consecutive iterations of value iteration:
➀that just gives us a bound on what happens if we stop now and follow that policy $\pi_{V_{t}}$.
➁knowing that value iteration eventually converges in the limit of $\varepsilon$ is not so helpful, because you don't know when the convergence takes place.
➂$max_{S}\vert V^{\pi_{V_{t}}}(S)-V^{\ast}(S)\vert$<$\frac {2\cdot\varepsilon\cdot\gamma}{1-\gamma}$ is telling us when is a decent time(the right moment) to stop.
[4] Pick a smaller $\gamma$
If $\gamma$ is tiny, then $\frac {1}{1-\gamma}$ stays small, and $\frac {2\cdot\varepsilon\cdot\gamma}{1-\gamma}$ also stays small. It's all good.
A smaller or larger $\gamma$ leads to a trade-off between the horizon and the feasibility of solving the problem in a reasonable amount of time.
➀the closer $\gamma$ gets to $1$, the more problematic your learning convergence would be.
➁$\gamma$ tells you what your horizon is.
➂a smaller $\gamma$ puts less weight on future states relative to the current state, so convergence does not have to reach over too long a horizon.
➃as $\gamma$ gets closer to $1$, $\frac {1}{1-\gamma}$ just explodes; you have to look ahead a lot further, and it becomes a harder problem, since it is over a longer horizon.
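To put rough numbers on ➃: $\gamma$=$0.9$ gives an effective horizon of $\frac {1}{1-\gamma}$=$10$, $\gamma$=$0.99$ gives $100$, and $\gamma$=$0.999$ gives $1000$, so each step of $\gamma$ toward $1$ stretches the horizon the problem reasons over by an order of magnitude.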
$\vert\vert B^{k}Q_{1}-B^{k}Q_{2}\vert\vert_{\infty}\leq\gamma^{k}\vert\vert Q_{1}-Q_{2}\vert\vert_{\infty}$
If you start with $Q_{1}$ and $Q_{2}$ and run $k$ steps of value iteration, in other words apply the Bellman operator $k$ times, then the $k$-step Bellman operator is itself a contraction mapping.
The index of contraction is $\gamma$ raised to the $k$, which is a much, much smaller number.
So running $k$ steps of value iteration actually makes your value functions much closer together than a single step of value iteration does.
Caution must be taken that $\gamma$ raised to the $k$ is not linear in $k$; it goes gamma squared, gamma cubed, ..., gamma to the $k$, and that's the index of contraction. So, running $10$ steps of value iteration is like running $1$ step of the $10$-step Bellman operator, whose contraction index is $\gamma^{10}$, and the difference bound $<\frac {2\cdot\varepsilon\cdot\gamma}{1-\gamma}$ still holds.
$\vert\vert B^{k}Q_{1}-B^{k}Q_{2}\vert\vert_{\infty}\leq\gamma^{k}\vert\vert Q_{1}-Q_{2}\vert\vert_{\infty}$ gives us a way of quantifying how long we run value iteration and connecting it with how close you are to optimal after you run $k$ steps of value iteration.
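As a quick sanity check of this inequality, here is a hedged sketch on a randomly generated MDP(the sizes, seed, and helper `bellman` are assumptions for illustration): apply the Bellman operator $k$ times to two arbitrary $Q$ functions and verify that their $\infty$-norm distance has shrunk by at least a factor of $\gamma^{k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma, k = 5, 3, 0.9, 10

# Random MDP (illustrative assumption): rows of P are normalized into transition probabilities.
P = rng.random((n_a, n_s, n_s))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_s, n_a))

def bellman(Q):
    """Bellman optimality operator B: (BQ)(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')."""
    return R + gamma * np.einsum('asn,n->sa', P, Q.max(axis=1))

Q1, Q2 = rng.random((n_s, n_a)) * 10, rng.random((n_s, n_a)) * 10
d0 = np.abs(Q1 - Q2).max()              # ||Q1 - Q2||_inf
for _ in range(k):                      # apply B k times to both Q functions
    Q1, Q2 = bellman(Q1), bellman(Q2)
dk = np.abs(Q1 - Q2).max()              # ||B^k Q1 - B^k Q2||_inf

print(dk <= gamma**k * d0, dk, gamma**k * d0)
```

The printed comparison should come out `True`, with the contracted distance well below $\gamma^{k}$ times the original one.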