
Value Iteration Algorithm Detail


An optimal policy maps each distinct state to an optimal action that maximizes the value of that state over a horizon of some magnitude, or even infinity. Given an optimal policy, it is easy to evaluate the expected value of each possible starting state by executing it. All of the above rests on value iteration.

Value Iteration Algorithm/Flow

Let me express the value iteration flow in this section.

Initialize $V_0(S)=0$ for all $S$.
Loop until $\left\|V_t-V_{t-1}\right\|<\varepsilon\cdot\frac{1-\gamma}\gamma$

Loop for all $S$

\[V_{t+1}(S)\leftarrow R(S)+\underset A{max}\left[\gamma\cdot\sum_{S'}P(S'\left|S,A\right.)\cdot V_{t}(S')\right]\]
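
To make the flow concrete, here is a minimal Python sketch of this exact loop and stopping test. It is not from the original post; the array shapes of `P` (actions × states × states) and `R` (one reward per state) are my own assumptions:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-3):
    """Synchronous value iteration (a sketch, assuming P is (A, S, S) and R is (S,)).

    Stops when max over S of |V_t(S) - V_{t-1}(S)| < eps * (1 - gamma) / gamma.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                 # V_0(S) = 0 for all S
    threshold = eps * (1.0 - gamma) / gamma
    while True:
        # Bellman backup: V_{t+1}(S) = R(S) + max_A [ gamma * sum_{S'} P(S'|S,A) * V_t(S') ]
        Q = R[None, :] + gamma * (P @ V)   # shape (A, S)
        V_next = Q.max(axis=0)
        if np.max(np.abs(V_next - V)) < threshold:
            return V_next
        V = V_next
```

For example, a two-state, one-action chain with `P = np.array([[[0.9, 0.1], [0.0, 1.0]]])` and `R = np.array([1.0, 0.0])` converges to approximately $V\approx[5.26,\;0.0]$.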

Such a flow comes with the features below:
➀ $V_t$ converges to some optimized value, say $V$.
➁ no need to keep $V_{t+1}$ v.s. $V_t$, since it just converges.
➂ asynchronous (can do random state updates; see the sketch after this list).
➃ we want $\left\|V_t-V\right\|=\underset S{max}\left|V_t(S)-V(S)\right|<\varepsilon$,
then the corresponding stopping criterion on successive iterates is $\left\|V_t-V_{t-1}\right\|<\varepsilon\cdot\frac{1-\gamma}\gamma$. I'd like to show you ➃:
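
As for feature ➂, the sketch below (again my own illustration, with the same assumed shapes for `P` and `R` as above) backs up one randomly chosen state at a time instead of sweeping over all states synchronously:

```python
import numpy as np

def async_value_iteration(P, R, gamma=0.9, n_updates=10000, rng=None):
    """Asynchronous value iteration sketch: random single-state Bellman backups."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_updates):
        s = rng.integers(n_states)                      # pick a random state
        V[s] = R[s] + gamma * np.max(P[:, s, :] @ V)    # in-place backup of that state
    return V
```

Provided every state keeps being selected, these in-place updates converge to the same $V$ as the synchronous loop.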

proof:
begin from $\left\|V_t-V_{t+1}\right\|=\gamma\cdot\left\|V_{t-1}-V_t\right\|$
…this holds w.r.t. the above value iteration

next, focus on $\varepsilon\cdot(1-\gamma)$, where $V_t=\gamma\cdot V_{t-1}$
…the infinite horizon expected discounted reward

therefore, $V_{t-1}-V_t=V_{t-1}\cdot(1-\gamma)$

next, express $unknown$ in terms of $\varepsilon$ and $1-\gamma$; this should hold, where $\varepsilon$ is a small and unknown coefficient by design.

next, take $unknown=\varepsilon\cdot(1-\gamma)$, because $V_{t-1}-V_t$ could be expressed with a $(1-\gamma)$ term in it. Then, the whole inequality becomes:
\[\begin{array}{l}\gamma\cdot V_{t-1}\cdot(1-\gamma)<unknown\\\Rightarrow\gamma\cdot V_{t-1}\cdot(1-\gamma)<\varepsilon\cdot(1-\gamma)\\\Rightarrow\gamma\cdot\left\|V_{t-1}-V_t\right\|<\varepsilon\cdot(1-\gamma)\\\Rightarrow\left\|V_{t-1}-V_t\right\|<\varepsilon\cdot\frac{1-\gamma}\gamma\\\Rightarrow\left\|V_{t+1}-V_t\right\|<\varepsilon\cdot\frac{1-\gamma}\gamma\end{array}\]
as $t$ increases, we will have $V_{t-1}\approx V_t\approx V_{t+1}$.
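
As a complementary check (not part of the original derivation), the same threshold also follows from the standard fact that the Bellman backup is a $\gamma$-contraction in the max norm with fixed point $V$:
\[\begin{array}{l}\left\|V_t-V\right\|\leq\gamma\cdot\left\|V_{t-1}-V\right\|\leq\gamma\cdot\left(\left\|V_{t-1}-V_t\right\|+\left\|V_t-V\right\|\right)\\\Rightarrow\left\|V_t-V\right\|\leq\frac\gamma{1-\gamma}\cdot\left\|V_{t-1}-V_t\right\|\end{array}\]
so stopping once $\left\|V_{t-1}-V_t\right\|<\varepsilon\cdot\frac{1-\gamma}\gamma$ guarantees $\left\|V_t-V\right\|<\varepsilon$. For instance, with $\gamma=0.9$ and $\varepsilon=0.01$, the loop stops once successive value functions differ by less than $0.01\cdot\frac{0.1}{0.9}\approx 0.0011$.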