
Value Iteration Algorithm Detail


An optimal policy maps each distinct state to an optimal action that maximizes the value of that state over a horizon of some magnitude, or even infinity. Given an optimal policy, it is easy to evaluate the expected value of each possible starting state by executing it. All of the above rests on value iteration.

Value Iteration Algorithm/Flow

Let me express the value iteration flow in this section.

Initialize $V_0(S)=0$ for all $S$.
Loop until $\left\|V_t-V_{t-1}\right\|<\varepsilon\cdot\frac{1-\gamma}\gamma$

Loop for all $S$

\[V_{t+1}(S)\leftarrow R(S)+\underset A{max}\left[\gamma\cdot\sum_{S'}P(S'\left|S,A\right.)\cdot V_{t}(S')\right]\]
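
To make the flow concrete, here is a minimal Python sketch of this exact loop and stopping test. It is not from the original post; the array shapes of `P` (actions × states × states) and `R` (one reward per state) are my own assumptions:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-3):
    """Synchronous value iteration (a sketch, assuming P is (A, S, S) and R is (S,)).

    Stops when max over S of |V_t(S) - V_{t-1}(S)| < eps * (1 - gamma) / gamma.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                 # V_0(S) = 0 for all S
    threshold = eps * (1.0 - gamma) / gamma
    while True:
        # Bellman backup: V_{t+1}(S) = R(S) + max_A [ gamma * sum_{S'} P(S'|S,A) * V_t(S') ]
        Q = R[None, :] + gamma * (P @ V)   # shape (A, S)
        V_next = Q.max(axis=0)
        if np.max(np.abs(V_next - V)) < threshold:
            return V_next
        V = V_next
```

For example, a two-state, one-action chain with `P = np.array([[[0.9, 0.1], [0.0, 1.0]]])` and `R = np.array([1.0, 0.0])` converges to approximately $V\approx[5.26,\;0.0]$.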

Such a flow comes with the features below:
➀ $V_t$ converges to some optimized value, say $V$.
➁ no need to keep $V_{t+1}$ v.s. $V_t$, since it just converges.
➂ asynchronous (can do random state updates; see the sketch after this list).
➃ we want $\left\|V_t-V\right\|=\underset S{max}\left|V_t(S)-V(S)\right|<\varepsilon$,
then the corresponding stopping criterion on successive iterates is $\left\|V_t-V_{t-1}\right\|<\varepsilon\cdot\frac{1-\gamma}\gamma$. I'd like to show you ➃:
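
As for feature ➂, the sketch below (again my own illustration, with the same assumed shapes for `P` and `R` as above) backs up one randomly chosen state at a time instead of sweeping over all states synchronously:

```python
import numpy as np

def async_value_iteration(P, R, gamma=0.9, n_updates=10000, rng=None):
    """Asynchronous value iteration sketch: random single-state Bellman backups."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_updates):
        s = rng.integers(n_states)                      # pick a random state
        V[s] = R[s] + gamma * np.max(P[:, s, :] @ V)    # in-place backup of that state
    return V
```

Provided every state keeps being selected, these in-place updates converge to the same $V$ as the synchronous loop.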

proof:
begin from $\left\|V_t-V_{t+1}\right\|=\gamma\cdot\left\|V_{t-1}-V_t\right\|$
…this holds w.r.t. the above value iteration

next, focus on $\varepsilon\cdot(1-\gamma)$, where $V_t=\gamma\cdot V_{t-1}$
…the infinite horizon expected discounted reward

therefore, $V_{t-1}-V_t=V_{t-1}\cdot(1-\gamma)$

next, express $unknown$ in terms of $\varepsilon$ and $1-\gamma$; this should hold, where $\varepsilon$ is a small and unknown coefficient by design.

next, take $unknown=\varepsilon\cdot(1-\gamma)$, because $V_{t-1}-V_t$ could be expressed with a $(1-\gamma)$ term in it. Then, the whole inequality becomes:
\[\begin{array}{l}\gamma\cdot V_{t-1}\cdot(1-\gamma)<unknown\\\Rightarrow\gamma\cdot V_{t-1}\cdot(1-\gamma)<\varepsilon\cdot(1-\gamma)\\\Rightarrow\gamma\cdot\left\|V_{t-1}-V_t\right\|<\varepsilon\cdot(1-\gamma)\\\Rightarrow\left\|V_{t-1}-V_t\right\|<\varepsilon\cdot\frac{1-\gamma}\gamma\\\Rightarrow\left\|V_{t+1}-V_t\right\|<\varepsilon\cdot\frac{1-\gamma}\gamma\end{array}\]
as $t$ increases, we will have $V_{t-1}\approx V_t\approx V_{t+1}$.
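
As a complementary check (not part of the original derivation), the same threshold also follows from the standard fact that the Bellman backup is a $\gamma$-contraction in the max norm with fixed point $V$:
\[\begin{array}{l}\left\|V_t-V\right\|\leq\gamma\cdot\left\|V_{t-1}-V\right\|\leq\gamma\cdot\left(\left\|V_{t-1}-V_t\right\|+\left\|V_t-V\right\|\right)\\\Rightarrow\left\|V_t-V\right\|\leq\frac\gamma{1-\gamma}\cdot\left\|V_{t-1}-V_t\right\|\end{array}\]
so stopping once $\left\|V_{t-1}-V_t\right\|<\varepsilon\cdot\frac{1-\gamma}\gamma$ guarantees $\left\|V_t-V\right\|<\varepsilon$. For instance, with $\gamma=0.9$ and $\varepsilon=0.01$, the loop stops once successive value functions differ by less than $0.01\cdot\frac{0.1}{0.9}\approx 0.0011$.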