
Model-Based RL Algorithm RMAX - Part 2

Prologue To Model-Based RL Algorithm RMAX - Part 2

The RMAX theorem guarantees that executing the algorithm attains an expected return no more than $2\cdot\varepsilon$ below the optimum in polynomial time, where the algorithm is developed by constructing a simulated model, augmented with an extra fictitious state $G_{0}$, to approximate the target model. To validate the algorithm, this article proves that the difference between the expected returns of the simulated model and the target model is infinitesimally small.

$\alpha$ Approximation

Given 2 distinct stochastic games $M$ and $M^{'}$ over the same set of states and actions, $M^{'}$ is an $\alpha$ approximation of $M$ if the following holds for every state $s$:
➀$P_{M}(s,a,a^{'},s^{'})-\alpha\leq P_{M^{'}}(s,a,a^{'},s^{'})\leq P_{M}(s,a,a^{'},s^{'})+\alpha$
, where $P_{M}(s,a,a^{'},s^{'})$ is the probability of transitioning from state $s$ to $s^{'}$ when the agent chooses action $a$ and the adversary chooses action $a^{'}$ in model $M$; $P_{M^{'}}(s,a,a^{'},s^{'})$ is defined in the same way for $M^{'}$.
➁for every state $s$, the same stage game is associated with $s$ in both $M$ and $M^{'}$, so the rewards are identical.
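
As a quick sanity check, condition ➀ could be verified entry-wise with a tiny Python sketch like the one below (the function name and the toy dimensions are illustrative assumptions, not from the paper):

```python
import numpy as np

def is_alpha_approximation(P_M, P_M_prime, alpha):
    """Entry-wise check of condition ➀: P_M - alpha <= P_M' <= P_M + alpha
    for every (s, a, a', s'); condition ➁ (identical rewards) is assumed to hold."""
    return bool(np.all(np.abs(P_M_prime - P_M) <= alpha))

# toy example: 3 states, 2 agent actions, 2 adversary actions (dimensions are illustrative)
rng = np.random.default_rng(0)
P_M = rng.dirichlet(np.ones(3), size=(3, 2, 2))   # P_M[s, a, a'] is a distribution over s'

# perturb M toward the uniform distribution with a small mixing weight,
# so each entry of M' stays within 0.02 of the corresponding entry of M
delta = 0.02
P_M_prime = (1.0 - delta) * P_M + delta / 3.0

print(is_alpha_approximation(P_M, P_M_prime, alpha=0.05))   # True
```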

Approximation Makes Small Difference

Let $M$ and $M^{'}$ be stochastic games over $N$ states, where $M^{'}$ is an $\frac {\varepsilon}{N\cdot T\cdot R_{max}}$ approximation of $M$. Then for every state $s$, agent policy $\pi$ and adversary policy $\rho$, we have that

  • $\left|U_{M^{'}}(s,\pi,\rho,T)-U_{M}(s,\pi,\rho,T)\right|\leq\varepsilon$
    , where $U_{M}(s,\pi,\rho,T)$ is the optimal expected reward with regards to agent policy $\pi$ and adversary policy $\rho$, starting from state $s$ and making state transitions over $T$ steps.
proof::mjtsai1974

➀begin from
$\left|U_{M^{'}}(s,\pi,\rho,T)-U_{M}(s,\pi,\rho,T)\right|$
$\leq\sum_{p}\left|P_{M^{'}}(p)\cdot U_{M^{'}}(p)-P_{M}(p)\cdot U_{M}(p)\right|$
$\;\;$…by the triangle inequality

, where $p$ ranges over the paths that start from state $s$ (there are $N$ states in total) and make state transitions over $T$ steps.
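
Here $P_{M}(p)$ denotes the probability that, under $(\pi,\rho)$, the play follows path $p$=$(s_{0},s_{1},\cdots,s_{T})$ with $s_{0}$=$s$; assuming deterministic policies with $a_{i}$=$\pi(s_{i-1})$ and $a_{i}^{'}$=$\rho(s_{i-1})$, it unrolls into a product of one-step transition probabilities:
$P_{M}(p)$=$\prod_{i=1}^{T}P_{M}(s_{i-1},a_{i},a_{i}^{'},s_{i})$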

➁each such optimal expected reward of path $p$ in $M^{'}$ could be expressed as:
$\sum_{s}\sum_{s^{'}}P_{M^{'}}(s,a,a^{'},s^{'})\cdot U_{M^{'}}(s^{'})$

, where $U_{M^{'}}(s^{'})$=$\sum_{i=1}^{T}\gamma^{i-1}\cdot V_{M^{'}}(s_{i}^{'})$
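
To make the notation concrete, the discounted $T$-step sum above could be computed by a small sketch like this (the value table $V$ is assumed to be given; all names are illustrative):

```python
def discounted_path_return(V, path, gamma, T):
    """U(s') = sum_{i=1..T} gamma^(i-1) * V(s_i'),
    where path = [s_1', s_2', ..., s_T'] and V[s] is the given per-state value."""
    return sum(gamma ** (i - 1) * V[path[i - 1]] for i in range(1, T + 1))

# tiny usage example with 3 states and T = 3
V = {0: 1.0, 1: 0.5, 2: 0.0}
print(discounted_path_return(V, path=[0, 1, 2], gamma=0.9, T=3))   # 1.0 + 0.9*0.5 + 0.0 = 1.45
```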

➂the original formula in ➀ becomes:
$\sum_{s}\sum_{s^{'}}(P_{M^{'}}(s,a,a^{'},s^{'})\cdot U_{M^{'}}(s^{'})$-$P_{M}(s,a,a^{'},s^{'})\cdot U_{M}(s^{'}))$

, where $s$=$\{s_{1},s_{2},\cdots,s_{N}\}$

➃we are given that $M^{'}$ is an $\alpha$ approximation of $M$, and, since $0<\gamma\leq 1$,
$U_{M^{'}}(s^{'})$=$\sum_{i=1}^{T}\gamma^{i-1}\cdot V_{M^{'}}(s_{i}^{'})\leq\sum_{i=1}^{T}V_{M^{'}}(s^{'})$ must hold
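
In effect, since every per-step value is bounded by $R_{max}$ in this algorithm, the $T$-step return of any path is further bounded as below, which is the bound invoked in step ➄:
$U_{M^{'}}(s^{'})$=$\sum_{i=1}^{T}\gamma^{i-1}\cdot V_{M^{'}}(s_{i}^{'})\leq\sum_{i=1}^{T}R_{max}$=$T\cdot R_{max}$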

➄$\left|U_{M^{'}}(s,\pi,\rho,T)-U_{M}(s,\pi,\rho,T)\right|$
=$\sum_{s}\sum_{s^{'}}(P_{M^{'}}(s,a,a^{'},s^{'})\cdot U_{M^{'}}(s^{'})$-$P_{M}(s,a,a^{'},s^{'})\cdot U_{M}(s^{'}))$
=$\sum_{s}\sum_{s^{'}}(P_{M^{'}}(s,a,a^{'},s^{'})\cdot\sum_{i=1}^{T}V_{M^{'}}(s^{'})$-$P_{M}(s,a,a^{'},s^{'})\cdot\sum_{i=1}^{T}V_{M}(s^{'}))$
=$\sum_{s}\sum_{s^{'}}((P_{M^{'}}(s,a,a^{'},s^{'})$-$P_{M}(s,a,a^{'},s^{'}))\cdot T\cdot(V_{M^{'}}(s^{'})-V_{M}(s^{'})))$
$\leq\sum_{s}\sum_{s^{'}}(\frac {\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot(V_{M^{'}}(s^{'})-V_{M}(s^{'})))$
$\;\;$…by the assumption that $M^{'}$ is an $\frac {\varepsilon}{N\cdot T\cdot R_{max}}$ approximation of $M$
$\leq\sum_{s}\sum_{s^{'}}(\frac {\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot R_{max})$
$\;\;$…all optimal expected rewards are bounded by $R_{max}$ in this algorithm, hence so is their difference
$\leq N\cdot\frac {\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot R_{max}$=$\varepsilon$
$\;\;$…there are $N$ such states in total; each distinct $s$ has many possible next states $s^{'}$, and the transition probabilities out of the same $s$ sum to $1$, thus $\sum_{s}\sum_{s^{'}}$ contributes at most a factor of $N$
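
To see the lemma in action, here is a small numerical sketch (a single-agent simplification with no adversary; all names and dimensions are my own, not from the paper) that perturbs every transition probability of a random model by at most $\alpha$=$\frac {\varepsilon}{N\cdot T\cdot R_{max}}$ and checks that the $T$-step expected returns differ by far less than $\varepsilon$:

```python
import numpy as np

def t_step_return(P, R, s0, policy, T, gamma=1.0):
    """Expected T-step return U(s0) = sum_{i=1..T} gamma^(i-1) * E[R(s_i)],
    propagating the state distribution through P under a fixed deterministic policy."""
    N = P.shape[0]
    dist = np.zeros(N)
    dist[s0] = 1.0
    total = 0.0
    for i in range(1, T + 1):
        dist = dist @ P[np.arange(N), policy]   # one transition step under the policy
        total += gamma ** (i - 1) * (dist @ R)
    return total

rng = np.random.default_rng(1)
N, A, T, R_max, eps = 4, 2, 5, 1.0, 0.1
alpha = eps / (N * T * R_max)

P_M = rng.dirichlet(np.ones(N), size=(N, A))    # P_M[s, a] is a distribution over next states
R = rng.uniform(0.0, R_max, size=N)             # rewards, identical in M and M' (condition ➁)
policy = rng.integers(0, A, size=N)             # a fixed deterministic agent policy

# build M' as an alpha approximation of M by mixing each row toward the uniform distribution,
# so that |P_M'(s, a, s') - P_M(s, a, s')| <= alpha for every entry
P_Mp = (1.0 - alpha) * P_M + alpha / N

diff = max(abs(t_step_return(P_Mp, R, s, policy, T) - t_step_return(P_M, R, s, policy, T))
           for s in range(N))
print(diff <= eps)                              # True: the T-step returns differ by at most eps
```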

Addendum

R-max: A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, Ronen I. Brafman (Computer Science, Ben-Gurion University) and Moshe Tennenholtz (Computer Science, Stanford University)