Model-Based RL Algorithm RMAX - Part 2
23 Feb 2020
Prologue To Model-Based RL Algorithm RMAX - Part 2
$\alpha$ Approximation
Given that $M$ and $M^{'}$ are 2 distinct stochastic games over the same set of states and actions, $M^{'}$ is an $\alpha$ approximation of $M$ if the following holds for every state $s$:
➀$P_{M}(s,a,a^{'},s^{'})-\alpha\leq P_{M^{'}}(s,a,a^{'},s^{'})\leq P_{M}(s,a,a^{'},s^{'})+\alpha$
, where $P_{M}(s,a,a^{'},s^{'})$ is the probability that the agent transits from state $s$ to $s^{'}$ when it chooses action $a$ and the adversary chooses action $a^{'}$ in model $M$; $P_{M^{'}}(s,a,a^{'},s^{'})$ is defined in the same way for model $M^{'}$.
➁for every state $s$, the same stage game is associated with $s$ in both $M$ and $M^{'}$; in particular, the rewards are identical.
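The two conditions above lend themselves to a direct check. Below is a minimal sketch, assuming the transition functions of both games are stored as NumPy arrays indexed by $(s,a,a^{'},s^{'})$ and the stage-game rewards as per-state arrays; the function name and array layout are illustrative, not part of the RMAX paper.

```python
import numpy as np

def is_alpha_approximation(P_M, P_Mp, R_M, R_Mp, alpha):
    """Return True if stochastic game M' is an alpha approximation of M.

    P_M, P_Mp : arrays of shape (S, A, A, S) holding P(s, a, a', s'),
                the transition probability for agent action a and
                adversary action a'.
    R_M, R_Mp : stage-game rewards attached to every state.
    alpha     : allowed per-entry deviation of the transition probabilities.
    """
    # Condition ①: every transition probability of M' lies within ±alpha of M's.
    transitions_close = np.all(np.abs(P_Mp - P_M) <= alpha)
    # Condition ②: the stage game (hence the rewards) at every state is identical.
    rewards_identical = np.array_equal(R_M, R_Mp)
    return bool(transitions_close and rewards_identical)
```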
Approximation Makes Small Difference
Let $M$ and $M^{'}$ be stochastic games over $N$ states, where $M^{'}$ is an $\frac {\varepsilon}{N\cdot T\cdot R_{max}}$ approximation of $M$. Then for every state $s$, agent policy $\pi$, and adversary policy $\rho$, we have that
proof::mjtsai1974
- $\left|U_{M^{'}}(s,\pi,\rho,T)-U_{M}(s,\pi,\rho,T)\right|\leq\varepsilon$
, where $U_{M}(s,\pi,\rho,T)$ is the optimal expected reward with regard to agent policy $\pi$ and adversary policy $\rho$, given that they start from state $s$ and make state transitions over $T$ steps.
➀begin from
$\left|U_{M^{'}}(s,\pi,\rho,T)-U_{M}(s,\pi,\rho,T)\right|$
=$\sum_{p}\left|P_{M^{'}}(p)\cdot U_{M^{'}}(p) - P_{M}(p)\cdot U_{M}(p)\right|$, where $p$ ranges over the paths that start from state $s$ (one of the $N$ states in total) and make $T$ state transitions.
➁each such optimal expected reward of path $p$ in $M^{'}$ could be expressed as:
$\sum_{s}\sum_{s^{'}}P_{M^{'}}(s,a,a^{'},s^{'})\cdot U_{M^{'}}(s^{'})$, where $U_{M^{'}}(s^{'})$=$\sum_{i=1}^{T}\gamma^{i-1}\cdot V_{M^{'}}(s_{i}^{'})$
➂the original formula in ➀ then becomes:
$\sum_{s}\sum_{s^{'}}(P_{M^{'}}(s,a,a^{'},s^{'})\cdot U_{M^{'}}(s^{'})$-$P_{M}(s,a,a^{'},s^{'})\cdot U_{M}(s^{'}))$, where $s$ ranges over $\{s_{1},s_{2},\dots,s_{N}\}$
➃we are given that $M^{'}$ is an $\alpha$ approximation of $M$, and
$U_{M^{'}}(s^{'})$=$\sum_{i=1}^{T}\gamma^{i-1}\cdot V_{M^{'}}(s_{i}^{'})\leq\sum_{i=1}^{T}V_{M^{'}}(s^{'})$ must hold
➄$\left|U_{M^{'}}(s,\pi,\rho,T)-U_{M}(s,\pi,\rho,T)\right|$
=$\sum_{s}\sum_{s^{'}}(P_{M^{'}}(s,a,a^{'},s^{'})\cdot U_{M^{'}}(s^{'})$-$P_{M}(s,a,a^{'},s^{'})\cdot U_{M}(s^{'}))$
=$\sum_{s}\sum_{s^{'}}(P_{M^{'}}(s,a,a^{'},s^{'})\cdot\sum_{i=1}^{T}V_{M^{'}}(s^{'})$-$P_{M}(s,a,a^{'},s^{'})\cdot\sum_{i=1}^{T}V_{M}(s^{'}))$
=$\sum_{s}\sum_{s^{'}}((P_{M^{'}}(s,a,a^{'},s^{'})-P_{M}(s,a,a^{'},s^{'}))\cdot T\cdot(V_{M^{'}}(s^{'})-V_{M}(s^{'})))$
$\leq\sum_{s}\sum_{s^{'}}(\frac {\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot(V_{M^{'}}(s^{'})-V_{M}(s^{'})))$
$\;\;$…since $M^{'}$ is given to be an $\frac {\varepsilon}{N\cdot T\cdot R_{max}}$ approximation of $M$
$\leq\sum_{s}\sum_{s^{'}}(\frac {\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot R_{max})$
$\;\;$…all optimal expected rewards are bounded by $R_{max}$ in this algorithm, and so is their difference
$\leq N\cdot\frac {\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot R_{max}$=$\varepsilon$
$\;\;$…there are $N$ such states in total; for each distinct $s$, the transition probabilities over all of its next states $s^{'}$ sum up to $1$, thus $\sum_{s}\sum_{s^{'}}$=$N$
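As a quick numerical sanity check of this lemma, the sketch below (all names and sizes are arbitrary choices of mine, not from the paper) fixes both the agent and adversary policies, so each game collapses to a Markov chain over the $N$ states, builds $M^{'}$ by nudging every transition probability of $M$ by at most $\frac {\varepsilon}{N\cdot T\cdot R_{max}}$, and compares the expected $T$-step rewards; the lemma says the observed gap should stay within $\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)

def t_step_reward(P, r, T, gamma=1.0):
    """Expected cumulative reward over T steps from every start state,
    for the Markov chain induced once both policies are fixed."""
    U = np.zeros(len(r))
    Pt = np.eye(len(r))                  # t-step transition matrix, starts at t=0
    for t in range(T):
        U += (gamma ** t) * (Pt @ r)     # reward collected at step t
        Pt = Pt @ P                      # advance one more step
    return U

# A small random stochastic game with both policies already fixed, so only
# the induced N x N transition matrix matters.
N, T, R_max, eps = 6, 10, 1.0, 0.05
alpha = eps / (N * T * R_max)            # the approximation level in the lemma

P = rng.random((N, N))
P /= P.sum(axis=1, keepdims=True)        # transition matrix of M
r = rng.random(N) * R_max                # per-state rewards, bounded by R_max

Q = rng.random((N, N))
Q /= Q.sum(axis=1, keepdims=True)
P_prime = (1 - alpha) * P + alpha * Q    # M': every entry moves by at most alpha
assert np.abs(P_prime - P).max() <= alpha + 1e-12

gap = np.abs(t_step_reward(P_prime, r, T) - t_step_reward(P, r, T)).max()
print(f"max |U_M'(s) - U_M(s)| = {gap:.6f}, epsilon = {eps}")
```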