
Model-Based RL Algorithm RMAX - Part 2

Prologue To Model-Based RL Algorithm RMAX - Part 2

The RMAX theorem guarantees that the execution of the algorithm attains an expected return $2\varepsilon$ close to the optimal one in polynomial time, where the algorithm is developed by constructing a simulated model, equipped with an extra fictitious $G_{0}$ state, to approximate the target model. To validate the algorithm, this article proves that the difference between the expected returns of the simulated model and the target model is kept small.

α Approximation

Given that $M$ and $\widetilde{M}$ are 2 distinct stochastic games over the same set of states and actions, then $\widetilde{M}$ is an $\alpha$ approximation of $M$ if the following holds:
➀ for every state $s$, $s'$ and every action pair $(a,a')$:
$P_{M}(s,a,a',s')-\alpha\leq P_{\widetilde{M}}(s,a,a',s')\leq P_{M}(s,a,a',s')+\alpha$
, where $P_{M}(s,a,a',s')$ is the probability of transitioning from state $s$ to $s'$ when the agent chooses action $a$ and the adversary chooses action $a'$ in model $M$; $P_{\widetilde{M}}(s,a,a',s')$ is defined in the same way for $\widetilde{M}$.
➁ for every state $s$, the same stage game is associated with $s$ in both $M$ and $\widetilde{M}$, that is, the rewards are identical.
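
Below is a minimal sketch, not from the paper, of how the two conditions could be checked numerically; the array layout `P[s, a, a', s']`, the function name `is_alpha_approximation` and the reward arrays are assumptions made for illustration only.

```python
import numpy as np

def is_alpha_approximation(P_M, P_M_tilde, R_M, R_M_tilde, alpha):
    """Check the two conditions of the alpha-approximation definition.

    P_M, P_M_tilde: arrays of shape (S, A, A, S), where P[s, a, b, t] is the
    probability of moving from state s to state t when the agent plays a and
    the adversary plays b in the corresponding model.
    R_M, R_M_tilde: the stage-game rewards associated with each state.
    """
    # condition 1: every transition probability of M_tilde lies within +/- alpha of M's
    prob_close = np.all(np.abs(P_M - P_M_tilde) <= alpha)
    # condition 2: the same stage game (identical rewards) is associated with every state
    same_rewards = np.array_equal(R_M, R_M_tilde)
    return bool(prob_close and same_rewards)
```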

Approximation Makes Small Difference

Let $M$ and $\widetilde{M}$ be stochastic games over $N$ states, where $\widetilde{M}$ is an $\frac{\varepsilon}{N\cdot T\cdot R_{max}}$ approximation of $M$, then for every state $s$, agent policy $\pi$ and adversary policy $\rho$, we have that

  • $\vert U_{M}(s,\pi,\rho,T)-U_{\widetilde{M}}(s,\pi,\rho,T)\vert\leq\varepsilon$
    , where $U_{M}(s,\pi,\rho,T)$ is the expected reward in model $M$ with regard to agent policy $\pi$ and adversary policy $\rho$, given that they start from state $s$ and make state transitions over $T$ steps.
proof::mjtsai1974

➀ begin from
$\vert U_{M}(s,\pi,\rho,T)-U_{\widetilde{M}}(s,\pi,\rho,T)\vert$
$=\sum_{p}\vert P_{M}(p)\cdot U_{M}(p)-P_{\widetilde{M}}(p)\cdot U_{\widetilde{M}}(p)\vert$

, where $p$ ranges over the paths that start from state $s$ (there are totally $N$ states) and make state transitions over $T$ steps.

➁ each such expected reward of path $p$ in $M$ could be expressed as:
$\sum_{s'\neq s}P_{M}(s,a,a',s')\cdot U_{M}(s')$

, where $U_{M}(s')=\sum_{i=1}^{T}\gamma^{i-1}\cdot V_{M}(s_{i})$, with $s_{i}$ the state visited at the $i$-th step along the path.
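
As a tiny illustration of this quantity (the helper name and values below are hypothetical, not from the paper), the discounted sum over one sampled path could be computed as:

```python
def discounted_path_value(v_along_path, gamma):
    """Return sum_{i=1}^{T} gamma^(i-1) * V_M(s_i), where v_along_path[i-1]
    plays the role of V_M(s_i) for the states s_1, ..., s_T visited on the path."""
    return sum(gamma ** i * v for i, v in enumerate(v_along_path))

# e.g. a path of T = 3 states with values 1.0, 0.5, 0.25 and gamma = 0.9:
# discounted_path_value([1.0, 0.5, 0.25], 0.9) == 1.0 + 0.9 * 0.5 + 0.81 * 0.25
```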

➂ the original formula in ➀ becomes:
$\sum_{s}\sum_{s'}\big(P_{M}(s,a,a',s')\cdot U_{M}(s')-P_{\widetilde{M}}(s,a,a',s')\cdot U_{\widetilde{M}}(s')\big)$

, where $s'\in\{s_{1},s_{2},...,s_{N}\}$

➃ we are given that $\widetilde{M}$ is an $\alpha$ approximation of $M$, and
$U_{M}(s')=\sum_{i=1}^{T}\gamma^{i-1}\cdot V_{M}(s_{i})\leq\sum_{i=1}^{T}V_{M}(s')$ must hold

$\vert U_{M}(s,\pi,\rho,T)-U_{\widetilde{M}}(s,\pi,\rho,T)\vert$
$=\sum_{s}\sum_{s'}\big(P_{M}(s,a,a',s')\cdot U_{M}(s')-P_{\widetilde{M}}(s,a,a',s')\cdot U_{\widetilde{M}}(s')\big)$
$=\sum_{s}\sum_{s'}\big(P_{M}(s,a,a',s')\cdot\sum_{i=1}^{T}V_{M}(s')-P_{\widetilde{M}}(s,a,a',s')\cdot\sum_{i=1}^{T}V_{\widetilde{M}}(s')\big)$
$=\sum_{s}\sum_{s'}\big((P_{M}(s,a,a',s')-P_{\widetilde{M}}(s,a,a',s'))\cdot T\cdot(V_{M}(s')-V_{\widetilde{M}}(s'))\big)$
$\leq\sum_{s}\sum_{s'}\big(\frac{\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot(V_{M}(s')-V_{\widetilde{M}}(s'))\big)$
…by the given condition that $\widetilde{M}$ is an $\frac{\varepsilon}{N\cdot T\cdot R_{max}}$ approximation of $M$
$\leq\sum_{s}\sum_{s'}\big(\frac{\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot R_{max}\big)$
…all expected rewards are bounded by $R_{max}$ in this algorithm, so is the difference $V_{M}(s')-V_{\widetilde{M}}(s')$
$\leq N\cdot\frac{\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot R_{max}=\varepsilon$
…there are totally $N$ such states $s$, and for each distinct $s$ the transition probabilities over all of its next states $s'$ sum up to 1, thus the double sum $\sum_{s}\sum_{s'}$ contributes a factor of $N$
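
To make the lemma concrete, here is a small numerical sanity check, purely my own sketch: it builds a random stochastic game $M$, mixes in a second random kernel with weight $\alpha=\frac{\varepsilon}{N\cdot T\cdot R_{max}}$ to obtain an $\alpha$ approximation $\widetilde{M}$, and compares the $T$-step expected rewards (taken here as the per-step average, an assumption of this sketch) under one fixed pair of deterministic policies. All sizes, seeds and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for the sanity check (not taken from the paper).
N, A, B, T, R_max = 4, 2, 2, 10, 1.0
eps = 0.1
alpha = eps / (N * T * R_max)            # make M_tilde an eps/(N*T*R_max) approximation

# Random target model M: P_M[s, a, b, :] is a distribution over next states.
P_M = rng.random((N, A, B, N))
P_M /= P_M.sum(axis=-1, keepdims=True)

# Build M_tilde by mixing in another random kernel with weight alpha; this keeps
# every entry within +/- alpha of P_M (condition 1) and keeps each row normalized.
Q = rng.random((N, A, B, N))
Q /= Q.sum(axis=-1, keepdims=True)
P_Mt = (1.0 - alpha) * P_M + alpha * Q

# Identical per-state rewards in both models (condition 2), bounded by R_max.
R = rng.random(N) * R_max

# One fixed pair of deterministic policies: state -> action index.
pi = rng.integers(0, A, size=N)
rho = rng.integers(0, B, size=N)

def avg_T_step_return(P, R, pi, rho, T):
    """Per-step average of the expected T-step reward from every start state,
    computed by backward dynamic programming under the fixed policies pi, rho."""
    n = P.shape[0]
    P_pi = P[np.arange(n), pi, rho, :]   # P_pi[s, s'] = P[s, pi(s), rho(s), s']
    U = np.zeros(n)                      # expected total reward with 0 steps to go
    for _ in range(T):
        U = R + P_pi @ U                 # collect R(s) now, then follow P_pi
    return U / T                         # normalize to a per-step average

U_M = avg_T_step_return(P_M, R, pi, rho, T)
U_Mt = avg_T_step_return(P_Mt, R, pi, rho, T)

gap = np.abs(U_M - U_Mt).max()
print(f"max_s |U_M - U_Mtilde| = {gap:.6f} <= eps = {eps}: {gap <= eps}")
```

Under these assumptions the printed gap should stay well below $\varepsilon$, consistent with the lemma's bound $N\cdot\frac{\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot R_{max}=\varepsilon$.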

Addendum

R-max: A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, Ronen I. Brafman (Computer Science Department, Ben-Gurion University) and Moshe Tennenholtz (Computer Science Department, Stanford University)