Model-Based RL Algorithm RMAX - Part 2
23 Feb 2020

Prologue To Model-Based RL Algorithm RMAX - Part 2
α Approximation
Given that $M$ and $M'$ are two distinct stochastic games over the same set of states and actions, $M'$ is an $\alpha$-approximation of $M$ if the following holds for every state $s$:
➀ $P_M(s,a,a',s')-\alpha\le P_{M'}(s,a,a',s')\le P_M(s,a,a',s')+\alpha$
, where $P_M(s,a,a',s')$ is the probability that the agent transits from state $s$ to $s'$ when the agent chooses action $a$ and the adversary chooses action $a'$ in model $M$; $P_{M'}(s,a,a',s')$ denotes the same transition probability in model $M'$.
➁ For every state $s$, the same stage game is associated with $s$ in $M$ and $M'$, so the rewards are identical.
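Condition ➀ can be checked mechanically. Below is a minimal sketch in Python; the dict encoding of $P_M(s,a,a',s')$ and the function name are illustrative choices of mine, not part of the R-MAX paper:

```python
def is_alpha_approximation(P_M, P_Mp, alpha):
    """Check condition ➀: for every (s, a, a', s'), the transition
    probability of M' lies within ±alpha of that of M.
    Both models are dicts mapping (s, a, a', s_next) -> probability."""
    return all(abs(P_Mp[k] - P_M[k]) <= alpha for k in P_M)

# Toy 2-state game with one agent action a and one adversary action b,
# where M' perturbs each transition probability of M by 0.05.
P_M  = {("s1", "a", "b", "s1"): 0.70, ("s1", "a", "b", "s2"): 0.30,
        ("s2", "a", "b", "s1"): 0.40, ("s2", "a", "b", "s2"): 0.60}
P_Mp = {("s1", "a", "b", "s1"): 0.75, ("s1", "a", "b", "s2"): 0.25,
        ("s2", "a", "b", "s1"): 0.35, ("s2", "a", "b", "s2"): 0.65}

print(is_alpha_approximation(P_M, P_Mp, alpha=0.06))  # True
print(is_alpha_approximation(P_M, P_Mp, alpha=0.01))  # False
```

Condition ➁ is a plain equality check on the stage-game rewards, so it is omitted here.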
Approximation Makes Small Difference
Let $M$ and $M'$ be stochastic games over $N$ states, where $M'$ is an $\frac{\varepsilon}{N\cdot T\cdot R_{max}}$-approximation of $M$. Then for every state $s$, agent policy $\pi$ and adversary policy $\rho$, we have that
proof::mjtsai1974
- $\vert U_{M'}(s,\pi,\rho,T)-U_M(s,\pi,\rho,T)\vert\le\varepsilon$
, where $U_M(s,\pi,\rho,T)$ is the expected reward with regard to agent policy $\pi$ and adversary policy $\rho$, given that they start from state $s$ and make state transitions over $T$ steps.

➀ Begin from
$\vert U_{M'}(s,\pi,\rho,T)-U_M(s,\pi,\rho,T)\vert$
$=\sum_p\vert P_{M'}(p)\cdot U_{M'}(p)-P_M(p)\cdot U_M(p)\vert$, where $p$ ranges over the paths starting from state $s$ (over the $N$ states in total) that make state transitions of $T$ steps.
➁ The expected reward of each such path $p$ in $M'$ can be expressed as:
$\sum_s\sum_{s'}P_{M'}(s,a,a',s')\cdot U_{M'}(s')$, where $U_{M'}(s')=\sum_{i=1}^{T}\gamma^{i-1}\cdot V_{M'}(s'_i)$
➂ The formula in ➀ then becomes:
$\sum_s\sum_{s'}\left(P_{M'}(s,a,a',s')\cdot U_{M'}(s')-P_M(s,a,a',s')\cdot U_M(s')\right)$, where $s\in\{s_1,s_2,\dots,s_N\}$
➃ We are given that $M'$ is an $\alpha$-approximation of $M$, and since $0<\gamma\le 1$,
$U_{M'}(s')=\sum_{i=1}^{T}\gamma^{i-1}\cdot V_{M'}(s'_i)\le\sum_{i=1}^{T}V_{M'}(s')$ must hold.

➄ $\vert U_{M'}(s,\pi,\rho,T)-U_M(s,\pi,\rho,T)\vert$
$=\left\vert\sum_s\sum_{s'}\left(P_{M'}(s,a,a',s')\cdot U_{M'}(s')-P_M(s,a,a',s')\cdot U_M(s')\right)\right\vert$
$\le\sum_s\sum_{s'}\left\vert P_{M'}(s,a,a',s')-P_M(s,a,a',s')\right\vert\cdot T\cdot R_{max}$
…decompose $P_{M'}\cdot U_{M'}-P_M\cdot U_M=(P_{M'}-P_M)\cdot U_{M'}+P_M\cdot(U_{M'}-U_M)$; every one-step value in this algorithm is bounded by $R_{max}$, so each $T$-step value satisfies $U(s')\le T\cdot R_{max}$, and the residual value-difference term is handled by unfolding the same decomposition over the remaining steps
$\le\sum_s\sum_{s'}\frac{\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot R_{max}$
…by the given that $M'$ is an $\frac{\varepsilon}{N\cdot T\cdot R_{max}}$-approximation of $M$
$=N\cdot\frac{\varepsilon}{N\cdot T\cdot R_{max}}\cdot T\cdot R_{max}=\varepsilon$
…there are $N$ states in total, and for each state $s$ the transition probabilities over all its successor states $s'$ sum to $1$, hence $\sum_s\sum_{s'}$ contributes a factor of $N$
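The bound can also be sanity-checked numerically. The sketch below is my own toy check, not part of the proof: fixing both policies turns each model into a Markov chain, so the $T$-step expected reward can be computed exactly; $M'$ is built as an $\frac{\varepsilon}{N\cdot T\cdot R_{max}}$-perturbation of $M$, and the value gap indeed stays within $\varepsilon$:

```python
def t_step_value(P, R, s0, T):
    """Exact T-step expected reward (gamma = 1) from state s0 for a
    fixed-policy chain: P[s][s_next] transition probs, R[s] reward."""
    states = list(P)
    # dist[s] = probability of occupying state s at the current step
    dist = {s: (1.0 if s == s0 else 0.0) for s in states}
    total = 0.0
    for _ in range(T):
        total += sum(dist[s] * R[s] for s in states)
        dist = {s2: sum(dist[s] * P[s][s2] for s in states) for s2 in states}
    return total

N, T, Rmax, eps = 2, 4, 1.0, 0.4
alpha = eps / (N * T * Rmax)          # the lemma's approximation level, 0.05

R = {"s1": 1.0, "s2": 0.0}            # rewards bounded by Rmax
P = {"s1": {"s1": 0.7, "s2": 0.3},    # model M
     "s2": {"s1": 0.4, "s2": 0.6}}
Pp = {"s1": {"s1": 0.7 + alpha, "s2": 0.3 - alpha},   # model M', alpha-perturbed
      "s2": {"s1": 0.4 - alpha, "s2": 0.6 + alpha}}

diff = abs(t_step_value(Pp, R, "s1", T) - t_step_value(P, R, "s1", T))
print(diff <= eps)  # True: the T-step value gap is within epsilon
```

One perturbed chain does not prove the lemma, of course; it only illustrates the scale of the $\varepsilon=N\cdot\alpha\cdot T\cdot R_{max}$ accounting in step ➄.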