Model-Based RL Algorithm RMAX - Part 4
26 Mar 2020
Prologue To Model-Based RL Algorithm RMAX - Part 4
The RMAX Theorem
[Theorem of optimality and convergence] Given the conditions below:
➀let M be the stochastic game with N states and k actions.
➁let $0<\delta<1$ and $\varepsilon>0$ be constants, where δ is the error probability and ε is the error term.
➂denote by $\Pi_M(\varepsilon,T)$ the policy for M whose ε-return mixing time is T.
➃denote by $Opt(\Pi_M(\varepsilon,T))$ the optimal expected return attainable by such a policy.
Then, with probability no less than $1-\delta$, the RMAX algorithm attains an expected return of $Opt(\Pi_M(\varepsilon,T))-2\cdot\varepsilon$, within a number of steps polynomial in $N$, $k$, $T$, $\frac{1}{\varepsilon}$ and $\frac{1}{\delta}$.
Notes::mjtsai1974
Why does the execution of the RMAX algorithm attain an expected return of $Opt(\Pi_M(\varepsilon,T))-2\cdot\varepsilon$?
Because the optimal policy is defined with respect to the ε-return mixing time T, the real return obtained by executing the RMAX algorithm must fall somewhat below it; the theorem therefore allows a slack of $2\cdot\varepsilon$.
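A hedged way to account for the $2\cdot\varepsilon$ slack (this decomposition is my reading, not a verbatim statement from the paper): one ε is given up because the return is only measured against the ε-return mixing time T, and one more ε is given up to the approximation and exploration phases of the algorithm itself, so that

$$\text{actual return}\;\geq\;\underbrace{Opt(\Pi_M(\varepsilon,T))-\varepsilon}_{\text{mixing-time truncation}}\;-\;\underbrace{\varepsilon}_{\text{approximation / exploration}}\;=\;Opt(\Pi_M(\varepsilon,T))-2\cdot\varepsilon$$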
Why is the RMAX algorithm polynomial?
[1] Recap on the implicit explore or exploit lemma
The implicit explore or exploit lemma guarantees that the policy generated from the simulated model $M_L$ and executed on the target model M either comes α close to the optimal reward or explores efficiently with probability of at least $\frac{\alpha}{R_{max}}$, where $M_L$ is an α approximation of M and $\alpha=\frac{\varepsilon}{N\cdot T\cdot R_{max}}$.
[2] Due to α approximation
Since $\alpha=\frac{\varepsilon}{N\cdot T\cdot R_{max}}$, the number of T-step phases in which the execution of the RMAX algorithm is exploring is polynomial in N, T and $\frac{1}{\varepsilon}$.
[3] Learn over N states and $k^2$ joint actions
There are in total N states (or stage games) in model M, with k actions for the agent and k for the adversary; therefore, the number of parameters to learn is polynomial in N and k.
[4] The probability to update the unknown information
The probability that the RMAX algorithm turns a state from unknown to known, or updates its statistics, is polynomial in ε, T and N.
[Notes] A brief summary
Based on all of the above, by sampling a large number of times, the implicit explore or exploit lemma guarantees a minimum probability of exploration, and we can ensure that the failure rate in attaining the optimal reward is quite small, say δ.
The RMAX algorithm claims that we can get a near optimal reward with probability $1-\delta$ by sampling a sufficiently large number of trials over the same state, a number that is polynomial in $\frac{1}{\delta}$.
Note that each trial consists of either exploration or exploitation; we next prove this theorem.
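To make "sampling a sufficiently large number of trials over the same state" concrete, here is a minimal Python sketch of the kind of bookkeeping an RMAX-style learner performs; the names (`K1`, `record_transition`, `known`) and the fixed `K1` value are illustrative assumptions, not the paper's notation.

```python
from collections import defaultdict

# Minimal bookkeeping sketch: every (state, agent action, adversary action)
# entry starts unknown and is promoted to known after K1 recorded visits,
# after which its empirical transition statistics are trusted and frozen.
K1 = 500  # illustrative; the proof below derives the real polynomial value

visit_count = defaultdict(int)
transition_count = defaultdict(lambda: defaultdict(int))
known = set()

def record_transition(state, a, o, next_state):
    """Update statistics for an unknown entry; freeze it once known."""
    entry = (state, a, o)
    if entry in known:
        return
    visit_count[entry] += 1
    transition_count[entry][next_state] += 1
    if visit_count[entry] >= K1:
        known.add(entry)  # from now on the empirical model is used as-is

def empirical_p(state, a, o, next_state):
    """Empirical transition probability estimated from the recorded visits."""
    entry = (state, a, o)
    total = visit_count[entry]
    return transition_count[entry][next_state] / total if total else 0.0
```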
The RMAX Theorem Proof::1
[1] The accuracy in the estimate of transition probability
First, we'd like to prove that the estimate of transition probability in the implicit explore or exploit process is accurate.
The main focus is on the number of trials in this same state, that is, after how many transitions out of this same state (whether for exploration or exploitation) can we believe that the estimated transition probability is accurate?
[2] By sampling a large number of trials over the same state
After how many trials on $G_i$ (this same state) can we update the transition statistics of $G_i$?
➀suppose there exists such a transition probability p on $G_i$; it cannot be guaranteed with probability 1, that is, $0\leq p\leq 1$.
➁in total, there are $N\cdot k^2$ such probability entries, for we have N states, with the agent and the adversary each having k actions.
➂treat the random variable $X_i$ as the indicator of a distinct trial on state $G_i$, with the above transition probability p of moving from state $G_i$ to $G_{i'}$, that is to say
- $X_i=1$ iff the trial transits from $G_i$ to $G_{i'}$, with probability p; otherwise $X_i=0$ iff it just stays in the same state $G_i$, with probability $1-p$.
➃let $Z_i=X_i-p$; then
$E[Z_i]=E[X_i-p]=E[X_i]-p=1\cdot p-p=0$, and $|Z_i|\leq 1$.
- By Chernoff Bounds For Bernoulli Random Variable
We have $P(\sum_{i=1}^{n}Z_i>a)<e^{-\frac{a^2}{2\cdot n}}$, where $Z_1,Z_2,\dots,Z_n$ are the n distinct independent trials on state $G_i$ and a is the error term. This inequality asks for the error probability that, after n independent trials on state $G_i$, the total estimation bias is greater than the error term a; this error probability is upper bounded by $e^{-\frac{a^2}{2\cdot n}}$.
➄if we perform the test on this state $G_i$ for $K_1$ times, then the following inequality holds:
$P(\sum_{i=1}^{K_1}Z_i>K_1^{\frac{2}{3}})<e^{-\frac{K_1^{\frac{1}{3}}}{2}}$
The RMAX paper restricts the total loss (or bias) in the estimate of the transition probability p on $G_i$ over $K_1$ trials to be less than $K_1^{\frac{2}{3}}$, and such error probability is upper bounded by $e^{-\frac{K_1^{\frac{1}{3}}}{2}}$ (see the numeric sketch just below).
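As a quick sanity check of this bound, the following Python snippet (an illustrative sketch; `p`, `K1` and the run count are arbitrary choices) estimates $P(\sum_{i=1}^{K_1}Z_i>K_1^{2/3})$ by simulation and compares it with $e^{-K_1^{1/3}/2}$:

```python
import math
import random

def empirical_tail(p, K1, runs=5000):
    """Estimate P(sum_{i=1}^{K1} (X_i - p) > K1^(2/3)) by simulation,
    where each X_i is a Bernoulli(p) transition indicator."""
    a = K1 ** (2.0 / 3.0)                      # error term K1^(2/3)
    hits = 0
    for _ in range(runs):
        s = sum((1 if random.random() < p else 0) - p for _ in range(K1))
        if s > a:
            hits += 1
    return hits / runs

K1 = 1000
chernoff_bound = math.exp(-(K1 ** (1.0 / 3.0)) / 2.0)   # e^{-K1^{1/3}/2}
print(empirical_tail(0.3, K1), "<=", chernoff_bound)
```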
- Inequality symmetry and regularization
If we take $Z'_i=p-X_i$,
then $P(\sum_{i=1}^{K_1}Z'_i>K_1^{\frac{2}{3}})<e^{-\frac{K_1^{\frac{1}{3}}}{2}}$,
therefore $P(|\sum_{i=1}^{K_1}(X_i-p)|>K_1^{\frac{2}{3}})<2\cdot e^{-\frac{K_1^{\frac{1}{3}}}{2}}$ by symmetry,
and finally $P(|\frac{\sum_{i=1}^{K_1}X_i}{K_1}-p|>K_1^{-\frac{1}{3}})<2\cdot e^{-\frac{K_1^{\frac{1}{3}}}{2}}$.
➅back to our departure point: the RMAX algorithm would like to attain, or get close to, the optimal reward with probability $1-\delta$, where δ is the error probability.
- To limit the probabilistic failure of the estimated transition probability
We want this probabilistic failure to be smaller than δ; in the paper's proof it is upper bounded by $\frac{\delta}{3\cdot N\cdot k^2}$, which is definitely smaller than δ, by taking $2\cdot e^{-\frac{K_1^{\frac{1}{3}}}{2}}<\frac{\delta}{3\cdot N\cdot k^2}$.
- To limit the total estimation loss after $K_1$ trials
In the paper's proof, the error term is chosen so that $K_1^{-\frac{1}{3}}<\frac{\varepsilon}{2\cdot N\cdot T\cdot R_{max}}$; then the total estimation loss must be less than ε.
➆expand from $2\cdot e^{-\frac{K_1^{\frac{1}{3}}}{2}}<\frac{\delta}{3\cdot N\cdot k^2}$
$\Rightarrow K_1>-8\cdot\ln^3(\frac{\delta}{6\cdot N\cdot k^2})$,
and we can choose $K_1>-6\cdot\ln^3(\frac{\delta}{6\cdot N\cdot k^2})$ to bound the probabilistic estimation failure.
➇expand from $K_1^{-\frac{1}{3}}<\frac{\varepsilon}{2\cdot N\cdot T\cdot R_{max}}$
$\Rightarrow K_1>(\frac{2\cdot N\cdot T\cdot R_{max}}{\varepsilon})^3$,
and $K_1>(\frac{4\cdot N\cdot T\cdot R_{max}}{\varepsilon})^3$ also satisfies it.
Finally, $K_1=\max((\frac{4\cdot N\cdot T\cdot R_{max}}{\varepsilon})^3,-6\cdot\ln^3(\frac{\delta}{6\cdot N\cdot k^2}))+1$; after these $K_1$ trials over the same state, we are confident to turn a state from unknown to known, or to update its statistics, and we believe that the estimated transition probability is accurate.
The RMAX Theorem Proof::2
The implicit explore or exploit lemma yields an exploration probability of $\frac{\alpha}{R_{max}}$; we next show that after pure exploration over $K_2$ trials on this same state $G_i$, we can obtain the $K_1$ required visits on $G_i$.
➀let $X_i$ be the indicator random variable for each trial, whose value is 1 iff the trial transits from $G_i$ to $G_{i'}$, or 0 iff it just exploits in the same $G_i$.
➁let $Z_i=X_i-\frac{\alpha}{R_{max}}$ and $Z'_i=\frac{\alpha}{R_{max}}-X_i$; then $E[Z_i]=0$ and $E[Z'_i]=0$ respectively.
- By Chernoff Bounds For Bernoulli Random Variable
$P(|\sum_{i=1}^{K_2}(X_i-\frac{\alpha}{R_{max}})|>K_2^{\frac{1}{3}})<2\cdot e^{-\frac{K_2^{-\frac{1}{3}}}{2}}$
$\Rightarrow P(|\sum_{i=1}^{K_2}X_i-K_2\cdot\frac{\alpha}{R_{max}}|>K_2^{\frac{1}{3}})<2\cdot e^{-\frac{K_2^{-\frac{1}{3}}}{2}}$
- To limit the error terms
Take $2\cdot e^{-\frac{K_2^{-\frac{1}{3}}}{2}}<\frac{\delta}{3\cdot N\cdot k^2}$,
and take $K_2\cdot\frac{\alpha}{R_{max}}+K_2^{\frac{1}{3}}>K_1\cdot N\cdot k^2$,
since in total we need $N\cdot k^2\cdot K_1$ visits to unknown slots. These guarantee that the failure probability is less than $\frac{\delta}{3}$, which is much smaller than δ.
- Why $K_2^{\frac{1}{3}}$ as the error term instead of $K_2^{\frac{2}{3}}$?
The design of this proof is to find the least number $K_2$ of pure exploration trials that guarantees we could have the $K_1$ visits on this same state. Since the $K_1$ trials contain both exploration and exploitation, $K_2\leq K_1$; therefore, the error term is expressed in terms of $K_2^{\frac{1}{3}}$ rather than $K_2^{\frac{2}{3}}$!! We want a smaller $K_2<K_1$ and $K_2^{\frac{1}{3}}<K_1^{\frac{2}{3}}$.
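For concreteness, the visit-count condition above can be transcribed directly into a small checker. This is a minimal sketch that takes the condition exactly as stated, with $\alpha=\frac{\varepsilon}{N\cdot T\cdot R_{max}}$ from the lemma; the function names are mine:

```python
def alpha(N, T, Rmax, eps):
    """alpha = eps / (N * T * Rmax), from the implicit explore or exploit lemma."""
    return eps / (N * T * Rmax)

def enough_exploration(K2, K1, N, k, T, Rmax, eps):
    """Check the stated visit-count condition:
    K2 * (alpha / Rmax) + K2^(1/3) > K1 * N * k^2,
    i.e. whether K2 exploration trials cover the N * k^2 * K1 visits
    needed to make every (state, action, action) entry known."""
    a = alpha(N, T, Rmax, eps)
    return K2 * (a / Rmax) + K2 ** (1.0 / 3.0) > K1 * N * k ** 2
```

One could then scan K2 upward until `enough_exploration` first returns True to obtain the least such K2.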
The RMAX Theorem Proof::3
This section discusses the scenario where we perform T-step iterations without learning, that is to say the algorithm exploits over T steps, and the expected reward attainable in such a phase would be $Opt(\Pi_M(\varepsilon,T))-\varepsilon$.
- The actual return may be lower
Since the actual reward after executing the RMAX algorithm lies some distance from the optimal reward, it is at most $Opt(\Pi_M(\varepsilon,T))-\varepsilon$, and it may well be lower than $Opt(\Pi_M(\varepsilon,T))-\varepsilon$, which is plausible. If we prevent the algorithm from exploring, we at least stay at some local suboptimum.
- We want it to be closer
To align with the design in the section above, we choose the real return of exploitation after the T-step iterations to be $Opt(\Pi_M(\varepsilon,T))-\frac{3}{2}\varepsilon$; it is then $\frac{1}{2}\varepsilon$ close to the reward $Opt(\Pi_M(\varepsilon,T))-\varepsilon$. The RMAX paper upper bounds the failure probability by $\frac{\delta}{3}$.
➀take $X_i$ to be the actual reward obtained in each trial
➁treat $\mu$ as the optimal reward
➂then by $Y_i=\frac{\mu-X_i}{R_{max}}$ we can normalize the deviation, since the rewards are bounded by $R_{max}$
➃suppose $Z=M\cdot N\cdot T$, where $M>0$; after Z exploitation steps, the probability that the total loss is greater than $Z^{\frac{2}{3}}$ is upper bounded by $e^{-\frac{Z^{\frac{1}{3}}}{2}}$
- By Chernoff Bounds For Bernoulli Random Variable
$P(\sum_{i=1}^{Z}Y_i>Z^{\frac{2}{3}})<e^{-\frac{Z^{\frac{1}{3}}}{2}}$
$\Rightarrow P(\sum_{i=1}^{Z}\frac{\mu-X_i}{R_{max}}>Z^{\frac{2}{3}})<e^{-\frac{Z^{\frac{1}{3}}}{2}}$
$\Rightarrow P(\sum_{i=1}^{Z}\frac{\mu-X_i}{Z\cdot R_{max}}>Z^{-\frac{1}{3}})<e^{-\frac{Z^{\frac{1}{3}}}{2}}$
$\Rightarrow P(\sum_{i=1}^{Z}\frac{\mu-X_i}{Z}>\frac{R_{max}}{Z^{\frac{1}{3}}})<e^{-\frac{Z^{\frac{1}{3}}}{2}}$
➄by choosing M such that
$\frac{R_{max}}{Z^{\frac{1}{3}}}<\frac{\varepsilon}{2}$ and $e^{-\frac{Z^{\frac{1}{3}}}{2}}<\frac{\delta}{3}$
$\Rightarrow Z>(\frac{2\cdot R_{max}}{\varepsilon})^3$ and $Z>-8\cdot\ln^3(\frac{\delta}{3})$,
we thus have a failure probability of less than $\frac{\delta}{3}$ that the real return obtained is more than $\frac{\varepsilon}{2}$ away from the optimal reward.
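Putting the two constraints on Z together, a minimal sketch (the function name and example values are mine):

```python
import math

def least_z(Rmax, eps, delta):
    """Smallest Z (rounded up) satisfying both
    Rmax / Z^(1/3) < eps / 2   and   e^(-Z^(1/3)/2) < delta / 3,
    i.e. Z > (2*Rmax/eps)^3 and Z > -8*ln^3(delta/3)."""
    accuracy_term = (2.0 * Rmax / eps) ** 3
    confidence_term = -8.0 * math.log(delta / 3.0) ** 3
    return int(max(accuracy_term, confidence_term)) + 1

print(least_z(Rmax=1.0, eps=0.1, delta=0.1))
```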