Bellman Operator And Convergence Properties
17 Mar 2019
Prologue To The Bellman Operator And Convergence Properties
The Statement of Convergence
[1] The statement
➀let $B$ be a contraction-mapping operator, and let $Q^{\ast}$=$BQ^{\ast}$ be its fixed point.
➁let $Q_{0}$ be a $Q$ function, and define $Q_{t+1}$=$\lbrack B_{t}Q_{t}\rbrack Q_{t}$; then $Q_{t}\rightarrow Q^{\ast}$.
Suppose we have been given some sequence of $Q$ functions.
➀it starts off with $Q_{0}$, and the way we generate the next step from the previous one is through a new kind of operator, $B_{t}$.
➁$B_{t}$ is applied to $Q_{t}$, producing an operator $\lbrack B_{t}Q_{t}\rbrack$ that we then apply to $Q_{t}$, and that is what we assign $Q_{t+1}$ to be.
So, in the context of $Q$ learning, this is essentially the $Q$ learning update, and there are 2 different $Q$ functions used in the $Q$ learning update:
[2] Notes::mjtsai1974
➀one is the $Q_{t}$ inside $\lbrack B_{t}Q_{t}\rbrack$, which is averaged together with the new target to take care of the noise in the stochastic state transitions.
➁the other one is the $Q_{t}$ that we are using in the one-step look-ahead as part of the Bellman operation (a small sketch of this update follows these notes).
➀we can treat the term $\lbrack B_{t}Q_{t}\rbrack$=$B_{t++}$; be noted that it is $t$ plus plus, not $t+1$.
➁$t$ plus plus means that we start from $t$, with some effort of plus plus to maybe gain some reward. If we start from $t+1$, then we say $\lbrack B_{t+1}Q_{t+1}\rbrack$=$B_{(t+1)++}$.
➂in the final fixed point of $Q^{\ast}$=$BQ^{\ast}$, the $B$ term is actually $B_{\ast}$.
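To make these two roles concrete, below is a minimal tabular sketch in Python of one such update $Q_{t+1}$=$\lbrack B_{t}Q_{t}\rbrack Q_{t}$. The tiny MDP sizes, the sampled transition, and the values of $\alpha$ and $\gamma$ are all made up for illustration, and the convex-combination form of the update is my own reading of the operator, not a definitive implementation.

```python
import numpy as np

# A tiny made-up MDP just for illustration: 3 states, 2 actions.
gamma, alpha = 0.9, 0.1
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))   # this plays the role of Q_t

def apply_Bt(Q_avg, Q_lookahead, s, a, r, s_next, alpha, gamma):
    """[B_t Q_avg] applied to Q_lookahead at one sampled transition (s, a, r, s_next).

    Q_avg       : the Q_t inside [B_t Q_t]; its old value is averaged with the new
                  target, which smooths out the noise of the stochastic transition.
    Q_lookahead : the Q_t the resulting operator is applied to; it supplies the
                  one-step look-ahead value max_a' Q(s_next, a').
    """
    Q_new = Q_avg.copy()
    target = r + gamma * np.max(Q_lookahead[s_next])
    Q_new[s, a] = (1 - alpha) * Q_avg[s, a] + alpha * target
    return Q_new

# One sampled transition; in ordinary Q-learning both roles are filled by Q_t,
# so Q_{t+1} = [B_t Q_t] Q_t is simply:
s, a, r, s_next = 0, 1, 1.0, 2
Q = apply_Bt(Q, Q, s, a, r, s_next, alpha, gamma)
```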
The Convergence Property 1
The cool thing is that this sequence of $Q$ functions, starting from any $Q_{0}$ and repeatedly applying $Q_{t+1}$=$\lbrack B_{t}Q_{t}\rbrack Q_{t}$, approaches $Q^{\ast}$, as long as certain properties hold on how we define this $B_{t}$.
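As a toy illustration of this claim (my own addition, not part of the lecture), the short Python loop below repeatedly applies such an update on a single-state MDP with a known closed-form $Q^{\ast}$; the MDP, the constant step size, and the sweep over actions are all assumptions made just to show the gap to $Q^{\ast}$ shrinking.

```python
import numpy as np

# Toy single-state MDP with a self-loop: action 0 pays 0, action 1 pays 1.
gamma, alpha = 0.9, 0.1
rewards = np.array([0.0, 1.0])
Q = np.zeros((1, 2))                                            # an arbitrary Q_0
Q_star = np.array([[gamma / (1 - gamma), 1.0 / (1 - gamma)]])   # closed-form Q*

# Repeatedly apply the update; a constant step size is fine here
# because this toy MDP is deterministic.
for t in range(2000):
    for a in (0, 1):                       # sweep both actions each iteration
        target = rewards[a] + gamma * np.max(Q[0])
        Q[0, a] = (1 - alpha) * Q[0, a] + alpha * target

print(np.abs(Q - Q_star).max())            # the gap to Q* shrinks toward 0
```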
For all $U_{1}$,$U_{2}$, $S$, $A$:
proof::mjtsai1974
$\;\;\vert(\lbrack B_{t}U_{1}\rbrack Q^{\ast})(S,A)-(\lbrack B_{t}U_{2}\rbrack Q^{\ast})(S,A)\vert$
$\;\;\leq(1-\alpha_{t}(S,A))\cdot\vert U_{1}(S,A)-U_{2}(S,A)\vert$
, where the lowercase $t$ says that $B$ is applied to the $t$-th state's current transition to the next, $t+1$ state.
$\Rightarrow$Recall that in Bellman Operator Makes Convergence:
$\;\;\lbrack BQ\rbrack(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q(S’,A’)$
➀if we take $Q^{\ast}$=$BQ^{\ast}$, this indicates that
$Q^{\ast}(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q_{any}(S’,A’)$
, we can transit from any $Q$, the $Q_{any}$, to $Q^{\ast}$
, where $Q_{any}$=$\{Q_{t},Q_{t+1},Q_{t+2},…Q^{\ast}\}$
➁therefore, the following holds:
$Q_{t+1}(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q_{t}(S’,A’)$
$\Rightarrow\lbrack BQ\rbrack$ is a kind of operator that could be applied to $Q$. By $TD(0)$, we can establish it as:
$(\lbrack BQ_{T-1}\rbrack Q_{T-1})(S,A)$
=$Q_{T}(S,A)$
=$Q_{T-1}(S,A)$+$\alpha_{T}\cdot(R(S,A)$
$\;\;$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q_{T-1}(S’,A’)-Q_{T-1}(S,A))$
, where uppercase $T$ stands for the $T$-th episode from state $S$ to state $S'$.
$\Rightarrow$Take $(\lbrack BQ_{T-1}\rbrack Q_{T-1})(S,A)$’s first $Q_{T-1}$ term as $U_{1}$ and $U_{2}$, and have the second $Q_{T-1}$ term replaced by $Q^{\ast}$; then:
$\vert(\lbrack BU_{1}\rbrack Q^{\ast})(S,A)-(\lbrack BU_{2}\rbrack Q^{\ast})(S,A)\vert$
=$\vert\alpha_{U_{1}}\cdot(R(S,A)$
$\;\;$+$\gamma\cdot\sum_{S’}P_{U_{1}}(S’\vert S,A)\cdot max_{A’}Q^{\ast}(S’,A’)-Q^{\ast}(S,A))$
-$\alpha_{U_{2}}\cdot(R(S,A)$
$\;\;$+$\gamma\cdot\sum_{S’}P_{U_{2}}(S’\vert S,A)\cdot max_{A’}Q^{\ast}(S’,A’)-Q^{\ast}(S,A))\vert$
=$\vert\alpha_{U_{1}}\cdot D_{U_{1}}-\alpha_{U_{2}}\cdot D_{U_{2}}\vert$
, where the update term $D_{U_{i}}$ could be expressed as:
$R(S,A)$+$\gamma\cdot\sum_{S’}P_{U_{i}}(S’\vert S,A)\cdot max_{A’}Q^{\ast}(S’,A’)$-$Q^{\ast}(S,A)$
, and $i$=$1,2$ in this proof.
$\Rightarrow$Continuing the deduction above:
$\vert\alpha_{U_{1}}\cdot D_{U_{1}}-\alpha_{U_{2}}\cdot D_{U_{2}}\vert$
$\approx\vert(D_{U_{1}}-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}}\cdot D_{U_{2}})\vert\cdot\alpha_{U_{1}}$
$\leq\vert(1-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}})\cdot (D_{U_{1}}-D_{U_{2}})\vert\cdot\alpha_{U_{1}}$…[A]
$\Rightarrow$Why does this hold?
Because $(D_{U_{1}}-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}}\cdot D_{U_{2}})\leq(1-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}})\cdot (D_{U_{1}}-D_{U_{2}})$…[B]
$\Leftrightarrow$
$maxarg(D_{U_{1}})$=$1$
$maxarg(D_{U_{2}})$=$\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}}$
$maxarg(D_{U_{1}}-D_{U_{2}})$=$1-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}}$
, it is the maximum non-expansion.
$\Rightarrow$Multiplying both sides of [B] by $\alpha_{U_{1}}$, we get [A]; then:
➀take $1-\alpha_{T}$=$(1-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}})\cdot\alpha_{U_{1}}$, and the inequality is proved.
➁take $T$=$t$, for we are evaluating the $t$-th transition by the $T$-th episode; then we get
$(1-\alpha_{t})\cdot (D_{U_{1}}-D_{U_{2}})$…[C]
➂next, add the term $Q_{T-1}(S,A)$ to both $D_{U_{1}}$ and $D_{U_{2}}$:
$(1-\alpha_{t})\cdot (Q_{T-1}(S,A)+D_{U_{1}}-(Q_{T-1}(S,A)+D_{U_{2}}))$…[D]
, which is identical to [C], since the two $Q_{T-1}(S,A)$ terms cancel.
$\Rightarrow$Finally, we establish this property:
$\;\;\vert(\lbrack B_{t}U_{1}\rbrack Q^{\ast})(S,A)-(\lbrack B_{t}U_{2}\rbrack Q^{\ast})(S,A)\vert$
$\;\;\leq(1-\alpha_{t}(S,A))\cdot\vert U_{1}(S,A)-U_{2}(S,A)\vert$
, in this proof we're expressing the distinct $U_{1}$ and $U_{2}$ in the temporal difference form:
$Q_{T}(S,A)$
=$Q_{T-1}(S,A)$+$\alpha_{T}\cdot(R(S,A)$
$\;\;$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q_{T-1}(S’,A’)-Q_{T-1}(S,A))$
➀$T$ means that $U_{1}$ and $U_{2}$ are evaluated in the $T$-th episode for their state transition.
➁take $T$=$\{T_{U_{1}}, T_{U_{2}}\}$; if $T_{U_{1}}$=$T_{U_{2}}$, the proof applies directly; however, the difference won't become larger even if $T_{U_{1}}\neq T_{U_{2}}$, as we keep applying this Bellman operator over and over again.
This proof says that by doing the Bellman update on $U_{1}$ and $U_{2}$, we finally learn the $Q^{\ast}(S,A)$ value.
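As a quick numeric sanity check of this property (my own addition, not part of the original proof), the Python snippet below assumes the operator takes the convex-combination form $(\lbrack B_{t}U\rbrack Q)(S,A)$=$(1-\alpha_{t})\cdot U(S,A)$+$\alpha_{t}\cdot(R(S,A)+\gamma\cdot max_{A’}Q(S’,A’))$ at the sampled transition; under that assumed form the $(1-\alpha_{t})$ bound holds with equality.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.3
Q_star = rng.normal(size=(4, 2))   # stands in for the fixed Q* operand
U1 = rng.normal(size=(4, 2))       # two arbitrary, distinct Q functions
U2 = rng.normal(size=(4, 2))
s, a, r, s_next = 1, 0, 0.5, 3     # one sampled transition, shared by both

def B_t(U, Q, s, a, r, s_next):
    # [B_t U] applied to Q at (s, a): average U toward the target built from Q.
    out = U.copy()
    out[s, a] = (1 - alpha) * U[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
    return out

lhs = abs(B_t(U1, Q_star, s, a, r, s_next)[s, a]
          - B_t(U2, Q_star, s, a, r, s_next)[s, a])
rhs = (1 - alpha) * abs(U1[s, a] - U2[s, a])
print(lhs, rhs, lhs <= rhs + 1e-12)   # the (1 - alpha) bound holds
```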
The Convergence Property 2
For all $Q$,$U$,$S$,$A$:
proof::mjtsai1974
$\;\;\vert(\lbrack B_{t}U\rbrack Q^{\ast})(S,A)-(\lbrack B_{t}U\rbrack Q)(S,A)\vert$
$\;\;\leq(\gamma\cdot\alpha_{t}(S,A))\cdot\vert Q^{\ast}(S,A)-Q(S,A)\vert$
, if we hold $U$ fixed, then we get the contraction of $Q$ toward $Q^{\ast}$ by applying $\lbrack B_{t}U\rbrack$ to both $Q^{\ast}$ and $Q$.
$\Rightarrow$Taking $U_{1}$=$U_{2}$=$U$, then:
$\vert(\lbrack B_{t}U\rbrack Q^{\ast})(S,A)-(\lbrack B_{t}U\rbrack Q)(S,A)\vert$
=$Q^{\ast}(S,A)$
$\;\;$+$\alpha_{t}\cdot(R(S,A)+\gamma\cdot\sum_{S’}P(S’\vert S,A)$
$\;\;\;\;\cdot max_{A’}Q^{\ast}(S’,A’)-Q^{\ast}(S,A))$
-$\lbrack Q(S,A)$
$\;\;$+$\alpha_{t}\cdot(R(S,A)+\gamma\cdot\sum_{S’}P(S’\vert S,A)$
$\;\;\;\;\cdot max_{A’}Q(S’,A’)-Q(S,A))\rbrack$
=$Q^{\ast}(S,A)-Q(S,A)$
+$\alpha_{t}\cdot\lbrack\gamma\cdot\sum_{S’}P(S’\vert S,A)$
$\;\;\cdot(max_{A’}Q^{\ast}(S’,A’)-max_{A’}Q(S’,A’))$
$\;\;\;\;$-$(Q^{\ast}(S,A)-Q(S,A))\rbrack$
=$(1-\alpha_{t})(Q^{\ast}(S,A)-Q(S,A))$
+$\alpha_{t}\cdot\lbrack\gamma\cdot\sum_{S’}P(S’\vert S,A)$
$\;\;\cdot(max_{A’}Q^{\ast}(S’,A’)-max_{A’}Q(S’,A’))\rbrack$…[C]
$\leq(1-\alpha_{t})(Q^{\ast}(S,A)-Q(S,A))$
$\;\;+\alpha_{t}\cdot\gamma\cdot(max_{A’}Q^{\ast}(S’,A’)-max_{A’}Q(S’,A’))$…[D]
$\Rightarrow$Suppose we are under the definition $Q^{\ast}$=$BQ^{\ast}$ and $Q_{t+1}$=$\lbrack B_{t}Q_{t}\rbrack Q_{t}$; after running continuously over some horizon:
➀the $Q$-learning just converges to have $Q\rightarrow Q^{\ast}$. The 1st part in [D] could be safely tossed out, since it is approaching $0$.
➁the second part in [D] is kept, since it focuses on the $Q^{\ast}(S’,A’)$ and $Q(S’,A’)$ with the $A’$ that maximizes the $Q$ value, after tossing out the probabilistic averaging over transitions from $S$ to $S’$ in [C].
$\Rightarrow$We now have the inequality below:
$\leq\alpha_{t}\cdot\gamma\cdot(max_{A’}Q^{\ast}(S’,A’)-max_{A’}Q(S’,A’))$
=$\alpha_{t}\cdot\gamma\cdot max_{A’}(Q^{\ast}(S’,A’)-Q(S’,A’))$…[E]
$\leq\alpha_{t}\cdot\gamma\cdot\vert Q^{\ast}(S,A)-Q(S,A)\vert$…since $S’$ is a next state of $S$, the pairs $(S’,A’)$ are themselves state-action pairs $(S,A)$, so this inequality just holds.
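Mirroring the check for property 1, the sketch below illustrates this bound numerically, again under my assumed convex-combination form of $\lbrack B_{t}U\rbrack$ and with the right hand side read as the largest gap $\vert Q^{\ast}-Q\vert$ over all state-action pairs (the sup-norm reading of the inequality); these assumptions are mine, not the lecture's.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, alpha = 0.9, 0.3
U = rng.normal(size=(4, 2))        # held fixed in both applications
Q_star = rng.normal(size=(4, 2))
Q = rng.normal(size=(4, 2))
s, a, r, s_next = 2, 1, -0.5, 0    # one sampled transition

def B_t(U, Q, s, a, r, s_next):
    # [B_t U] applied to Q at (s, a): average U toward the target built from Q.
    out = U.copy()
    out[s, a] = (1 - alpha) * U[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
    return out

lhs = abs(B_t(U, Q_star, s, a, r, s_next)[s, a]
          - B_t(U, Q, s, a, r, s_next)[s, a])
rhs = gamma * alpha * np.abs(Q_star - Q).max()   # sup-norm gap between Q* and Q
print(lhs, rhs, lhs <= rhs + 1e-12)              # the gamma * alpha bound holds
```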
The Convergence Property 3
This 3rd property guarantees convergence, provided the learning rates $\alpha_{t}$ satisfy:
➀$\sum\alpha_{t}\rightarrow\infty$
➁$\sum\alpha_{t}^{2}<\infty$
These two conditions could be found in Temporal Difference Learning - Part 1.
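For instance, the classic schedule $\alpha_{t}$=$\frac {1}{t}$ satisfies both conditions; the short Python check below (a toy illustration with an arbitrary cutoff $N$) shows the partial sum of $\alpha_{t}$ growing like $ln(N)$ while the sum of $\alpha_{t}^{2}$ stays bounded by $\frac {\pi^{2}}{6}$.

```python
import math

# alpha_t = 1 / t: the partial sums diverge (harmonic series),
# while the sum of squares converges to pi^2 / 6.
N = 100_000                                 # arbitrary cutoff for illustration
alphas = [1.0 / t for t in range(1, N + 1)]
print(sum(alphas))                          # ~ ln(N): grows without bound as N grows
print(sum(a * a for a in alphas), math.pi ** 2 / 6)   # stays finite, approaches pi^2/6
```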
Addendum
➀Convergence-1, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech (CS8803)
➁Convergence-2, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech (CS8803)
➂Convergence theorem explained-1, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech (CS8803)
➃Convergence theorem explained-2, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech (CS8803)