
Bellman Operator And Convergence Properties

Prologue To The Bellman Operator And Convergence Properties

The Bellman operator, being a contraction mapping, makes the statement of convergence concrete and brings out its major properties.

The Statement of Convergence

[1] The statement

➀let $B$ be a contraction-mapping operator, and $Q^{\ast}$=$BQ^{\ast}$ be its fixed point.
➁let $Q_{0}$ be a $Q$ function, and define $Q_{t+1}$=$\lbrack B_{t}Q_{t}\rbrack Q_{t}$; then $Q_{t}\rightarrow Q^{\ast}$.

Suppose we have been given some sequence of $Q$ functions.
➀it starts off with $Q_{0}$, and the way we generate the next step from the previous step is that we have a new kind of operator, $B_{t}$.
➁$B_{t}$ is applied to $Q_{t}$, producing an operator $\lbrack B_{t}Q_{t}\rbrack$ that we then apply to $Q_{t}$ again, and that is what we assign $Q_{t+1}$ to be.

So, in the context of $Q$ learning, this is essentially the $Q$ learning update; there are 2 different $Q$ functions that are used in the $Q$ learning update, as sketched below:
➀one is the $Q_{t}$ function in $\lbrack B_{t}Q_{t}\rbrack$ that is used to average over, to take care of the fact that there is noise in the stochastic, probabilistic transitions.
➁the other one is the $Q_{t}$ that we are using in the one-step look-ahead as part of the Bellman operation.
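Below is a minimal Python sketch of this two-argument operator on a made-up 2-state MDP; the function and variable names (`apply_Bt`, `sample_next_state`, the rewards and transition probabilities) are illustrative assumptions, not code from the lecture. Setting $U$=$Q$=$Q_{t}$ recovers the iteration $Q_{t+1}$=$\lbrack B_{t}Q_{t}\rbrack Q_{t}$.

```python
import random

# A made-up 2-state, 2-action MDP, used only for illustration.
STATES, ACTIONS, GAMMA = [0, 1], [0, 1], 0.9

def reward(s, a):
    return 1.0 if (s, a) == (1, 1) else 0.0

def sample_next_state(s, a):
    # stochastic transition: action 1 tends to move toward state 1
    p_to_1 = 0.8 if a == 1 else 0.2
    return 1 if random.random() < p_to_1 else 0

def apply_Bt(U, Q, s, a, alpha):
    """([B_t U] Q)(s, a): U supplies the old estimate that alpha averages
    over (absorbing the sampling noise), while Q supplies the one-step
    look-ahead target."""
    s_next = sample_next_state(s, a)
    target = reward(s, a) + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    return U[(s, a)] + alpha * (target - U[(s, a)])

# Q_{t+1} = [B_t Q_t] Q_t : both roles are played by the same Q_t.
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for t in range(1, 50001):
    s, a = random.choice(STATES), random.choice(ACTIONS)
    alpha = 1.0 / (1.0 + t / 100.0)            # a decaying step size
    Q[(s, a)] = apply_Bt(Q, Q, s, a, alpha)

print(Q)  # the estimates approach the fixed point Q* of the Bellman operator
```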

[2] Notes::mjtsai1974

➀we can treat the term $\lbrack B_{t}Q_{t}\rbrack$ as $B_{t++}$; note that it is $t$ plus plus, not $t+1$.
➁$t$ plus plus means that we start from $t$, with some extra effort of plus plus to perhaps gain some reward. If we start from $t+1$, then we write $\lbrack B_{t+1}Q_{t+1}\rbrack$=$B_{(t+1)++}$.
➂in the final fixed point $Q^{\ast}$=$BQ^{\ast}$, the $B$ term is actually $B_{\ast}$.

The Convergence Property 1

The cool thing is that this sequence of $Q$ functions, starting from any $Q_{0}$ and repeatedly applying $Q_{t+1}$=$\lbrack B_{t}Q_{t}\rbrack Q_{t}$, is going to approach $Q^{\ast}$, as long as certain properties hold on how we define this $B_{t}$.

For all $U_{1}$,$U_{2}$, $S$, $A$:
$\;\;\vert(\lbrack B_{t}U_{1}\rbrack Q^{\ast})(S,A)-(\lbrack B_{t}U_{2}\rbrack Q^{\ast})(S,A)\vert$
$\;\;\leq(1-\alpha_{t}(S,A))\cdot\vert U_{1}(S,A)-U_{2}(S,A)\vert$
, where the lowercase $t$ indicates that $B_{t}$ is applied at the $t$-th step, on the transition from the current state to the next state at $t+1$.

proof::mjtsai1974

$\Rightarrow$Recall that in Bellman Operator Makes Convergence:
$\;\;\lbrack BQ\rbrack(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q(S’,A’)$
➀if we take $Q^{\ast}$=$BQ^{\ast}$, this indicates that, in the limit,
$Q^{\ast}(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q_{any}(S’,A’)$
, that is, we can transition from any $Q$, the $Q_{any}$, to $Q^{\ast}$
, where $Q_{any}$=$\{Q_{t},Q_{t+1},Q_{t+2},…,Q^{\ast}\}$
➁therefore, the following holds (a short model-based code sketch follows the equation):
$Q_{t+1}(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q_{t}(S’,A’)$
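When the transition model $P(S’\vert S,A)$ is known, ➁ is just the full Bellman backup. Here is a minimal model-based Python sketch on the same kind of made-up 2-state MDP (all names and numbers are assumptions for illustration); starting from an arbitrary $Q_{0}$, the iteration settles at the fixed point $Q^{\ast}$=$BQ^{\ast}$.

```python
# Full (model-based) Bellman backup on a made-up 2-state MDP:
#   Q_{t+1}(S,A) = R(S,A) + gamma * sum_{S'} P(S'|S,A) * max_{A'} Q_t(S',A')
STATES, ACTIONS, GAMMA = [0, 1], [0, 1], 0.9

def R(s, a):
    return 1.0 if (s, a) == (1, 1) else 0.0

def P(s_next, s, a):
    # action 1 tends to move toward state 1
    p_to_1 = 0.8 if a == 1 else 0.2
    return p_to_1 if s_next == 1 else 1.0 - p_to_1

Q = {(s, a): 5.0 for s in STATES for a in ACTIONS}   # an arbitrary Q_0
for _ in range(200):
    Q = {(s, a): R(s, a) + GAMMA * sum(P(s2, s, a) * max(Q[(s2, a2)] for a2 in ACTIONS)
                                       for s2 in STATES)
         for s in STATES for a in ACTIONS}

print(Q)  # numerically indistinguishable from the fixed point Q* = B Q*
```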

$\Rightarrow\lbrack BQ\rbrack$ is a kind of operator that could be applied on $Q$. By $TD(0)$, we can establish it as:
$(\lbrack BQ_{T-1}\rbrack Q_{T-1})(S,A)$
=$Q_{T}(S,A)$
=$Q_{T-1}(S,A)$+$\alpha_{T}\cdot(R(S,A)$
$\;\;$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q_{T-1}(S’,A’)-Q_{T-1}(S,A))$
, where the uppercase $T$ stands for the $T$-th episode of transition from state $S$ to state $S'$.

$\Rightarrow$Take the first $Q_{T-1}$ term of $(\lbrack BQ_{T-1}\rbrack Q_{T-1})(S,A)$ as $U_{1}$ and $U_{2}$ respectively, and replace the second $Q_{T-1}$ term by $Q^{\ast}$; then:
$\vert(\lbrack BU_{1}\rbrack Q^{\ast})(S,A)-(\lbrack BU_{2}\rbrack Q^{\ast})(S,A)\vert$
=$\vert\alpha_{U_{1}}\cdot(R(S,A)$
$\;\;$+$\gamma\cdot\sum_{S’}P_{U_{1}}(S’\vert S,A)\cdot max_{A’}Q^{\ast}(S’,A’)-Q^{\ast}(S,A))$
-$\alpha_{U_{2}}\cdot(R(S,A)$
$\;\;$+$\gamma\cdot\sum_{S’}P_{U_{2}}(S’\vert S,A)\cdot max_{A’}Q^{\ast}(S’,A’)-Q^{\ast}(S,A))\vert$
=$\vert\alpha_{U_{1}}\cdot D_{U_{1}}-\alpha_{U_{2}}\cdot D_{U_{2}}\vert$
, where the update term $D_{U_{i}}$ could be expressed as:
$R(S,A)$+$\gamma\cdot\sum_{S’}P_{U_{i}}(S’\vert S,A)\cdot max_{A’}Q^{\ast}(S’,A’)$-$Q^{\ast}(S,A)$
, and $i$=$1,2$ in this proof.

$\Rightarrow$Continue above deduction:
$\vert\alpha_{U_{1}}\cdot D_{U_{1}}-\alpha_{U_{2}}\cdot D_{U_{2}}\vert$
$\approx\vert(D_{U_{1}}-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}}\cdot D_{U_{2}})\vert\cdot\alpha_{U_{1}}$
$\leq\vert(1-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}})\cdot (D_{U_{1}}-D_{U_{2}})\vert\cdot\alpha_{U_{1}}$…[A]

$\Rightarrow$Why does it hold?
Because $(D_{U_{1}}-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}}\cdot D_{U_{2}})\leq(1-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}})\cdot (D_{U_{1}}-D_{U_{2}})$…[B]
$\Leftrightarrow$
$maxarg(D_{U_{1}})$=$1$
$maxarg(D_{U_{2}})$=$\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}}$
$maxarg(D_{U_{1}}-D_{U_{2}})$=$1-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}}$
, which is the non-expansion property of the max.

$\Rightarrow$Multiplying both sides of [B] by $\alpha_{U_{1}}$, we get [A]; then:
➀take $1-\alpha_{T}$=$(1-\frac {\alpha_{U_{2}}}{\alpha_{U_{1}}})\cdot\alpha_{U_{1}}$, and the inequality is proved.
➁take $T$=$t$, since we are evaluating the $t$-th transition within the $T$-th episode, then we get
$(1-\alpha_{t})\cdot (D_{U_{1}}-D_{U_{2}})$…[C]
➂next, add the term $Q_{T-1}(S,A)$ to both $D_{U_{1}}$ and $D_{U_{2}}$:
$(1-\alpha_{t})\cdot (Q_{T-1}(S,A)+D_{U_{1}}-(Q_{T-1}(S,A)+D_{U_{2}}))$…[D]
, which is identical to [C].

$\Rightarrow$Finally, we establish this property:
$\;\;\vert(\lbrack B_{t}U_{1}\rbrack Q^{\ast})(S,A)-(\lbrack B_{t}U_{2}\rbrack Q^{\ast})(S,A)\vert$
$\;\;\leq(1-\alpha_{t}(S,A))\cdot\vert U_{1}(S,A)-U_{2}(S,A)\vert$
, in this proof we’re expressing the distinct $U_{1}$ and $U_{2}$ in the temporal-difference form:
$Q_{T}(S,A)$
=$Q_{T-1}(S,A)$+$\alpha_{T}\cdot(R(S,A)$
$\;\;$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q_{T-1}(S’,A’)-Q_{T-1}(S,A))$
➀$T$ means that $U_{1}$ and $U_{2}$ are evaluated in the $T$-th episode of their state transition.
➁take $T$=$\{T_{U_{1}}, T_{U_{2}}\}$; if $T_{U_{1}}$=$T_{U_{2}}$, the proof holds directly; however, the difference won't become larger even if $T_{U_{1}}\neq T_{U_{2}}$, and we keep on applying this Bellman operator over and over again!

This proof says that if we keep doing the Bellman update on $U_{1}$ and $U_{2}$, we finally learn the $Q^{\ast}(S,A)$ value.
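As a sanity check of property 1, here is a tiny numeric example, assuming the TD-style form of $\lbrack B_{t}U\rbrack Q^{\ast}$ used above; every number is made up. Fixing one sampled transition, the gap between the two outputs shrinks by exactly $(1-\alpha_{t})$.

```python
# Numeric check of property 1: with the same sampled transition (s, a, r, s')
# and the same alpha, U enters only through the (1 - alpha) * U(s, a) term,
# so the output gap shrinks by (1 - alpha).  All numbers are made up.
alpha, gamma = 0.3, 0.9
r, q_star_next_max = 1.0, 4.0            # R(s, a) and max_{a'} Q*(s', a')

def B_t(u_sa):
    # ([B_t U] Q*)(s, a) for one fixed sampled transition
    return u_sa + alpha * (r + gamma * q_star_next_max - u_sa)

u1, u2 = 7.0, 2.0
lhs = abs(B_t(u1) - B_t(u2))
rhs = (1 - alpha) * abs(u1 - u2)
print(lhs, rhs, lhs <= rhs + 1e-12)      # prints roughly 3.5 3.5 True
```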

The Convergence Property 2

For all $Q$,$U$,$S$,$A$:
$\;\;\vert(\lbrack B_{t}U\rbrack Q^{\ast})(S,A)-(\lbrack B_{t}U\rbrack Q)(S,A)\vert$
$\;\;\leq(\gamma\cdot\alpha_{t}(S,A))\cdot\vert Q^{\ast}(S,A)-Q(S,A)\vert$
, i.e., if we hold $U$ fixed, then we get a contraction of $Q$ toward $Q^{\ast}$ by applying $\lbrack B_{t}U\rbrack$ to both $Q^{\ast}$ and $Q$.

proof::mjtsai1974

$\Rightarrow$By taking $U_{1}$=$U_{2}$=$U$, we then have:
$\vert(\lbrack B_{t}U\rbrack Q^{\ast})(S,A)-(\lbrack B_{t}U\rbrack Q)(S,A)\vert$
=$\vert Q^{\ast}(S,A)$
$\;\;$+$\alpha_{t}\cdot(R(S,A)+\gamma\cdot\sum_{S’}P(S’\vert S,A)$
$\;\;\;\;\cdot max_{A’}Q^{\ast}(S’,A’)-Q^{\ast}(S,A))$
-$\lbrack Q(S,A)$
$\;\;$+$\alpha_{t}\cdot(R(S,A)+\gamma\cdot\sum_{S’}P(S’\vert S,A)$
$\;\;\;\;\cdot max_{A’}Q(S’,A’)-Q(S,A))\rbrack\vert$
=$\vert Q^{\ast}(S,A)-Q(S,A)$
+$\alpha_{t}\cdot\lbrack\gamma\cdot\sum_{S’}P(S’\vert S,A)$
$\;\;\cdot(max_{A’}Q^{\ast}(S’,A’)-max_{A’}Q(S’,A’))$
$\;\;\;\;$-$(Q^{\ast}(S,A)-Q(S,A))\rbrack\vert$
=$\vert(1-\alpha_{t})(Q^{\ast}(S,A)-Q(S,A))$
+$\alpha_{t}\cdot\lbrack\gamma\cdot\sum_{S’}P(S’\vert S,A)$
$\;\;\cdot(max_{A’}Q^{\ast}(S’,A’)-max_{A’}Q(S’,A’))\rbrack\vert$…[C]
$\leq(1-\alpha_{t})\cdot\vert Q^{\ast}(S,A)-Q(S,A)\vert$
$\;\;+\alpha_{t}\cdot\gamma\cdot\vert max_{A’}Q^{\ast}(S’,A’)-max_{A’}Q(S’,A’)\vert$…[D]

$\Rightarrow$Suppose we are under the definition $Q^{\ast}$=$BQ^{\ast}$ and $Q_{t+1}$=$\lbrack B_{t}Q_{t}\rbrack Q_{t}$; after running continuously over some horizon:
➀the $Q$-learning process converges, so that $Q\rightarrow Q^{\ast}$. The 1st part in [D] could be safely tossed out, since it approaches $0$.
➁the second part in [D] is kept, since it focuses on the $Q^{\ast}(S’,A’)$ and $Q(S’,A’)$ with the $A’$ that maximizes the $Q$ value, after tossing out the averaging over the probabilistic transition from $S$ to $S’$ in [C].

$\Rightarrow$We now have the inequality below:
$\leq\alpha_{t}\cdot\gamma\cdot\vert max_{A’}Q^{\ast}(S’,A’)-max_{A’}Q(S’,A’)\vert$
$\leq\alpha_{t}\cdot\gamma\cdot max_{A’}\vert Q^{\ast}(S’,A’)-Q(S’,A’)\vert$…[E]
$\leq\alpha_{t}\cdot\gamma\cdot\vert Q^{\ast}(S,A)-Q(S,A)\vert$…since $S’$ is just the next state of $S$, the pair $(S’,A’)$ is itself one of the $(S,A)$ pairs, and taking the difference in the maximum norm over all $(S,A)$ pairs makes this inequality hold.
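As a sanity check of property 2, here is another tiny numeric example, holding $U$ fixed, with every number made up; it also exercises the fact behind [E] that the max operator is a non-expansion.

```python
# Numeric check of property 2: holding U fixed, the two outputs of [B_t U]
# differ only through the one-step look-ahead targets, so the gap shrinks
# by gamma * alpha.  It also checks |max Q* - max Q| <= max |Q* - Q|.
alpha, gamma = 0.3, 0.9
u_sa, r = 2.0, 1.0
q_star_next = {0: 4.0, 1: 6.0}           # Q*(s', a') over the actions a'
q_next      = {0: 3.5, 1: 5.0}           # Q (s', a') over the actions a'

def B_t_U(q_over_next_actions):
    return u_sa + alpha * (r + gamma * max(q_over_next_actions.values()) - u_sa)

lhs = abs(B_t_U(q_star_next) - B_t_U(q_next))
max_gap = max(abs(q_star_next[a] - q_next[a]) for a in q_star_next)
print(lhs, gamma * alpha * max_gap, lhs <= gamma * alpha * max_gap + 1e-12)
# prints roughly 0.27 0.27 True
```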

The Convergence Property 3

This 3rd property gives the conditions on the learning rates that guarantee convergence (a quick numeric illustration follows the list):
➀$\sum\alpha_{t}\rightarrow\infty$
➁$\sum\alpha_{t}^{2}<\infty$
Details could be found in Temporal Difference Learning - Part 1.
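For instance, the classic schedule $\alpha_{t}$=$\frac {1}{t}$ satisfies both conditions, while a constant $\alpha$ fails the second one. A quick numeric illustration of the partial sums (nothing here is from the lecture):

```python
# Partial sums for alpha_t = 1/t: the first keeps growing (like ln N),
# the second stays bounded (it approaches pi^2 / 6 ~= 1.6449).
N = 1_000_000
sum_alpha    = sum(1.0 / t for t in range(1, N + 1))
sum_alpha_sq = sum(1.0 / t ** 2 for t in range(1, N + 1))
print(sum_alpha, sum_alpha_sq)   # roughly 14.39 and 1.6449
```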

Addendum

Convergence-1, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech (CS8803)
Convergence-2, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech (CS8803)
Convergence theorem explained-1, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech (CS8803)
Convergence theorem explained-2, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech (CS8803)