Bellman Operator Makes Convergence

05 Mar 2019

Prologue To The Bellman Operator Makes Convergence

By contiguous Bellman update, the value of each state eventually get converged, this happens a lot in reinforcement learning.

Begin By Bellman Operator - B

[1] Beginning
Let B be an operator, or mapping from value function to value function. You give a Q function, the Bellman operator will give you back, possibly, a different Q function. You can treat B a function from Q functions to Q functions.
[2] Definition of Bellman operator
$BQ$=$R(S,A)$+$\gamma\cdot{\textstyle\sum_{S’}}P(S’\vert S,A)\cdot max_{A’}Q(S’,A’)$

Give B a Q function, the new thing we get out when we apply B onto Q, has the property that at the state action pair $(S,A)$, it is equal to the immediate reward plus the discounted expected value of the next state, $S'$.

So, Q goes in, BQ goes out, treat BQ a new function.
[3] Another way of writing
With Bellman operator - B, we can have below alternatives:
➀$Q^{\ast}$=$BQ^{\ast}$ is another way of writing Bellman equation.
➁$Q_{t}$=$BQ_{t-1}$ is another way of writing value iteration.

Contraction Mapping

[1] Definition
It happens a lot in reinforcement learning. Take B to be an operator,
$\;\;$if for all $F$,$G$ and some $0\leq\gamma<1$,
$\;\;\vert\vert BF-BG\vert\vert_{\infty}\leq\gamma\cdot \vert\vert F-G\vert\vert_{\infty}$,
then B is contraction mapping, where
➀$F$ and $G$ are value functions of Q form.
➁$\vert\vert Q\vert\vert_{\infty}$=$max_{S,A}\vert Q(S,A)\vert$, this notation sometimes called the infinity form, the max norm.
➂$\vert\vert F-G\vert\vert_{\infty}$ this means the biggest difference between $F$ and $G$.

The contraction mapping means whatever largest difference is, we are going to multiply it by something that makes it even smaller. If you apply B onto $F$ and $G$, the distance between the resulting function is even closer together than they started.
[2] Hints
It’s the case that over time as we apply the Bellman operator - B over and over again, the state action pair where their maximum is, might change every time.
➀$\vert\vert F-G\vert\vert_{\infty}$, this maximum norm is computing for the specific value function of $F$ and $G$, where their absolute difference is given the biggest.
➁$\vert\vert BF-BG\vert\vert_{\infty}$ is different from $\vert\vert F-G\vert\vert_{\infty}$ in that there is no reason that $\vert\vert BF-BG\vert\vert_{\infty}$ needs to the the same over and over again.
[3] Properties
If B is an operaton leads to contraction mapping:
➀$F^{\ast}$=$BF^{\ast}$ has a unique solution.
proof:
Suppose we have $F_{1}^{\ast}$=$BF^{\ast}$ and $F_{1}^{\ast}$=$BF^{\ast}$, then $F^{\ast}$=$F_{1}^{\ast}$=$F_{2}^{\ast}$ just holds to hvave the unique $F^{\ast}$.

➁$F_{t}$=$BF_{t-1}$, this leads to $F_{t}\rightarrow F^{\ast}$, value iteration converges.
proof::mjtsai1974
Start from definition, we have:
$\vert\vert BF_{t-1}-BF^{\ast}\vert\vert_{\infty}\leq\gamma\cdot \vert\vert F_{t-1}-F^{\ast}\vert\vert_{\infty}$…[A]
$\Rightarrow \vert\vert F_{t}-F^{\ast}\vert\vert_{\infty}\leq\gamma\cdot \vert\vert F_{t-1}-F^{\ast}\vert\vert_{\infty}$

By mathematics induction, below deduction holds:
$\vert\vert BF_{t}-BF^{\ast}\vert\vert_{\infty}\leq\gamma\cdot \vert\vert F_{t}-F^{\ast}\vert\vert_{\infty}\leq\gamma^{2}\cdot \vert\vert F_{t-1}-F^{\ast}\vert\vert_{\infty}$…[B]

Next, substract [B] from [A], we get:
$\vert\vert BF_{t-1}-BF_{t}\vert\vert_{\infty}\leq\gamma\cdot(1-\gamma)\cdot\vert\vert F_{t-1}-F^{\ast}\vert\vert_{\infty}$
$\Rightarrow\vert\vert BF_{t-1}-BF_{t+1}\vert\vert_{\infty}\leq\gamma\cdot(1-\gamma^{2})\cdot\vert\vert F_{t-1}-F^{\ast}\vert\vert_{\infty}$
$\Rightarrow\vert\vert BF_{t-1}-BF_{t+2}\vert\vert_{\infty}\leq\gamma\cdot(1-\gamma^{3})\cdot\vert\vert F_{t-1}-F^{\ast}\vert\vert_{\infty}$
…
$\Rightarrow\vert\vert BF_{t-1}-BF_{t+n}\vert\vert_{\infty}\leq\gamma\cdot(1-\gamma^{n+1})\cdot\vert\vert F_{t-1}-F^{\ast}\vert\vert_{\infty}$

We can tell that the first difference term will become stable with respect to each transition in $t$, $t+1$,…, the vaule of $F$ just converges to $F^{\ast}$. As $n\rightarrow\infty$, we have $\gamma\cdot(1-\gamma^{n})\rightarrow \gamma$, for $\gamma<1$just holds.

If $F$ is evaluated from $t-1$, after $n\rightarrow\infty$, it will just converge to $F^{\ast}$ with the acceptable difference no more than $\gamma$ times the difference between the original departuring $F_{t-1}$ and the final expected $F^{\ast}$.

The Bellman Operator Contracts: The Proof

Given $Q_{1}$ and $Q_{2}$, then $\vert\vert BQ_1-BQ_2\vert\vert_\infty\leq\gamma\cdot\vert\vert Q_1-Q_2\vert\vert_\infty$, this says that after we apply B onto $Q_{1}$ and $Q_{2}$, the distance between them is less than or equal to $\gamma$ times the original difference betwen them before we apply it.

So, by applying the Bellman operator, we move these 2 $Q$ functions closer together.

In this section, we are going to prove the definition of contraction mapping by B.

proof:
➀when we apply B onto $Q$:
$\lbrack BQ\rbrack(S,A)$=$R(SA)$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q(S’,A’)$
➁the next thing is to unpack the definition of the max norm.
$\vert\vert BQ_{1}-BQ_{2}\vert\vert_{\infty}$
=$max_{S,A}\vert BQ_{1}(S,A)-BQ_{2}(S,A)\vert$
…we are going to maximize over all state action pair
=$max_{S,A}\vert R(S,A)+\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A_{Q_{1}}’}Q_{1}(S’,A_{Q_{1}}’)$
$\;\;$-$R(S,A)-\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A_{Q_{2}}’}Q_{2}(S’,A_{Q_{2}}’)\vert$
=$max_{S,A}\vert \gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A_{Q_{1}}’}Q_{1}(S’,A_{Q_{1}}’)$
$\;\;$-$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A_{Q_{2}}’}Q_{2}(S’,A_{Q_{2}}’)\vert$
=$max_{S,A}\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot\vert max_{A_{Q_{1}}’}Q_{1}(S’,A_{Q_{1}}’)$
$\;\;$-$max_{A_{Q_{2}}’}Q_{2}(S’,A_{Q_{2}}’)\vert$
$\leq \gamma\cdot max_{S’}\vert max_{A_{Q_{1}}’}Q_{1}(S’,A_{Q_{1}}’)$
$\;\;$-$max_{A_{Q_{2}}’}Q_{2}(S’,A_{Q_{2}}’)\vert$
=$\gamma\cdot max_{S’,A_{Q_{1}}’,A_{Q_{2}}’}\vert Q_{1}(S’,A_{Q_{1}}’)$-$Q_{2}(S’,A_{Q_{2}}’)\vert$

By tossing out the weight combination, the term $\sum_{S'}P(S'\vert S,A)$ of $Q$ values, we just worry about the $S'$, where the difference is the largest, the weighted average or the convex combination could not be larger than the biggest difference at any $S'$, therefore, no need to keep $max_{S,A}$.

➂$\vert\vert BQ_1-BQ_2\vert\vert_\infty\leq\gamma\cdot\vert\vert Q_1-Q_2\vert\vert_\infty$ is thus proved.

Maximum Is Non-Expansion

For all $f$ and $g$, we have:
$\vert max_{a}f(a)-max_{a}g(a)\vert\leq max_{a}\vert f(a)-g(a)\vert$

proof:
$\Rightarrow$Suppose we have below maximum argument holds:
➀$maxarg_{a}f(a)$=$a_{1}$
➁$maxarg_{a}g(a)$=$a_{2}$
And, we assume $f(a_{1})\geq g(a_{2})$, this assumption owuld not lose generality in this proof, we can also prove with the assumption $f(a_{1})\leq g(a_{2})$.

$\Rightarrow$since $maxarg_{a}g(a)$=$a_{2}$, we have $g(a_{2})\geq g(a_{1})$

$\Rightarrow\vert max_{a}f(a)-max_{a}g(a)\vert$
=$f(a_{1})-g(a_{2})$
$\leq f(a_{1})-g(a_{1})$
=$max_{a}\vert f(a)-g(a)\vert$

Addendum

➀Temporal Difference Convergence, Charles IsBell, Michael Littman, Reinforcement Learning By Georgia Tech(CS8803)
➁An introduction to Q-Learning: reinforcement learning, ADL

mjtsai1974's Dev Blog Welcome to mjt's AI world

Bellman Operator Makes Convergence

Prologue To The Bellman Operator Makes Convergence

Begin By Bellman Operator - B

Contraction Mapping

The Bellman Operator Contracts: The Proof

Maximum Is Non-Expansion

Addendum

Related Posts

Partial Observable Markov Decision Process - Part 4 04 Dec 2020

Partial Observable Markov Decision Process - Part 3 10 Oct 2020

Partial Observable Markov Decision Process - Part 2 13 Aug 2020