
Bellman Operator Makes Convergence

Prologue To The Bellman Operator Makes Convergence

By consecutive Bellman updates, the value of each state eventually converges; this happens a lot in reinforcement learning.

Begin By Bellman Operator - B

[1] Beginning

Let B be an operator, or a mapping from value function to value function. You give it a Q function, and the Bellman operator gives you back, possibly, a different Q function. You can treat B as a function from Q functions to Q functions.

[2] Definition of Bellman operator

$\lbrack BQ\rbrack(S,A)$=$R(S,A)$+$\gamma\cdot{\textstyle\sum_{S'}}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$

Give B a Q function; the new thing we get out when we apply B onto Q has the property that, at the state-action pair $(S,A)$, it is equal to the immediate reward plus the discounted expected value of the next state, $S'$.

So, Q goes in, BQ comes out; treat BQ as a new function.
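A minimal sketch of one application of B on a tabular MDP may help. The two-state, two-action MDP below, its $R$, $P$, $\gamma$ values, and the helper name bellman_operator are all made-up assumptions for illustration, not anything defined in the lectures:

```python
import numpy as np

def bellman_operator(Q, R, P, gamma):
    """Return BQ, where [BQ](s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q(s', a')."""
    V = Q.max(axis=1)          # max_{A'} Q(S', A') for every next state S'
    return R + gamma * P @ V   # P @ V computes sum_{S'} P(S'|S, A) * V(S')

# Made-up toy numbers, for illustration only.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                      # R[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])        # P[s, a, s'], each row sums to 1
gamma = 0.9

Q = np.zeros((2, 2))                            # Q goes in ...
BQ = bellman_operator(Q, R, P, gamma)           # ... BQ comes out, a new Q function
print(BQ)
```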

[3] Another way of writing

With the Bellman operator B, we have the alternatives below:
➀$Q^{\ast}$=$BQ^{\ast}$ is another way of writing the Bellman equation.
➁$Q_{t}$=$BQ_{t-1}$ is another way of writing value iteration (see the sketch after this list).
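As a rough sketch of point ➁, value iteration is just repeated application of B until the Q function stops moving. This reuses the toy R, P, gamma and the bellman_operator helper assumed in the sketch above:

```python
# Value iteration as Q_t = B Q_{t-1}, reusing bellman_operator, R, P, gamma from the sketch above.
import numpy as np

Q = np.zeros_like(R)
for t in range(1, 10_000):
    Q_next = bellman_operator(Q, R, P, gamma)
    if np.max(np.abs(Q_next - Q)) < 1e-10:      # stop once ||Q_t - Q_{t-1}||_inf is tiny
        break
    Q = Q_next

print(f"converged after {t} sweeps")
print(Q_next)                                   # approximately Q*, the fixed point Q* = B Q*
```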

Contraction Mapping

[1] Definition

It happens a lot in reinforcement learning. Take B to be an operator,
$\;\;$if for all $F$,$G$ and some $0\leq\gamma<1$,
$\;\;\vert\vert BF-BG\vert\vert_{\infty}\leq\gamma\cdot \vert\vert F-G\vert\vert_{\infty}$,
then B is a contraction mapping, where
➀$F$ and $G$ are value functions of Q form.
➁$\vert\vert Q\vert\vert_{\infty}$=$max_{S,A}\vert Q(S,A)\vert$; this notation is sometimes called the infinity norm, or the max norm.
➂$\vert\vert F-G\vert\vert_{\infty}$ means the biggest difference between $F$ and $G$ over all state-action pairs.

The contraction mapping means that whatever the largest difference is, we are going to multiply it by something that makes it even smaller. If you apply B onto $F$ and $G$, the resulting functions are even closer together than $F$ and $G$ were when they started.
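A quick numerical check of this property, again reusing the toy MDP and the bellman_operator helper assumed in the earlier sketches; $F$ and $G$ here are just two arbitrary Q-shaped tables:

```python
# Check ||BF - BG||_inf <= gamma * ||F - G||_inf on two arbitrary Q-shaped tables,
# reusing bellman_operator, R, P, gamma from the earlier sketch.
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=R.shape)
G = rng.normal(size=R.shape)

lhs = np.max(np.abs(bellman_operator(F, R, P, gamma) - bellman_operator(G, R, P, gamma)))
rhs = gamma * np.max(np.abs(F - G))
print(lhs, "<=", rhs, ":", lhs <= rhs + 1e-12)  # True: B pulls F and G closer together
```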

[2] Hints

It's the case that over time, as we apply the Bellman operator B over and over again, the state-action pair where the maximum is attained might change every time.
➀$\vert\vert F-G\vert\vert_{\infty}$: this max norm is computed at the specific state-action pair where the absolute difference between $F$ and $G$ is the biggest.
➁$\vert\vert BF-BG\vert\vert_{\infty}$ is different from $\vert\vert F-G\vert\vert_{\infty}$ in that there is no reason the maximizing state-action pair needs to be the same one, or to stay the same from one application of B to the next.

[3] Properties

If B is an operator that is a contraction mapping:
➀$F^{\ast}$=$BF^{\ast}$ has a unique solution.
proof:
Suppose we have $F_{1}^{\ast}$=$BF_{1}^{\ast}$ and $F_{2}^{\ast}$=$BF_{2}^{\ast}$, then $\vert\vert F_{1}^{\ast}-F_{2}^{\ast}\vert\vert_{\infty}$=$\vert\vert BF_{1}^{\ast}-BF_{2}^{\ast}\vert\vert_{\infty}\leq\gamma\cdot\vert\vert F_{1}^{\ast}-F_{2}^{\ast}\vert\vert_{\infty}$; since $0\leq\gamma<1$, this forces $\vert\vert F_{1}^{\ast}-F_{2}^{\ast}\vert\vert_{\infty}$=$0$, so $F_{1}^{\ast}$=$F_{2}^{\ast}$=$F^{\ast}$, the unique solution.

➁$F_{t}$=$BF_{t-1}$ leads to $F_{t}\rightarrow F^{\ast}$; that is, value iteration converges.

proof::mjtsai1974

Start from the definition, we have:
$\vert\vert BF_{t-1}-BF^{\ast}\vert\vert_{\infty}\leq\gamma\cdot \vert\vert F_{t-1}-F^{\ast}\vert\vert_{\infty}$…[A]
$\Rightarrow \vert\vert F_{t}-F^{\ast}\vert\vert_{\infty}\leq\gamma\cdot \vert\vert F_{t-1}-F^{\ast}\vert\vert_{\infty}$, since $F_{t}$=$BF_{t-1}$ and $F^{\ast}$=$BF^{\ast}$.

Applying [A] one more time yields:
$\vert\vert F_{t+1}-F^{\ast}\vert\vert_{\infty}$=$\vert\vert BF_{t}-BF^{\ast}\vert\vert_{\infty}\leq\gamma\cdot \vert\vert F_{t}-F^{\ast}\vert\vert_{\infty}\leq\gamma^{2}\cdot \vert\vert F_{t-1}-F^{\ast}\vert\vert_{\infty}$…[B]

By mathematical induction, repeating [B] for $n$ more steps gives:
$\vert\vert F_{t+n}-F^{\ast}\vert\vert_{\infty}\leq\gamma^{n+1}\cdot\vert\vert F_{t-1}-F^{\ast}\vert\vert_{\infty}$

We can tell that the difference term keeps shrinking with each transition in $t$, $t+1$,…, and the value of $F$ just converges to $F^{\ast}$. As $n\rightarrow\infty$, we have $\gamma^{n+1}\rightarrow 0$, for $\gamma<1$ just holds, so the bound on the right-hand side vanishes.

If $F$ is evaluated from $F_{t-1}$, then after one application of B its difference from $F^{\ast}$ is no more than $\gamma$ times the difference between the original departing $F_{t-1}$ and the final expected $F^{\ast}$, and every further application of B shrinks that difference by another factor of $\gamma$; hence $F_{t}\rightarrow F^{\ast}$ and value iteration converges.
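To see this bound numerically, below is a minimal sketch (reusing the assumed toy R, P, gamma and the bellman_operator helper from the earlier sketches) that first runs B long enough to stand in for $F^{\ast}$, then checks that every further application of B shrinks the max-norm distance to $F^{\ast}$ by at least a factor of $\gamma$:

```python
# Empirical view of the proof above: the max-norm distance to the fixed point
# shrinks by at least a factor of gamma on every application of B
# (same assumed toy R, P, gamma and bellman_operator as in the earlier sketches).
import numpy as np

F_star = np.zeros_like(R)
for _ in range(2000):                       # run B long enough to stand in for F*
    F_star = bellman_operator(F_star, R, P, gamma)

F = np.ones_like(R) * 10.0                  # an arbitrary starting value function
for t in range(5):
    err_before = np.max(np.abs(F - F_star))
    F = bellman_operator(F, R, P, gamma)
    err_after = np.max(np.abs(F - F_star))
    print(t, err_after, err_after <= gamma * err_before + 1e-12)   # always True
```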

The Bellman Operator Contracts: The Proof

Given $Q_{1}$ and $Q_{2}$, then $\vert\vert BQ_1-BQ_2\vert\vert_\infty\leq\gamma\cdot\vert\vert Q_1-Q_2\vert\vert_\infty$; this says that after we apply B onto $Q_{1}$ and $Q_{2}$, the distance between them is less than or equal to $\gamma$ times the original difference between them before we applied it.

So, by applying the Bellman operator, we move these 2 $Q$ functions closer together.

In this section, we are going to prove that B satisfies the definition of a contraction mapping.

proof:
➀when we apply B onto $Q$:
$\lbrack BQ\rbrack(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$
➁the next thing is to unpack the definition of the max norm.
$\vert\vert BQ_{1}-BQ_{2}\vert\vert_{\infty}$
=$max_{S,A}\vert BQ_{1}(S,A)-BQ_{2}(S,A)\vert$
…we are going to maximize over all state-action pairs
=$max_{S,A}\vert R(S,A)+\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q_{1}(S',A')$
$\;\;$-$R(S,A)-\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q_{2}(S',A')\vert$
=$max_{S,A}\gamma\cdot\vert\sum_{S'}P(S'\vert S,A)\cdot(max_{A'}Q_{1}(S',A')-max_{A'}Q_{2}(S',A'))\vert$
$\leq max_{S,A}\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot\vert max_{A'}Q_{1}(S',A')-max_{A'}Q_{2}(S',A')\vert$
$\leq \gamma\cdot max_{S'}\vert max_{A'}Q_{1}(S',A')-max_{A'}Q_{2}(S',A')\vert$
$\leq \gamma\cdot max_{S',A'}\vert Q_{1}(S',A')-Q_{2}(S',A')\vert$
…the last step uses the non-expansion property of max, proved in the final section below
=$\gamma\cdot\vert\vert Q_{1}-Q_{2}\vert\vert_{\infty}$

By tossing out the weight combination, that is the term $\sum_{S'}P(S'\vert S,A)$ over the $Q$ values, we just worry about the $S'$ where the difference is the largest; the weighted average, or convex combination, could not be larger than the biggest difference at any $S'$, therefore there is no need to keep $max_{S,A}$.

➂$\vert\vert BQ_1-BQ_2\vert\vert_\infty\leq\gamma\cdot\vert\vert Q_1-Q_2\vert\vert_\infty$ is thus proved.

Maximum Is Non-Expansion

For all $f$ and $g$, we have:
$\vert max_{a}f(a)-max_{a}g(a)\vert\leq max_{a}\vert f(a)-g(a)\vert$

proof:
$\Rightarrow$Suppose the maximizing arguments below hold:
➀$argmax_{a}f(a)$=$a_{1}$
➁$argmax_{a}g(a)$=$a_{2}$
And we assume $f(a_{1})\geq g(a_{2})$; this assumption does not lose generality in this proof, since the case $f(a_{1})\leq g(a_{2})$ is proved the same way with the roles of $f$ and $g$ swapped.

$\Rightarrow$since $argmax_{a}g(a)$=$a_{2}$, we have $g(a_{2})\geq g(a_{1})$

$\Rightarrow\vert max_{a}f(a)-max_{a}g(a)\vert$
=$f(a_{1})-g(a_{2})$
$\leq f(a_{1})-g(a_{1})$
$\leq max_{a}\vert f(a)-g(a)\vert$
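A tiny numerical check of the claim, with made-up $f$ and $g$ over a three-action set:

```python
# |max_a f(a) - max_a g(a)| <= max_a |f(a) - g(a)| on made-up f and g.
import numpy as np

f = np.array([1.0, 5.0, 3.0])
g = np.array([4.0, 2.0, 2.5])

lhs = abs(f.max() - g.max())             # |5 - 4| = 1
rhs = np.max(np.abs(f - g))              # max(3, 3, 0.5) = 3
print(lhs, "<=", rhs, ":", lhs <= rhs)   # True
```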

Addendum

Temporal Difference Convergence, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech (CS8803)
An introduction to Q-Learning: reinforcement learning, ADL