
Meshing The Rewards - Part 1

Prologue To Meshing The Rewards - Part 1

In order to avoid endless looping around a local suboptimum, the suggested approach is to reshape the reward function without impacting the original optimal policy. The reward function should contain more than just a fixed constant return, otherwise we lose the benefit of the value obtained in the current state when transiting from the current state to the next state.

Why To Change The Reward Function In MDP?

[1] A brief summary

From our extensive study of MDPs, we know that the state value changes (or accumulates) with every occurrence of a state transition:
➀it begins with an initial configuration.
➁value iteration by the Bellman equation (or operator) comes out with some policy, maybe non-optimal, suboptimal, or even the final optimal one.
➂value improvement after policy iteration then leads to the very next policy improvement.
➃over and over again, finally reaching convergence.

[2] Because of the state transition

We have learned that the state value changes as we transit from the current state to the next state. Should the reward function return only fixed constant values? We think it mandatory to reshape the reward function.

[3] The common issue in AI

It would be easy to just turn the reward function into all zeros; then all policies maximize that reward function, in which case learning is done. But this is not the problem we are trying to solve.

This is a common issue in AI.

[4] The target

The reward function is a way to specify the behavior that is going to get compiled by the learning algorithm into actual behavior.

What we'd like to gain by changing the reward function, while keeping the same semantics for the agent, is efficiency:
speed of computation and the amount of experience the agent needs to learn.
space of memory the learning algorithm requires.
solvability, infinity versus non-infinity.

We want to change the reward function without changing what it's originally optimizing.

Change The Reward Function

[Question]

How to change the reward function without changing the optimal policy?

[Answer]

We have defined the MDP by its states, actions, rewards, transition probabilities and gamma, denoted as $\langle S,A,R,P,\gamma\rangle$ (a minimal coded sketch of such an MDP is given after the list below).

By messing with $R$ only, and leaving all the rest unchanged, we could:
multiply the reward function by any positive constant
shift the reward function by a constant
use non-linear potential-based rewards
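
As a concrete anchor for the illustrations that follow, here is a minimal sketch (my own, not from the lecture) of such an MDP $\langle S,A,R,P,\gamma\rangle$ together with plain Q-value iteration. The two-state, two-action numbers and the helper name `q_value_iteration` are illustrative assumptions only.

```python
# A made-up 2-state / 2-action MDP <S, A, R, P, gamma>; all numbers are
# arbitrary and exist only to demonstrate the reward manipulations below.
import numpy as np

gamma = 0.9

# R[s, a]: immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

# P[s, a, s']: transition probability to s'; each P[s, a, :] sums to 1
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

def q_value_iteration(R, P, gamma, iters=500):
    """Iterate Q(S,A) = R(S,A) + gamma * sum_S' P(S'|S,A) * max_A' Q(S',A')."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        # P @ Q.max(axis=1) is the expectation over S' of max_A' Q(S',A')
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

Q = q_value_iteration(R, P, gamma)
print(Q)                  # optimal action-values
print(Q.argmax(axis=1))   # greedy (optimal) policy per state
```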

Multiply $R$ By Positive Constant: Illustration

[Question]

$Q(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$
This is the Bellman equation we are already familiar with. Multiply $R(S,A)$ by a constant $c$ to get the new reward function, that is $R'(S,A)$=$c\cdot R(S,A)$, where $c>0$; then what is $Q'(S,A)$?

[Answer]

We are asking for $Q'(S,A)$ in terms of $Q(S,A)$; let's expand it:
$Q'(S,A)$
=$R'(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q'(S',A')$
=$c\cdot R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}($
$\;\;c\cdot R(S',A')$+$\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}Q'(S'',A''))$

where we have
$Q'(S',A')$
=$c\cdot R(S',A')$+$\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}Q'(S'',A'')$

in turn, we have
$Q'(S'',A'')$
=$c\cdot R(S'',A'')$+$\gamma\cdot\sum_{S'''}P(S'''\vert S'',A'')\cdot max_{A'''}Q'(S''',A''')$

still more, we have
$Q'(S''',A''')$
=$c\cdot R(S''',A''')$+$\gamma\cdot\sum_{S''''}P(S''''\vert S''',A''')\cdot max_{A''''}Q'(S'''',A'''')$

therefore, since $c>0$, $max_{A'}(c\cdot x)$=$c\cdot max_{A'}(x)$, and $c$ can be factored out of the entire expansion:
$Q'(S,A)$
=$c\cdot R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}($
$\;\;c\cdot R(S',A')$+$\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}Q'(S'',A''))$
=$c\cdot (R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}($
$\;\;R(S',A')+\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}(R(S'',A'')+…)))$
=$c\cdot Q(S,A)$

[Notes::1] Multiplying $R$ by zero makes everything the same

If you multiply the reward function by zero, each state transition brings zero as the return, every state's value becomes zero, and you lose all the information.

[Notes::2] Multiplying $R$ by a negative value maximizes the pain

If you multiply the reward function by a negative value, then the convergence would end up maximizing the pain, unless policy iteration is specially designed to converge to the least pain.
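
A quick numerical check of the derivation above, under the same made-up two-state MDP as in the earlier sketch (again my own illustration, not the lecture's): scaling $R$ by a positive constant $c$ scales every Q-value by $c$ and leaves the greedy policy untouched, while $c=0$ or $c<0$ reproduces the two notes above.

```python
# Multiplying the reward function by a positive constant c scales every
# Q-value by c and leaves the greedy policy unchanged (made-up MDP).
import numpy as np

gamma, c = 0.9, 3.0
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

def q_value_iteration(R, P, gamma, iters=500):
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

Q  = q_value_iteration(R, P, gamma)        # original reward function R
Qc = q_value_iteration(c * R, P, gamma)    # R'(S,A) = c * R(S,A), c > 0

print(np.allclose(Qc, c * Q))                          # True: Q'(S,A) = c * Q(S,A)
print((Q.argmax(axis=1) == Qc.argmax(axis=1)).all())   # True: same greedy policy
# Setting c = 0 collapses every Q-value to zero (all information lost);
# a negative c makes the argmax chase the worst returns ("maximize the pain").
```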

Add Scalar To $R$: Illustration

[Question]

$Q(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$
This is the Bellman equation we are already familiar with. Add a scalar $c$ to $R(S,A)$ to get the new reward function, that is $R'(S,A)$=$R(S,A)+c$; then what is $Q'(S,A)$?

[Answer]

We are asking for $Q'(S,A)$ in terms of $Q(S,A)$; let's expand it:
$Q'(S,A)$
=$R'(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q'(S',A')$
=$R(S,A)+c$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}($
$\;\;R(S',A')+c$+$\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}Q'(S'',A''))$

where we have
$Q'(S'',A'')$
=$R(S'',A'')+c$+$\gamma\cdot\sum_{S'''}P(S'''\vert S'',A'')\cdot max_{A'''}Q'(S''',A''')$

therefore,
$Q'(S,A)$
=$R(S,A)+c$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}($
$\;\;R(S',A')+c$+
$\;\;\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}(R(S'',A'')+c+…))$
=$Q(S,A)$+$c$+$\gamma\cdot c$+$\gamma^{2}\cdot c$+$\gamma^{3}\cdot c$+…
=$Q(S,A)$+$\frac {c}{1-\gamma}$
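
Again a minimal numerical check (my own sketch under the same made-up MDP, not from the lecture): adding a scalar $c$ to every reward shifts every Q-value by $\frac {c}{1-\gamma}$ and leaves the greedy policy unchanged.

```python
# Adding a scalar c to every reward shifts every Q-value by c/(1-gamma)
# and leaves the greedy policy unchanged (made-up MDP).
import numpy as np

gamma, c = 0.9, 5.0
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

def q_value_iteration(R, P, gamma, iters=500):
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

Q  = q_value_iteration(R, P, gamma)        # original reward function R
Qs = q_value_iteration(R + c, P, gamma)    # R'(S,A) = R(S,A) + c

print(np.allclose(Qs, Q + c / (1 - gamma)))            # True: Q' = Q + c/(1-gamma)
print((Q.argmax(axis=1) == Qs.argmax(axis=1)).all())   # True: same greedy policy
```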

[Notes] Adding a scalar makes no impact on the original policy

$Q(S,A)$+$\frac {c}{1-\gamma}$
=$R(S,A)$+$c$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}(Q(S',A')+\frac {c}{1-\gamma})$

proof:
$R(S,A)$+$c$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}(Q(S',A')+\frac {c}{1-\gamma})$
=$R(S,A)$+$c$
$\;\;$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$
$\;\;$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}\frac {c}{1-\gamma}$…➀

where the term ➀ could be further expanded:
$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}\frac {c}{1-\gamma}$
=$\gamma\cdot max_{A'}\frac {c}{1-\gamma}$
=$\gamma\cdot \frac {c}{1-\gamma}$
since the constant $\frac {c}{1-\gamma}$ doesn't depend on $A'$ and $\sum_{S'}P(S'\vert S,A)$=$1$

therefore, we have
=$R(S,A)$+$c$
$\;\;$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$
$\;\;$+$\gamma\cdot \frac {c}{1-\gamma}$

in turn, $c$+$\gamma\cdot \frac {c}{1-\gamma}$=$\frac {c\cdot(1-\gamma)+\gamma\cdot c}{1-\gamma}$=$\frac {c}{1-\gamma}$

finally,
=$R(S,A)$+$\frac {c}{1-\gamma}$
$\;\;$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$
=$Q(S,A)$+$\frac {c}{1-\gamma}$, which completes the proof
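
The fixed-point property just proved can also be checked numerically. Below is a small sketch (my own, same made-up MDP as before): applying the Bellman optimality operator with reward $R+c$ to $Q$+$\frac {c}{1-\gamma}$ returns $Q$+$\frac {c}{1-\gamma}$ itself.

```python
# Check that Q + c/(1-gamma) is a fixed point of the Bellman optimality
# operator defined with the shifted reward R + c (made-up MDP).
import numpy as np

gamma, c = 0.9, 5.0
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

def bellman_operator(Q, R, P, gamma):
    """One application of Q <- R + gamma * sum_S' P(S'|S,A) * max_A' Q(S',A')."""
    return R + gamma * P @ Q.max(axis=1)

# Solve the original MDP by iterating the operator to numerical convergence.
Q = np.zeros_like(R)
for _ in range(500):
    Q = bellman_operator(Q, R, P, gamma)

shifted = Q + c / (1 - gamma)
print(np.allclose(bellman_operator(shifted, R + c, P, gamma), shifted))  # True
```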

Addendum

Messing with rewards, Charles Isbell, Michael Littman, Reinforcement Learning by Georgia Tech (CS8803)