Meshing The Rewards - Part 1
29 Jun 2019
Prologue To Meshing The Rewards - Part 1
Why Change The Reward Function In MDP?
[1] A brief summary
Having studied MDPs at length, we know that the state value changes (or accumulates), and this happens with every occurrence of a state transition:
[2] Because of the state transition
➀it begins with an initial configuration.
➁value iteration by the Bellman equation (or operation) produces some policy, maybe non-optimal, suboptimal, or even the final optimal one.
➂value improvement after policy iteration then leads to the very next policy improvement.
➃over and over again, finally reaching convergence (a minimal value-iteration sketch is given below).
We have learned that the state value changes as we transit from the current state to the next. Should the reward function return only fixed constant values? We think it necessary to reshape the reward function.
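To make ➀~➃ concrete, here is a minimal value-iteration sketch, assuming a hypothetical 2-state, 2-action MDP whose transition probabilities and rewards are made up purely for illustration:

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions (all numbers are illustrative only).
# P[s, a, s'] = transition probability, R[s, a] = immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeated Bellman backups until the state values converge."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                      # ➀ initial configuration
    while True:
        Q = R + gamma * P @ V                   # ➁ Bellman operation, shape (S, A)
        V_new = Q.max(axis=1)                   # ➂ value improvement
        if np.max(np.abs(V_new - V)) < tol:     # ➃ repeat until convergence
            return V_new, Q.argmax(axis=1)      # values and greedy policy
        V = V_new

V, policy = value_iteration(P, R, gamma)
print("V* =", V, "greedy policy =", policy)
```

Reshaping the reward, as discussed below, is about changing $R$ in this loop without changing the greedy policy that eventually comes out of it.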
[3] The common issue in AI
It would be easy to just turn the reward function into all zeros; then all policies maximize that reward function, in which case learning is done. But this is not the problem we are trying to solve.
This is a common issue in AI.
[4] The target
The reward function is a way to specify the behavior that is going to get compiled by the learning algorithm into actual behavior.
What we'd like to gain for the agent by changing the reward function is efficiency:
➀speed of computation and the amount of experience the agent needs to learn.
➁space of memory the learning algorithm requires.
➂solvability, infinite versus not infinite.
We want to change the reward function without changing what it is originally optimizing.
Change The Reward Function
[Question]
How do we change the reward function without changing the optimal policy?
[Answer]
We have defined an MDP with states, actions, rewards, transition probabilities and gamma, denoted as $<S,A,R,P,\gamma>$.
By messing with $R$ only, and leaving all the rest unchanged, we can do any of the following (a small sketch of all three follows the list):
➀multiply the reward function by any positive constant
➁shift the reward function by a constant
➂non-linear potential-based rewards
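As a rough sketch of these three options, the snippet below writes each transformation as a function over a tabular reward; the constants and the potential function `phi` are illustrative assumptions, not prescriptions:

```python
import numpy as np

def scale_reward(R, c=2.0):
    """➀ multiply the reward function by any positive constant: R'(s,a) = c * R(s,a)."""
    assert c > 0
    return c * R

def shift_reward(R, c=0.5):
    """➁ shift the reward function by a constant: R'(s,a) = R(s,a) + c."""
    return R + c

def potential_shaping(R, P, phi, gamma=0.9):
    """➂ potential-based shaping, written in expectation over the next state:
    R'(s,a) = R(s,a) + gamma * E[phi(s') | s,a] - phi(s),
    where phi may be any (possibly non-linear) potential function over states."""
    expected_next_phi = P @ phi          # shape (n_states, n_actions)
    return R + gamma * expected_next_phi - phi[:, None]
```

Here `phi` is a hypothetical potential function over states; the first two transformations are exactly what the illustrations below work through, and keeping the shaping term in the $\gamma\cdot\Phi(S')-\Phi(S)$ form is the standard way to keep the optimal policy intact.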
Multiply $R$ By Positive Constant: Illustration
[Question]
$Q(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$
[Answer]
This is the Bellman equation we are already familiar with. Multiply $R(S,A)$ by a constant $c$ to get the new reward function, that is $R'(S,A)$=$c\cdot R(S,A)$, where $c>0$; then what is $Q'(S,A)$?
We are asking for $Q'(S,A)$ in terms of $Q(S,A)$, so let's do the illustration:
$Q'(S,A)$
=$R'(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q'(S',A')$
=$c\cdot R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}($
$\;\;c\cdot R(S',A')$+$\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}Q'(S'',A''))$
where we have
$Q'(S',A')$
=$c\cdot R(S',A')$+$\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}Q'(S'',A'')$
in turn, we have
$Q'(S'',A'')$
=$c\cdot R(S'',A'')$+$\gamma\cdot\sum_{S'''}P(S'''\vert S'',A'')\cdot max_{A'''}Q'(S''',A''')$
still more, we have
$Q'(S''',A''')$
=$c\cdot R(S''',A''')$+$\gamma\cdot\sum_{S''''}P(S''''\vert S''',A''')\cdot max_{A''''}Q'(S'''',A'''')$
therefore,
$Q'(S,A)$
=$c\cdot R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}($
$\;\;c\cdot R(S',A')$+$\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}Q'(S'',A''))$
=$c\cdot (R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}($
$\;\;R(S',A')+\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}(R(S'',A'')+…)))$
=$c\cdot Q(S,A)$
[Notes::1] Multiplying $R$ by zero loses all information
If you multiply the reward function by zero, each state transition brings zero as the return, all state values will be zero, and you lose all the information; every policy then trivially maximizes the all-zero reward function.
[Notes::2] Multiplying $R$ by a negative constant maximizes the pain
If you multiply the reward function by a negative value, the convergence would end up maximizing the pain, unless policy iteration is specially designed to converge to the least pain.
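As a quick numerical sanity check of $Q'(S,A)$=$c\cdot Q(S,A)$, the sketch below builds the same kind of tiny, made-up MDP as before, solves it once with $R$ and once with $c\cdot R$, and compares the resulting Q tables and greedy policies (every number here is an assumption for illustration only):

```python
import numpy as np

# Tiny illustrative MDP, as in the earlier sketch.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, c = 0.9, 3.0

def solve_q(P, R, gamma, iters=2000):
    """Q-value iteration: Q <- R + gamma * sum_s' P * max_a' Q."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

Q  = solve_q(P, R, gamma)
Qc = solve_q(P, c * R, gamma)          # multiply R by a positive constant c

print(np.allclose(Qc, c * Q))                                # True: Q' = c * Q
print(np.array_equal(Q.argmax(axis=1), Qc.argmax(axis=1)))   # same greedy policy
```

Scaling by $c=0$ instead collapses both Q tables to all zeros, and a negative $c$ generally changes which actions look best, as the two notes above describe.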
Add Scalar To $R$: Illustration
[Question]
$Q(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$
[Answer]
This is the Bellman equation we are already familiar with. Add a scalar $c$ to $R(S,A)$ to get the new reward function, that is $R'(S,A)$=$R(S,A)+c$; then what is $Q'(S,A)$?
We are asking for $Q'(S,A)$ in terms of $Q(S,A)$, so let's do the illustration:
$Q'(S,A)$
=$R'(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q'(S',A')$
=$R(S,A)+c$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}($
$\;\;R(S',A')+c$+$\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}Q'(S'',A''))$
where we have
$Q'(S'',A'')$
=$R(S'',A'')+c$+$\gamma\cdot\sum_{S'''}P(S'''\vert S'',A'')\cdot max_{A'''}Q'(S''',A''')$
therefore,
[Notes] Adding a scalar makes no impact on the original policy
$Q'(S,A)$
=$R(S,A)+c$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}($
$\;\;R(S',A')+c$+
$\;\;\gamma\cdot\sum_{S''}P(S''\vert S',A')\cdot max_{A''}(R(S'',A'')+c+…))$
=$Q(S,A)$+$c$+$\gamma\cdot c$+$\gamma^{2}\cdot c$+$\gamma^{3}\cdot c$+…
=$Q(S,A)$+$\frac {c}{1-\gamma}$
To verify, we claim that $Q(S,A)$+$\frac {c}{1-\gamma}$ satisfies the new Bellman equation:
$Q(S,A)$+$\frac {c}{1-\gamma}$
=$R(S,A)$+$c$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}(Q(S',A')+\frac {c}{1-\gamma})$
proof:
$R(S,A)$+$c$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}(Q(S',A')+\frac {c}{1-\gamma})$
=$R(S,A)$+$c$
$\;\;$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$
$\;\;$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}\frac {c}{1-\gamma}$…➀
where the term ➀ could be further expanded:
$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}\frac {c}{1-\gamma}$
=$\gamma\cdot max_{A'}\frac {c}{1-\gamma}$
=$\gamma\cdot \frac {c}{1-\gamma}$
since the constant $\frac {c}{1-\gamma}$ doesn't depend on $A'$ and $\sum_{S'}P(S'\vert S,A)$=$1$
therefore, we have
=$R(S,A)$+$c$
$\;\;$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$
$\;\;$+$\gamma\cdot \frac {c}{1-\gamma}$
in turn, $c$+$\gamma\cdot \frac {c}{1-\gamma}$=$\frac {c}{1-\gamma}$
finally,
=$R(S,A)$+$\frac {c}{1-\gamma}$
$\;\;$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$
=$Q(S,A)$+$\frac {c}{1-\gamma}$, which is thus proved, since the Bellman equation gives $Q(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q(S',A')$.
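To complement the proof, here is a matching numerical check of $Q'(S,A)$=$Q(S,A)$+$\frac{c}{1-\gamma}$, again on a tiny made-up MDP with an illustrative shift $c$:

```python
import numpy as np

# Tiny illustrative MDP, as in the earlier sketches.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, c = 0.9, 0.7

def solve_q(P, R, gamma, iters=2000):
    """Q-value iteration: Q <- R + gamma * sum_s' P * max_a' Q."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

Q       = solve_q(P, R, gamma)
Q_shift = solve_q(P, R + c, gamma)     # shift the reward function by c

print(np.allclose(Q_shift, Q + c / (1 - gamma)))                  # True
print(np.array_equal(Q.argmax(axis=1), Q_shift.argmax(axis=1)))   # same greedy policy
```

Because the same constant $\frac{c}{1-\gamma}$ is added to every entry of the Q table, the $max_{A'}$ picks the same actions as before, so the original policy is untouched.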