
Temporal Difference In Q Form

Prologue To The Temporal Difference In Q Form

Rather than learning only state values, the TD (Temporal Difference) rule with action control in Q form learns the values of the different actions we might take, as a way to figure out the optimal (maximum-valued) behavior in each state. Given enough data and enough time, TD with action control will do the right thing and eventually converge.

Glance At The Q-Learning

[1] Overview

Here is a brief, simplistic overview of Q-learning:
➀Q-learning is a value-based reinforcement learning algorithm; it uses a Q function to find the optimal (best) policy, that is, to choose in each state the action that maximizes that state's value.
➁the Q function takes state and action as input parameters.
$Q^{\pi}(S_{t},A_{t})$=$E\lbrack R_{t+1}+\gamma\cdot R_{t+2}+\gamma^{2}\cdot R_{t+3}+…\vert S_{t},A_{t}\rbrack$

The $\pi$ indicates that the $Q$ function, under policy $\pi$, can be expressed as the expected discounted cumulative reward, given $S_{t}$ and $A_{t}$.
➂take action $A$ to transition from state $S$ to $S'$
$Q_{T}(S,A)$
=$Q_{T-1}(S,A)$+$\alpha_{T}\cdot(R(S,A)$+$\gamma\cdot max_{A’}Q_{T-1}(S’,A’)$-$Q_{T-1}(S,A))$…[A]

By repeatedly applying the above Bellman equation in Q form, the Q value eventually converges, usually denoted $Q^{*}(S,A)$; the policy is then to take, in state $S$, the action $A$ that yields this maximum value.
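As a minimal sketch of equation [A] in code (the table `q`, the toy states, and the hyper-parameters below are illustrative, not part of the original derivation):

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """One application of equation [A]:
    Q_T(S,A) = Q_{T-1}(S,A) + alpha_T * (R(S,A) + gamma * max_A' Q_{T-1}(S',A') - Q_{T-1}(S,A))."""
    td_target = r + gamma * np.max(q[s_next])   # R(S,A) + gamma * max_A' Q_{T-1}(S',A')
    q[s, a] += alpha * (td_target - q[s, a])    # move Q(S,A) toward the TD target
    return q

# toy usage: 3 states, 2 actions
q = np.zeros((3, 2))
q = q_learning_update(q, s=0, a=1, r=1.0, s_next=2, alpha=0.1, gamma=0.9)
```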

[2] Notes::mjtsai1974

As to equation [A], it could be further expanded:
$Q_{T}(S,A)$
=$Q_{T-1}(S,A)$+$\alpha_{T}\cdot(R(S,A)$+$\gamma\cdot max_{A’}Q_{T-1}(S’,A’)$-$Q_{T-1}(S,A))$…[A]
=$Q_{T-1}(S,A)$+$\alpha_{T}\cdot(R(S,A)$+$\gamma\cdot (\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q_{T-1}(S’,A’))$-$Q_{T-1}(S,A))$…[B]

Call this [B]; the term $(\sum_{S'}P(S'\vert S,A)\cdot max_{A'}Q_{T-1}(S',A'))$ in [B] plays the role of $max_{A'}Q_{T-1}(S',A')$ in [A].

➀if we know the transition probabilities from $S$ to $S'$ and the associated immediate reward, we can take advantage of [B].
➁if we know neither the transition probability distribution out of state $S$ nor the immediate reward, we can just use [A], taking only the Q value learned in the last episode; both cases are sketched below.
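A small sketch contrasting the two backups; the reward table `R` and transition tensor `P` stand in for a known MDP model and are purely hypothetical names:

```python
import numpy as np

def backup_sampled(q, s, a, r, s_next, alpha, gamma):
    """Equation [A]: bootstrap from the single sampled next state S'."""
    return q[s, a] + alpha * (r + gamma * np.max(q[s_next]) - q[s, a])

def backup_expected(q, s, a, R, P, alpha, gamma):
    """Equation [B]: average max_A' Q_{T-1}(S',A') over the known P(S'|S,A)."""
    expected_max = np.dot(P[s, a], np.max(q, axis=1))   # sum_S' P(S'|S,A) * max_A' Q(S',A')
    return q[s, a] + alpha * (R[s, a] + gamma * expected_max - q[s, a])
```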

As we can see, the Q form is just expressed in terms of temporal difference.

The Q form is quite useful in reinforcement learning: when we know nothing about the immediate reward and the transition probability distribution from the current state to the next state is uncertain, it is convenient to use this Q value as the learned experience.

Bellman Equation: With versus Without Actions

[1] Without actions

➀no action in the value function:
$V(S)$=$R(S)$+$\gamma\cdot\sum_{S'}P(S'\vert S, A)\cdot V(S')$
The value function $V(S)$ of state $S$ takes no action control into consideration; the term $P(S'\vert S, A)$ indicates that we are in a stochastic MDP model, where the outcome of action execution is uncertain.
➁$TD(0)$ isomorphism:
$V_{T}(S_{t-1})$=$V_{T-1}(S_{t-1})$+$\alpha_{T}\cdot (r_{t}+\gamma\cdot V_{T-1}(S_{t})-V_{T-1}(S_{t-1}))$
We use $TD(0)$ to update the value of $S_{t-1}$ when transitioning from $S_{t-1}$ to $S_{t}$ within one episode, as sketched below.
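A minimal sketch of this TD(0) value backup; the value table `v` and the toy inputs are illustrative:

```python
import numpy as np

def td0_value_update(v, s_prev, r, s_curr, alpha, gamma):
    """TD(0) backup of V(S_{t-1}) after observing reward r_t and landing in S_t."""
    v[s_prev] += alpha * (r + gamma * v[s_curr] - v[s_prev])
    return v

v = np.zeros(4)                                   # toy value table over 4 states
v = td0_value_update(v, s_prev=0, r=1.0, s_curr=1, alpha=0.1, gamma=0.9)
```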

[2] With actions

➀the Q form value function:
$Q(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S’}P(S’\vert S, A)\cdot max_{A’}Q(S’,A’)$
The value function now takes the action into account: after transitioning from $S$ to $S'$, it targets the action $A'$ chosen in $S'$ that maximizes the value of $S'$, i.e. the term $Q(S',A')$.
➁$TD(0)$ isomorphism:
$Q_{T}(S_{t-1},A_{t-1})$=$Q_{T-1}(S_{t-1},A_{t-1})$+$\alpha_{T}\cdot (r_{t}+\gamma\cdot max_{A_{t}}Q_{T-1}(S_{t},A_{t})-Q_{T-1}(S_{t-1},A_{t-1}))$
We use $TD(0)$ to update the value of $S_{t-1}$ within one episode; note that the term $max_{A_{t}}Q_{T-1}(S_{t},A_{t})$ aims at the action $A_{t}$ chosen in $S_{t}$ that maximizes the value of $S_{t}$, i.e. the term $Q_{T-1}(S_{t},A_{t})$. If such an $A_{t}$ exists and maximizes $Q_{T}(S_{t},A_{t})$ across all episodes $T$, then the policy is simply to choose $A_{t}$.
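A sketch of one episode of these Q-form backups; the tiny deterministic chain environment `step` below is purely illustrative and not part of the original post:

```python
import numpy as np

def step(s, a):
    """Toy chain of 4 states: action 1 moves right, action 0 stays; reward 1 on reaching state 3."""
    s_next = min(s + a, 3)
    return (1.0 if s_next == 3 else 0.0), s_next

q = np.zeros((4, 2))
alpha, gamma = 0.1, 0.9
s = 0
while s != 3:                                     # one episode
    a = np.random.randint(2)                      # exploratory action choice
    r, s_next = step(s, a)
    # TD(0) Q backup of the previous state-action pair (S_{t-1}, A_{t-1})
    q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
    s = s_next
```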

Approximations

[1] We know the whole MDP model

If we know the whole MDP model, we can perform a synchronous update simply by applying the Bellman equation:
$Q_{T}(S,A)$=$R(S,A)$+$\gamma\cdot\sum_{S’}P(S’\vert S,A)\cdot max_{A’}Q_{T-1}(S’,A’)$
➀the term $R(S,A)$ is the immediate reward, and $A$ is the best action taken in state $S$ to transition to $S'$. Because the whole MDP model is known, $R(S,A)$ is known.
➁the synchronous update sums over all probabilistic transitions from $S$ to $S'$, given action $A$.
➂we take the old $Q_{T-1}(S',A')$ value estimated in episode $T-1$ and apply the Bellman update to each specific state-action pair, as sketched below.
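A sketch of one synchronous sweep, assuming hypothetical arrays `R[s, a]` (immediate reward) and `P[s, a, s_next]` (transition probability) for the known MDP model:

```python
import numpy as np

def synchronous_q_sweep(q, R, P, gamma):
    """One synchronous Bellman backup over every (S, A) pair:
    Q_T(S,A) = R(S,A) + gamma * sum_S' P(S'|S,A) * max_A' Q_{T-1}(S',A')."""
    v_next = np.max(q, axis=1)          # max_A' Q_{T-1}(S',A') for every S'
    return R + gamma * (P @ v_next)     # contract P(S'|S,A) against v_next over S'
```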

[2] The whole MDP model is unknown, but $Q_{T-1}^{\ast}(S_{t},A_{t})$ is known

When we transition from $S_{t-1}$ to $S_{t}$ by taking action $A_{t-1}$, if we know $Q_{T-1}^{\ast}(S_{t},A_{t})$, we can use it to learn $Q_{T}^{\ast}(S_{t-1},A_{t-1})$ by a sampled, asynchronous update.
$Q_{T}(S_{t-1},A_{t-1})$=$Q_{T-1}(S_{t-1},A_{t-1})$+$\alpha_{T}\cdot(r_{t}+\gamma\cdot max_{A_{t}}Q_{T-1}^{\ast}(S_{t},A_{t})-Q_{T-1}(S_{t-1},A_{t-1}))$
Here we can take $Q^{\ast}(S_{t-1},A_{t-1})$=$Q_{T}(S_{t-1},A_{t-1})$; the $Q_{T}(S_{t-1},A_{t-1})$ learned this way eventually converges to $Q^{\ast}(S_{t-1},A_{t-1})$, as sketched below.
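A sketch of the sampled, asynchronous version; `sample_transition` is a hypothetical stand-in for interacting with the unknown MDP, and the decaying learning rate is one common choice, not something specified in the original post:

```python
import numpy as np

def asynchronous_q_learning(q, sample_transition, iterations, gamma):
    """Repeatedly apply the sampled backup with a decaying learning rate alpha_T;
    sample_transition() is assumed to return one observed (s, a, r, s_next)."""
    for t in range(1, iterations + 1):
        alpha = 1.0 / t                           # decaying alpha_T
        s, a, r, s_next = sample_transition()
        q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
    return q
```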

[3] Inspiration::mjtsai1974

The $Q^{\ast}$ value of $(S,A)$ could be estimated from the given data. Suppose we are evaluating $Q^{\ast}(S_{t-1},A_{t-1})$; here are the points and concerns:
➀should we estimate $Q^{\ast}(S_{t-1},A_{t-1})$ based on the whole given data and use it for $Q_{t}^{\ast}(S_{t-1},A_{t-1})$, where $t\in all\;T$? Or,
➁should we estimate $Q_{T-1}^{\ast}(S_{t-1},A_{t-1})$ from the samples of episodes $1$ to $T-1$, when we are now evaluating in episode $T$, the $T$-th episode?
➂if we choose ➀, would we just ignore the partially observable mechanism in this learning process? Or,
➃first learn by ➀, then learn by ➁; after the end is reached, estimate each distinct $Q(S,A)$ pair, repeat each iteration of learning in ➁ with respect to each $Q(S,A)$ in each episode of transition from $S_{t-1}$ to $S_{t}$, and finally make an adjustment in each learning episode in ➁ by tuning the learning rate or discount factor!!!
➄following ➃, compare with the learning result in ➀ until the norm of the difference has been brought to a minimum, treating the learning results in ➀ and ➁ as vectors (see the sketch below).

In this scenario of learning by ➀ and then by ➁ to get the learning rate and discount factor to converge, we assume the currently given data is the complete data set for the complete MDP model we know so far; the next time something that was unobservable in the last session is observed in this session, it simply triggers the whole learning process again in the very next session!!!
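As a tiny sketch of the comparison in ➄, treating the two learning results as vectors; both Q tables below are hypothetical inputs:

```python
import numpy as np

def q_tables_converged(q_batch, q_episodic, tol=1e-3):
    """True when the norm of the difference between the Q table estimated from the
    whole data set (➀) and the Q table learned episode by episode (➁) is small."""
    return np.linalg.norm(q_batch - q_episodic) < tol
```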

Addendum

Temporal Difference Convergence, Charles Isbell, Michael Littman, Reinforcement Learning by Georgia Tech (CS8803)
An introduction to Q-Learning: reinforcement learning, ADL