
Temporal Difference Learning - Part 2

Prologue To The Temporal Difference Learning - Part 2

This is part 2; it newly introduces TD(0), continues with the pros and cons of TD(0) and TD(1), and comes to a conclusion with TD(λ).

TD(0) Algorithm: λ=0

This is obtained by taking λ=0 in the TD(λ) algorithm.

[The rule]

Episode T:
For all $S$, $e(S)=0$ at the start of the episode, $V_T(S)=V_{T-1}(S)$
After $S_{t-1}\xrightarrow{r_t}S_t$ (from step $t-1$ to $t$ with reward $r_t$):
$e(S_{t-1})=e(S_{t-1})+1$:
Update the eligibility of $S_{t-1}$ after arriving at $S_t$
For all $S$,
$V_T(S)=V_{T-1}(S)+\alpha_T(r_t+\gamma V_{T-1}(S_t)-V_{T-1}(S_{t-1}))$…[A]
$e(S_{t-1})=0\cdot\gamma\cdot e(S_{t-1})=0$:
Before transiting from $S_{t-1}$ to $S_t$ in the next iteration, we can easily tell that the eligibility of state $S$ is always zero.

[Notes]

➀The 2nd part of [A] is the reward plus the discounted value of the state we just arrived at, minus the value of the state we just left, where these state values are all evaluated in the last iteration. This is exactly the temporal difference.
➁We apply the temporal difference to $S$ itself only, in no proportion to the eligibility of any other state, and a learning rate is specified because we don't want the value to move too much.
➂After the state has been updated, decay or decrease its eligibility by λγ with their given values; in TD(0), λ=0, so the eligibility of state $S$ is always zero.
➃Then we back up to the next state.

[Caution]

➀All the states $S$ are handled in the same way.
➁The value at state $S$ (the $S$ in [A]) is updated by the quantity $r_t+\gamma V_{T-1}(S_t)-V_{T-1}(S_{t-1})$, which is the same for every state and does not depend on which $S$ we are updating, while $e(S)=1$ is specific to the state $S$ we are evaluating (looking at) at that moment.
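
To make the rule concrete, below is a minimal Python sketch of one TD(0) episode following [A]. The function name, the `(s_prev, reward, s_next)` step layout, and the parameter defaults are illustrative assumptions, not code from the lecture.

```python
# Minimal TD(0) sketch of rule [A]: only the state we just left is updated,
# and since lambda = 0 its eligibility is immediately reset to zero, so no
# explicit eligibility trace needs to be stored.
def td0_episode(v_prev, steps, alpha=0.1, gamma=1.0):
    """v_prev: dict state -> V_{T-1}(S); steps: list of (s_prev, reward, s_next)."""
    v = dict(v_prev)  # V_T(S) = V_{T-1}(S) at the start of the episode
    for s_prev, r, s_next in steps:
        # temporal difference, with all values evaluated in the last episode
        td = r + gamma * v_prev[s_next] - v_prev[s_prev]
        v[s_prev] += alpha * td  # update S_{t-1} only; e(S_{t-1}) -> 0
    return v
```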

MLE (Maximum Likelihood Estimate) And TD(0) Algorithm

[1] MLE

Given data consisting of many trajectories, suppose we are in $S_{t-1}$ and we don't know which state we are going to end up in. Still, there exists some probability over $r_t+\gamma V_{T-1}(S_t)-V_{T-1}(S_{t-1})$.

If we take the expectation of [A], then:
$V_T(S_{t-1})=E_{S_t}[r_t+\gamma V_{T-1}(S_t)]$…[B]
We can treat this as the MLE of $V_T(S_{t-1})$ over all possible $V_{T-1}(S_t)$. What we are doing is just sampling different possible values of $S_t$, that is $V_{T-1}(S_t)$, crossing over different trajectories, since we have a dataset of trajectories.

We are just taking an expectation, over whichever state we get as the next state, of the reward plus the discounted estimated value (evaluated in the last episode) of that next state.

[2] Why is [B] supposed to be the MLE?

[B] is the MLE

proof::mjtsai1974

$V_T(S_{t-1})$
$=V_{T-1}(S_{t-1})+\alpha_T(r_t+\gamma V_{T-1}(S_t)-V_{T-1}(S_{t-1}))$
$=V_{T-1}(S_{t-1})\cdot(1-\alpha_T)+\alpha_T\cdot(r_t+\gamma V_{T-1}(S_t))$…[B.1]
➀This is the value of $S_{t-1}$ in episode T when transiting from $S_{t-1}$ to $S_t$; remember that we are given data of many trajectories, each containing state transitions.
➁Suppose we are in $S_{t-1}$ in one trajectory and the next state $S_t$ does exist; there are $k$ occurrences of such an $S_t$ and $n-k$ occurrences of other next states, out of $n$ trajectories in total in the given data. Therefore, there exists some probability $P(S_t|S_{t-1})=\frac{k}{n}$.
➂$V_T(S_{t-1})$
$=\sum_{S_t}P(S_t|S_{t-1})\cdot[B.1]$
$=E_{S_t}[r_t+\gamma V_{T-1}(S_t)]$ just holds,
where $V_{T-1}(S_{t-1})\cdot(1-\alpha_T)$ and $\alpha_T\cdot(r_t+\gamma V_{T-1}(S_t))$ are values varying with each trajectory; the former relates to the departing state, the latter to the arriving state, and $\alpha_T$ is the learning rate, which depends on how you would like the learning process to behave. I just take the term $r_t+\gamma V_{T-1}(S_t)$ as the random variable over which the expectation is taken.
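
The counting argument in ➁ is easy to mirror in code. Below is a minimal sketch that estimates $P(S_t|S_{t-1})=\frac{k}{n}$ from the trajectories and then forms the expectation [B]; the function name, the `(s_prev, reward, s_next)` layout, and the deterministic per-transition reward are assumptions for illustration.

```python
from collections import Counter

# Sketch of the MLE expectation [B]: count the transitions out of a given state
# to estimate P(S_t | S_{t-1}) = k/n, then average r_t + gamma * V_{T-1}(S_t).
def mle_backup(trajectories, s_from, v_prev, gamma=1.0):
    """trajectories: list of episodes, each a list of (s_prev, reward, s_next).
    Assumes the reward on a given transition is deterministic, as in the example."""
    counts, rewards = Counter(), {}
    for episode in trajectories:
        for s_prev, r, s_next in episode:
            if s_prev == s_from:
                counts[s_next] += 1
                rewards[s_next] = r
    n = sum(counts.values())
    return sum((k / n) * (rewards[s] + gamma * v_prev[s]) for s, k in counts.items())
```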

Illustration: The Idea Of TD(0) After TD(1)

[1] The given example

Succeeding the same example as in the part 1 article, this time we are given data of 5 trajectories; the red quantity is the value of the state after backup propagation with γ=1. Recall that each distinct state's initial value is 0. We'd like to ask for the value of $S_2$ in the 5th episode.

[2] By using TD(1)

➀We can deduce the temporal difference term:
$\Delta V_T(S_2)$
$=r_2+\gamma r_3+\gamma^2 r_5+\gamma^3 V_{T-1}(S_F)-V_{T-1}(S_2)$
$=12$
➁$V_T(S_2)$
$=V_{T-1}(S_2)+\Delta V_T(S_2)$
$=0+12$
$=12$
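
As a quick numeric check of the outcome-based computation above, here is a sketch assuming the rewards $r_2=2$, $r_3=0$, $r_5=10$ read off the 5th trajectory (the same values used in the MLE computation below), with $\gamma=1$ and all initial state values 0:

```python
# Outcome-based TD(1) estimate of S2 from the 5th trajectory only.
gamma = 1.0
r2, r3, r5 = 2, 0, 10
v_SF = v_S2 = 0.0  # initial values are 0
delta = r2 + gamma * r3 + gamma**2 * r5 + gamma**3 * v_SF - v_S2
print(v_S2 + delta)  # 12.0, matching the hand computation
```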

[3] By using MLE

$V_T(S_2)$
$=r_2+\gamma P(S_3|S_2)\cdot V_T(S_3)$
$=r_2+\gamma P(S_3|S_2)\cdot[r_3+\gamma(P(S_4|S_3)\cdot V_T(S_4)+P(S_5|S_3)\cdot V_T(S_5))]$
$=r_2+\gamma P(S_3|S_2)\cdot[r_3+\gamma(P(S_4|S_3)\cdot(r_4+\gamma P(S_F|S_4)\cdot V_T(S_F))+P(S_5|S_3)\cdot(r_5+\gamma P(S_F|S_5)\cdot V_T(S_F)))]$

, and from the data, we have $\gamma=1$ and the transition probabilities:
$P(S_3|S_2)=1$
$P(S_4|S_3)=0.6$
$P(S_5|S_3)=0.4$
$P(S_F|S_4)=1$
$P(S_F|S_5)=1$

Therefore,
$V_T(S_2)$
$=2+1\cdot 1\cdot[0+1\cdot(0.6\cdot(1+1\cdot 1\cdot 0)+0.4\cdot(10+1\cdot 1\cdot 0))]$
$=6.6$
By the same approach, $V_T(S_1)=5.6$.

Caution must be taken that, when using MLE for $V_T(S_2)$, we choose to refer to the same episode T, not T-1. Why?
This is because we are doing the estimation directly from the data.
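
A short sketch reproducing the 6.6 figure from the probabilities and rewards above ($r_2=2$, $r_3=0$, $r_4=1$, $r_5=10$, $V_T(S_F)=0$); the variable names are illustrative.

```python
# MLE (model-based) backup for V_T(S2) using the estimated transition
# probabilities and the rewards read off the example.
gamma = 1.0
v_SF = 0.0
v_S4 = 1 + gamma * 1.0 * v_SF                  # r4 + gamma * P(SF|S4) * V_T(SF)
v_S5 = 10 + gamma * 1.0 * v_SF                 # r5 + gamma * P(SF|S5) * V_T(SF)
v_S3 = 0 + gamma * (0.6 * v_S4 + 0.4 * v_S5)   # r3 + gamma * (P(S4|S3)*V_T(S4) + P(S5|S3)*V_T(S5))
v_S2 = 2 + gamma * 1.0 * v_S3                  # r2 + gamma * P(S3|S2) * V_T(S3)
print(v_S2)                                    # 6.6
```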

[4] The breakdown

Please recall that in Temporal Difference Learning - Part 1, I figured out $V_T(S_1)=2.9$, $V_T(S_2)=3.9$, $V_T(S_3)=1.9$ by using both backup propagation and the expected discounted reward, in the section ReCap The Backup In Markov Chain.

The MLE estimate $V_T(S_2)=6.6$ is too high, whereas the TD(1) estimate of 12 is far higher still; both are wrong, except for $V_T(S_2)=3.9$ obtained by backup based on the whole model. Why is the MLE less wrong? Why is the TD(1) estimate so far off?
➀When we compute the TD(1) estimate, we use only the 5th trajectory, one of the five trajectories, to propagate the information back.
➁When we use MLE for the estimation, we use all of the data; more information tends to be more accurate. More might not always be better, but we can attribute the inaccuracy of the MLE value $V_T(S_2)=6.6$ to the fact that we don't have the complete data set needed to build the whole Markov chain model.

It should specifically be noted that TD(1) can also be treated as an outcome-based estimate, for it uses the immediate rewards and ignores the values of the intermediate encountered states.

[5] A tiny finding

We have a tiny finding: TD(0), which relates the current state's value to the value of the next closest state, behaves more like the MLE estimate, even if we use it within only one trajectory, the 5th in this illustration.

Temporal difference learning is about learning to predict state values as the states transit over time within one distinct trajectory, whereas MLE tends to estimate state values across all trajectories in the given sample.

The tension between TD(0), TD(1) and MLE lies in the fact that we don't have the full picture of the Markov chain model, only a small amount of sampled data.

The Outcome Based vs. The Intermediate Estimate Based::mjtsai1974

[1] Summary of these equations

I'd like to step further into the concepts behind TD(0) and TD(1). Given the 3 expressions below:
$V_T(S_{t-1})=V_{T-1}(S_{t-1})+\alpha_T(r_t+\gamma V_{T-1}(S_t)-V_{T-1}(S_{t-1}))$…[A]
$V_T(S_{t-1})=E_{S_t}[r_t+\gamma V_{T-1}(S_t)]$…[B]
$V_T(S_{t-1})=E[r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+\gamma^3 r_{t+3}+\cdots]$…[C]
, where [A] is the regular update expression of temporal difference learning and works for both TD(0) and TD(1); the difference is that under the TD(0) rule the eligibility of the evaluated state is reset to 0. Equation [B] is obtained by taking the expectation of [A]; more description is in the TD(0) related section above. [C] is the idea of taking only the reward sequence that we saw, ignoring the estimates we might have gotten in other states, which is the spirit of TD(1).

[2] Why toss out $\gamma^k V_{T-1}(S_{t+k})-V_{T-1}(S_{t-1})$?

Moreover, the full [C] expression should be refined as:
$V_T(S_{t-1})$
$=E[r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+\cdots+\gamma^{k-1}r_{t+k-1}+\gamma^k V_{T-1}(S_{t+k})-V_{T-1}(S_{t-1})]$…[D]
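
Read as a sample estimate rather than an expectation, [D] is a k-step return that still carries the two tail terms. Below is a minimal sketch; the function name and argument layout are assumptions for illustration.

```python
# Sample-based form of [D]: discounted sum of the k observed rewards, plus the
# tail terms gamma^k * V_{T-1}(S_{t+k}) and -V_{T-1}(S_{t-1}) before they are tossed out.
def k_step_return(rewards, s_end, s_start, v_prev, gamma=1.0):
    """rewards: [r_t, ..., r_{t+k-1}] observed along one trajectory."""
    k = len(rewards)
    g = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return g + (gamma ** k) * v_prev[s_end] - v_prev[s_start]
```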

The reason we can ignore the two terms $\gamma^k V_{T-1}(S_{t+k})$ and $V_{T-1}(S_{t-1})$ is that when $k$ is quite large, $\gamma^k$ approaches 0 (for $\gamma<1$), and the initial value of $V_{T-1}(S_{t-1})$ is 0 within one episode if $S_{t-1}$ is the target state to be evaluated, especially the very first time it is evaluated.

[3] Evaluation on [B] and [C]

Using [C] is just like evaluating $S_2$ from the 5th trajectory in the example illustrated above; however, when the trajectory data forms an infinite series, TD(1) also does the right thing: repeating that update over and over again won't change anything, because the expectation is the expectation expressed in terms of the rewards we saw.

Using [B] takes the intermediate estimates that we have computed and refined at all the intermediate nodes; that is, it takes all the states we encountered along the way into account to improve our estimate of the value of every other state.

Therefore, [B] is more self-consistent in connecting the value of a state to the values of the other states you want (or encountered), while [C] just uses the experience it saw and ignores the existence of the intermediate states.

[4] Cautions

All of the above holds under the condition that we have been given partial, incomplete data before knowing the full model of state transitions; even if we were given the complete data of the target model to be predicted, we would still act as if we don't have it yet!

[5] The appropriate application::mjtsai1974

Trivially, [D] relates the final state value $V_{T-1}(S_{t+k})$ to the target evaluated state $S_{t-1}$ in episode T, whose value is expressed in terms of $V_{T-1}(S_{t-1})$:
$V_T(S_{t-1})$
$=E[r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+\cdots+\gamma^{k-1}r_{t+k-1}+\gamma^k V_{T-1}(S_{t+k})-V_{T-1}(S_{t-1})]$

The point is:
➀How large should $k$ (the length of the trajectory) be for us to safely toss out these 2 terms?
➁Before $k$ is large enough, we should still be able to calculate (or incorporate) the newly arrived state's value, which makes it reasonable to relate the currently arrived state to the target evaluated state, and to repeat this behavior until $k$ is large enough.

[6] After k is large enough::mjtsai1974

Furthermore, after $k$ is large enough to ignore these 2 terms, the algorithm should be designed to go back and re-calculate the target state's value; the transitions must range from $S_{t-1}$ to $S_{t+k-1}$, thus moving a bit closer to a more mature condition for toggling a decision according to the newly evaluated target state's value.

Finally, when $k$ is large enough, it means that we have state transitions over a rather long horizon, and we are safe to just use the experience of the rewards we saw, since the update terms of the intermediate nodes would be cancelled out by the temporal difference equation (like [A] with $\lambda\neq 0$); this is a thought by mjtsai1974 and might be evaluated by a program in the future (to be continued).
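
The cancellation claimed here can be made explicit with a short derivation. This is a sketch assuming the eligibilities simply accumulate the discount (the λ=1 case), so that summing the per-step temporal differences telescopes:

$$
\begin{aligned}
&\sum_{i=0}^{k-1}\gamma^{i}\bigl(r_{t+i}+\gamma V_{T-1}(S_{t+i})-V_{T-1}(S_{t+i-1})\bigr)\\
&=\bigl(r_t+\gamma r_{t+1}+\cdots+\gamma^{k-1}r_{t+k-1}\bigr)+\gamma^{k}V_{T-1}(S_{t+k-1})-V_{T-1}(S_{t-1})
\end{aligned}
$$

Every intermediate $V_{T-1}(S_{t+i})$ appears once with a plus sign and once with a minus sign and drops out, leaving only the saw rewards and the two tail terms, which is exactly the structure of [D].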

Addendum

Temporal Difference Learning, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech (CS8803)