Partially Observable Markov Decision Process - Part 4
04 Dec 2020
Prologue To Partially Observable Markov Decision Process - Part 4
Recap POMDP Value Iteration
The basic idea in POMDP value iteration:
➀calculate the value function vectors for each action with regard to the observations made in horizon 1.
➁continue looking forward horizon 2,3,…
➂repeat such value iteration until convergence
The PWLC(Piecewise Linear Convex)
The value function of a POMDP can be represented as the max of linear segments - PWLC.
(1)convexity:
the state is known with certainty at the edges of the belief space
(2)linear segments:
➀each segment is a linear function of the form belief × reward
➁segments in horizon 1 are linear
➂segments in horizon n are linear combinations of horizon n−1's segments (a minimal code sketch follows this list)
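To make the PWLC picture concrete, here is a minimal Python sketch of my own (the α-vectors, the 2-state belief space, and all numbers are hypothetical, not taken from the course material): the value function is stored as a set of α-vectors and V(b) is evaluated as the max of the linear segments.

```python
import numpy as np

# Hypothetical alpha-vectors for a 2-state POMDP; each row is one linear segment alpha(s),
# e.g. the horizon-1 vectors of the form belief x reward, one per action.
alpha_vectors = np.array([
    [1.0, 0.0],   # action good when we believe we are in state s0
    [0.0, 1.0],   # action good when we believe we are in state s1
    [0.6, 0.6],   # a "safe" action
])

def value(b, vectors):
    """PWLC form: V(b) = max over alpha of alpha . b."""
    return np.max(vectors @ b)

# Sweep the belief simplex b = [p, 1-p]; the result is piecewise linear and convex,
# and the state is fully known at the edges p = 0 and p = 1.
for p in np.linspace(0.0, 1.0, 5):
    b = np.array([p, 1.0 - p])
    print(f"b = {b}, V(b) = {value(b, alpha_vectors):.2f}")
```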
Illustration: Why And How PWLC Works?
In this section, I am going to explain why PWLC takes effect, and guide you through the way the value iteration aggregates toward convergence.
The claim that the value function of a POMDP can be represented as the max of linear segments - PWLC - could be expressed as follows:
[Question]
$$V_t(b) = \max_{\alpha \in \tau_t} \alpha \cdot b = \max_{\alpha \in \tau_t} \sum_s \alpha(s) \cdot b(s)$$
Assume we have $\tau_{t-1}$ such that $V_{t-1}(b) = \max_{\alpha \in \tau_{t-1}} \alpha \cdot b$; you are asked to build $\tau_t$ such that $V_t(b) = \max_{\alpha \in \tau_t} \alpha \cdot b$.
[Answer]
proof::part-1
➀
$$V_t(b) = \max_{a \in A} V_t^a(b)$$
$$\Rightarrow V_t^a(b) = \sum_o V_t^{a,o}(b)$$
$$\Rightarrow V_t^{a,o}(b) = \frac{\sum_s R(s,a) \cdot b(s)}{|O|} + \gamma \cdot P(o|b,a) \cdot V_{t-1}(S.E(b,a,o))$$
, where we have
$$V_{t-1}(S.E(b,a,o)) = \max_{\alpha \in \tau_{t-1}} \sum_{s'} \alpha(s') \cdot \frac{P(o|s',a) \cdot \sum_s P(s'|s,a) \cdot b(s)}{P(o|b,a)}$$
➁Although $V_{t-1}(S.E(b,a,o))$ looks highly non-linear, its denominator $P(o|b,a)$ cancels against the $P(o|b,a)$ factor in $V_t^{a,o}(b)$ and could be safely eliminated and tossed out:
$$V_t^{a,o}(b) = \frac{\sum_s R(s,a) \cdot b(s)}{|O|} + \gamma \cdot \max_{\alpha \in \tau_{t-1}} \sum_{s'} \alpha(s') \cdot P(o|s',a) \cdot \sum_s P(s'|s,a) \cdot b(s)$$
For each fixed $\alpha$ this is linear in $b$, thus we are left with a linear transformation from $V_t^{a,o}(b)$ to its next $V_{t-1}(S.E(b,a,o))$, with $S.E(b,a,o) = b'$, which strongly supports the value function transformation in Illustration Of PWLC(Piecewise Linear Convex) in POMDP - Part 3.
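Before moving on to part-2, here is a minimal sketch of the state estimator $S.E(b,a,o)$ used above: a plain Bayes update of the belief. It assumes hypothetical NumPy arrays T[a][s][s'] = P(s'|s,a) and Z[a][s'][o] = P(o|s',a); the names T and Z are my own, not from the course material.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """S.E(b,a,o): b'(s') = P(o|s',a) * sum_s P(s'|s,a) * b(s) / P(o|b,a)."""
    # Unnormalized next belief over s'.
    b_next = Z[a][:, o] * (T[a].T @ b)
    # The normalizer is exactly P(o|b,a), the denominator discussed above.
    p_o_given_b_a = b_next.sum()
    return b_next / p_o_given_b_a, p_o_given_b_a
```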
proof::part-2
Continuing from part-1,
$$\tau_t = \bigcup_a \tau_t^a$$
$$\tau_t^a = \bigoplus_o \tau_t^{a,o} \Leftrightarrow V_t^a(b) = \sum_o V_t^{a,o}(b)$$
, where we have
➀finite state space, action space, observation space
➁$|A| \times |O|$ could be quite big, but it is finite.
➂$|\tau_t|$ could grow doubly exponentially with $t$, but it is finite; we have $|\tau_t| = |\tau_{t-1}|^{|O|} \times |A|$.
$\tau_t$ is derived from the $\alpha$-vectors of the optimal actions, each associated with its dedicated observation's vector in $\tau_{t-1}$.
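Here is a rough sketch of one such exact backup, under the same hypothetical T, Z array conventions as the belief-update sketch above, plus R[s][a] = R(s,a); it enumerates, for each action, every way of picking one $\alpha \in \tau_{t-1}$ per observation, so the returned set has exactly $|\tau_{t-1}|^{|O|} \times |A|$ vectors before any pruning.

```python
import itertools
import numpy as np

def exact_backup(tau_prev, R, T, Z, gamma):
    """Build tau_t from tau_{t-1}: tau_t = U_a (+)_o tau_t^{a,o}."""
    n_actions, n_states, _ = T.shape
    n_obs = Z.shape[2]
    tau_t = []
    for a in range(n_actions):
        # For each observation o and each alpha' in tau_{t-1}, the linear part of V_t^{a,o}:
        # alpha^{a,o}(s) = R(s,a)/|O| + gamma * sum_{s'} P(o|s',a) * P(s'|s,a) * alpha'(s')
        per_obs = [
            [R[:, a] / n_obs + gamma * T[a] @ (Z[a][:, o] * alpha_prev)
             for alpha_prev in tau_prev]
            for o in range(n_obs)
        ]
        # Cross-sum: choose one candidate vector per observation and add them up.
        for combo in itertools.product(*per_obs):
            tau_t.append(sum(combo))
    return tau_t  # |tau_t| = |tau_{t-1}|^|O| * |A|, hence the need for pruning
```

Real exact solvers prune dominated vectors after each backup; without pruning, the set size explodes as described in the next section.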
The POMDP Value Iteration Drawbacks
➀time complexity is exponential in actions and observations.
➁dimensionality of the belief space grows with the number of states.
➂exponential growth with the number of iterations.
Being hugely intractable to solve optimally, value iteration aims only for small POMDP problems (a tiny worked example of the growth follows below).
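To make the growth tangible, a tiny worked example with hypothetical numbers: take $|A| = 4$ actions, $|O| = 2$ observations, and $|\tau_1| = 4$; the recurrence $|\tau_t| = |\tau_{t-1}|^{|O|} \times |A|$ then gives:

```python
# Hypothetical sizes: |A| = 4 actions, |O| = 2 observations, |tau_1| = 4 vectors.
n_actions, n_obs, size = 4, 2, 4
for t in range(2, 5):
    size = n_actions * size ** n_obs
    print(f"|tau_{t}| = {size}")  # 64, then 16384, then 1073741824 (over a billion)
```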
Addendum
➀Partially Observable Markov Decision Process, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech(CS8803)
➁POMDP Value Iteration Example, Tony, Brown University’s Computer Science Department
➂POMDP Tutorial, Stefan Kopp, Bielefeld University
➃Partially Observable Markov Decision Process, Geoff Hollinger, 2007 fall
➄POMDPs, Rowan McAllister and Alexandre Navarro, MLG Reading Group, 02 June 2016, Cambridge University