
Partial Observable Markov Decision Process - Part 4

Prologue To Partial Observable Markov Decision Process - Part 4

This post explains why PWLC works and how it is aggregated in the value iteration of a POMDP (Partial Observable Markov Decision Process).

Recap POMDP Value Iteration

  • Basic idea in POMDP value iteration
    ➀calculate the value function vectors for each action with regard to the observation thus made in horizon 1.
    ➁continue looking forward to horizon 2, 3, …
    ➂repeat such value iteration until convergence

  • The PWLC (piecewise linear convex)
    The value function of POMDP can be represented as the max of linear segments - PWLC.
    (1)convexity:
    the state is known at the edges of the belief space
    (2)linear segments:
    ➀linear function in the format of belief × reward
    ➁segments in horizon 1 are linear
    ➂segments in horizon n are linear combinations of horizon n-1's segments

Illustration: Why And How PWLC Works?

In this section, I am going to explain why PWLC takes effect, and guide you through the way it is aggregated toward convergence.

The value function of POMDP can be represented as the max of linear segments - PWLC, which could be expressed as:
$V_t(b)$
$=\max_{\alpha\in\tau_t}\alpha\cdot b$
$=\max_{\alpha\in\tau_t}\sum_s\alpha(s)\cdot b(s)$
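
To make this expression concrete, below is a minimal Python sketch; the two states, the belief points and the three α-vectors in tau_t are made-up numbers for illustration only, not taken from this post. It evaluates $V_t(b)=\max_{\alpha\in\tau_t}\sum_s\alpha(s)\cdot b(s)$ over a sweep of beliefs:

```python
import numpy as np

# Hypothetical 2-state POMDP: a belief is b = (b(s1), b(s2)) with b(s2) = 1 - b(s1).
# tau_t below is a made-up set of alpha-vectors, one entry per state.
tau_t = np.array([
    [1.0, 0.0],   # alpha_1
    [0.6, 0.6],   # alpha_2
    [0.0, 1.2],   # alpha_3
])

def V(b, tau):
    """PWLC value: the max over alpha-vectors of the dot product alpha . b."""
    return np.max(tau @ b)

# Sweep the belief simplex: each alpha contributes one line in b(s1),
# and taking the max of those lines gives a piecewise linear, convex function.
for p in np.linspace(0.0, 1.0, 6):
    b = np.array([p, 1.0 - p])
    print(f"b(s1)={p:.1f}  V(b)={V(b, tau_t):.2f}")
```

At $b(s_1)=0$ or $b(s_1)=1$ the state is known for sure, which is exactly the convexity remark in the recap above.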

[Question]

Assume we have $\tau_{t-1}$ such that $V_{t-1}(b)=\max_{\alpha\in\tau_{t-1}}\alpha\cdot b$, you are asked to build $\tau_t$ such that $V_t(b)=\max_{\alpha\in\tau_t}\alpha\cdot b$.

[Answer]
  • proof::part-1
    $V_t(b)=\max_{a\in A}V_t^a(b)$
    $V_t^a(b)=\sum_o V_t^{a,o}(b)$
    $V_t^{a,o}(b)=\sum_s\frac{R(s,a)\cdot b(s)}{\vert O\vert}+\gamma\cdot P(o\vert b,a)\cdot V_{t-1}(S.E(b,a,o))$
    , where we have
    ➀$V_{t-1}(S.E(b,a,o))$
    $=\max_{\alpha\in\tau_{t-1}}\sum_{s'}\alpha(s')\cdot\frac{P(o\vert s',a)\cdot\sum_s P(s'\vert s,a)\cdot b(s)}{P(o\vert b,a)}$
    ➁although $V_{t-1}(S.E(b,a,o))$ is highly non-linear, its denominator $P(o\vert b,a)$ is cancelled out by the factor $P(o\vert b,a)$ multiplying it in $V_t^{a,o}(b)$, thus we are left with a linear transformation from $V_t^{a,o}(b)$ to its next $V_{t-1}(S.E(b,a,o))$, where $S.E(b,a,o)=b'$, which strongly supports the value function transformation in Illustration Of PWLC(Piecewise Linear Convex) In POMDP - Part 3; a small numeric sketch of this transform follows proof::part-2 below.

  • proof::part-2
    Continuing from part-1,
    $\tau_t=\bigcup_a\tau_t^a$
    $\tau_t^a=\oplus_o\tau_t^{a,o}$, the cross sum over observations
    $V_t^a(b)=\sum_o V_t^{a,o}(b)$
    , where we have
    ➀finite state space, action space, observation space
    ➁$\vert A\vert\times\vert O\vert$ could be quite big, but it is finite.
    ➂$\vert\tau_t\vert$ could grow doubly exponentially with $t$, but it is finite.
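
As promised in proof::part-1 ➁, here is a small numeric sketch of the transform behind that linearity. After the $P(o\vert b,a)$ cancellation, each $\alpha$ in $\tau_{t-1}$ induces, for a fixed pair $(a,o)$, a transformed vector $\alpha^{a,o}(s)=\frac{R(s,a)}{\vert O\vert}+\gamma\sum_{s'}P(s'\vert s,a)\cdot P(o\vert s',a)\cdot\alpha(s')$, so that $V_t^{a,o}(b)=\max_{\alpha\in\tau_{t-1}}\alpha^{a,o}\cdot b$ is again a max of functions linear in $b$. In the Python sketch below, the model sizes, the transition/observation probabilities and $\tau_{t-1}$ are all made-up assumptions:

```python
import numpy as np

# Made-up model sizes and probabilities, purely for illustration.
S, A, O = 2, 2, 2                                # |S|, |A|, |O|
gamma = 0.9
rng = np.random.default_rng(0)
R = rng.uniform(0.0, 1.0, size=(S, A))           # R(s, a)
T = rng.dirichlet(np.ones(S), size=(S, A))       # T[s, a, s'] = P(s'|s, a)
Z = rng.dirichlet(np.ones(O), size=(S, A))       # Z[s', a, o] = P(o|s', a)

def transform(alpha_prev, a, o):
    """Map one alpha-vector of tau_{t-1} to its transformed vector for a fixed (a, o):
    alpha^{a,o}(s) = R(s,a)/|O| + gamma * sum_{s'} P(s'|s,a) * P(o|s',a) * alpha_prev(s')."""
    return R[:, a] / O + gamma * (T[:, a, :] @ (Z[:, a, o] * alpha_prev))

# V_t^{a,o}(b) = max over alpha in tau_{t-1} of transform(alpha, a, o) . b:
# each candidate is linear in b, because P(o|b,a) has already cancelled out.
tau_prev = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # hypothetical tau_{t-1}
b = np.array([0.4, 0.6])
print(max(transform(alpha, a=0, o=1) @ b for alpha in tau_prev))
```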

We have $\vert\tau_t\vert=\vert\tau_{t-1}\vert^{\vert O\vert}\times\vert A\vert$

$\tau_t$ is derived from the vectors of optimal actions, associated with the dedicated observations in $\tau_{t-1}$.
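
Putting the two parts together, one exact backup step from $\tau_{t-1}$ to $\tau_t$ can be sketched as below; it repeats the same made-up model as the previous sketch so that it runs on its own. For each action it takes the cross sum over observations of the per-$(a,o)$ sets, then the union over actions; without pruning, the resulting count matches $\vert\tau_t\vert=\vert\tau_{t-1}\vert^{\vert O\vert}\times\vert A\vert$.

```python
import numpy as np
from itertools import product

# Same made-up model as in the previous sketch, repeated so this block runs on its own.
S, A, O, gamma = 2, 2, 2, 0.9
rng = np.random.default_rng(0)
R = rng.uniform(0.0, 1.0, size=(S, A))
T = rng.dirichlet(np.ones(S), size=(S, A))       # T[s, a, s'] = P(s'|s, a)
Z = rng.dirichlet(np.ones(O), size=(S, A))       # Z[s', a, o] = P(o|s', a)

def transform(alpha_prev, a, o):
    return R[:, a] / O + gamma * (T[:, a, :] @ (Z[:, a, o] * alpha_prev))

def backup(tau_prev):
    """One exact backup: tau_t = union over a of the cross sum over o of tau_t^{a,o}."""
    tau_t = []
    for a in range(A):
        # tau_t^{a,o}: one transformed vector per alpha in tau_{t-1}.
        per_obs = [[transform(alpha, a, o) for alpha in tau_prev] for o in range(O)]
        # Cross sum: pick one vector per observation and add them together.
        for choice in product(*per_obs):
            tau_t.append(sum(choice))
    return tau_t

tau_prev = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
tau_t = backup(tau_prev)
assert len(tau_t) == len(tau_prev) ** O * A      # |tau_t| = |tau_{t-1}|^{|O|} x |A|
print(len(tau_t))                                # 2^2 x 2 = 8 vectors before pruning
```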

The POMDP Value Iteration Drawbacks

➀time complexity is exponential in actions and observations.
➁dimensionality of the belief space grows with the number of states.
➂exponential growth with the number of iterations (a short numeric illustration follows below).

Being hugely intractable to solve for optimality, value iteration aims only at small POMDP problems.
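
For a feel of drawback ➂, the recurrence $\vert\tau_t\vert=\vert A\vert\times\vert\tau_{t-1}\vert^{\vert O\vert}$ blows up very quickly even for tiny, made-up sizes such as $\vert A\vert=4$ and $\vert O\vert=2$:

```python
# Growth of the un-pruned vector set, using made-up sizes |A| = 4, |O| = 2
# and |tau_1| = 4 purely for illustration.
A, O = 4, 2
size = 4                           # |tau_1|
for t in range(2, 6):
    size = A * size ** O           # |tau_t| = |A| * |tau_{t-1}|^{|O|}
    print(t, size)                 # 64, 16384, 1073741824, ...
```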
