Markov Decision Process Framework
05 Dec 2017
Conclusion
➀MDP is fully observable; for each distinct state $S$, there is a set of actions $A$ associated with it.
➁executing an action in the stochastic environment might not go as expected; the outcome is random. The state transition probability from state $S$ to $S'$ under action $A$ is $P(S'\left|S,A\right.)$.
➂there is an immediate reward $R(S)$ (or cost) for each state $S$.
➃the objective is to find a policy $\pi:S\rightarrow A$ that maximizes the expected discounted return (a short worked illustration follows after this list):
\(E\left[\sum_{t=0}^\infty\gamma^t\cdot R_t\left|S_0\right.\right]\)
➄by value iteration, the Bellman inequality becomes the Bellman equality once the value function converges (see the sketch after this list):
\(V(S)\leftarrow R(S)+\underset{A}{\max}\left[\gamma\cdot\sum_{S'}P(S'\left|S,A\right.)\cdot V(S')\right]\)
\(V(S)=R(S)+\underset{A}{\max}\left[\gamma\cdot\sum_{S'}P(S'\left|S,A\right.)\cdot V(S')\right]\)
➅after value iteration has converged, we can extract a policy by applying the argmax to the expected discounted future reward term in the value iteration expression (also shown in the sketch after this list):
\(\pi(S)=\underset{A}{\arg\max}\sum_{S'}P(S'\left|S,A\right.)\cdot V(S')\) The $A$ thus obtained is the optimal action that maximizes the expected accumulated and discounted future rewards of the state $S$.
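As a quick worked illustration of the objective in ➃, take a hypothetical discount factor $\gamma=0.9$ and a constant reward $R_t=1$ at every step (these numbers are assumptions for illustration, not part of the derivation above). The expected discounted return is then
\(\sum_{t=0}^\infty\gamma^t\cdot1=\frac1{1-\gamma}=\frac1{1-0.9}=10\)
so discounting with $\gamma<1$ keeps the infinite sum finite and weights near-term rewards more heavily.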
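The sketch below is a minimal NumPy implementation of the value iteration update in ➄ and the argmax policy extraction in ➅. It assumes a hypothetical 3-state, 2-action MDP; the transition tensor `P`, reward vector `R`, and `gamma` are illustrative assumptions, not values from this post.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; P, R, and gamma are illustrative
# assumptions, not values taken from the post.
gamma = 0.9

# P[a, s, s'] = P(S'|S, A); each row over s' sums to 1.
P = np.array([
    [[0.8, 0.2, 0.0],   # action 0
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0],   # action 1
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]],
])

# R[s]: immediate reward of each state S.
R = np.array([0.0, 0.0, 1.0])

# Value iteration: repeat V(S) <- R(S) + max_A [gamma * sum_S' P(S'|S,A) V(S')]
# until the update stops changing V, i.e. the Bellman equality holds.
V = np.zeros(len(R))
while True:
    Q = gamma * (P @ V)          # Q[a, s] = gamma * sum_S' P(S'|S,A) * V(S')
    V_new = R + Q.max(axis=0)    # maximize the discounted future term over A
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

# Policy extraction: pi(S) = argmax_A sum_S' P(S'|S,A) * V(S')
pi = (P @ V).argmax(axis=0)

print("V* =", V)
print("pi* =", pi)
```

In this toy setup the update converges quickly because $\gamma<1$ makes the Bellman operator a contraction; the printed `pi` gives, for each state, the action index that maximizes the expected discounted future reward.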