Markov Decision Process Framework
05 Dec 2017
Conclusion
➀MDP is fully observable; for each distinct state $S$, there is a set of actions $A$ associated with it.
➁executing an action in the stochastic environment might not go as expected; the outcome is random. The state transition probability from state $S$ to $S'$ under action $A$ is $P(S'\left|S,A\right.)$.
➂there is an immediate reward $R(S)$ (or cost) for each state $S$.
➃the objective is to find a policy $\pi:S\rightarrow A$ that maximizes the expected discounted return (a short worked illustration follows after this list):
\(E\left[\sum_{t=0}^\infty\gamma^t\cdot R_t\left|S_0\right.\right]\)
➄by value iteration, the Bellman inequality becomes the Bellman equality once the value function converges (see the sketch after this list):
\(V(S)\leftarrow R(S)+\underset{A}{\max}\left[\gamma\cdot\sum_{S'}P(S'\left|S,A\right.)\cdot V(S')\right]\)
\(V(S)=R(S)+\underset{A}{\max}\left[\gamma\cdot\sum_{S'}P(S'\left|S,A\right.)\cdot V(S')\right]\)
➅after value iteration has converged, we can extract a policy by applying the argmax to the expected discounted future reward term in the value iteration expression (also shown in the sketch after this list):
\(\pi(S)=\underset{A}{\arg\max}\sum_{S'}P(S'\left|S,A\right.)\cdot V(S')\) The $A$ thus obtained is the optimal action that maximizes the expected accumulated and discounted future rewards of the state $S$.
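As a quick worked illustration of the objective in ➃, take a hypothetical discount factor $\gamma=0.9$ and a constant reward $R_t=1$ at every step (these numbers are assumptions for illustration, not part of the derivation above). The expected discounted return is then
\(\sum_{t=0}^\infty\gamma^t\cdot1=\frac1{1-\gamma}=\frac1{1-0.9}=10\)
so discounting with $\gamma<1$ keeps the infinite sum finite and weights near-term rewards more heavily.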
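The sketch below is a minimal NumPy implementation of the value iteration update in ➄ and the argmax policy extraction in ➅. It assumes a hypothetical 3-state, 2-action MDP; the transition tensor `P`, reward vector `R`, and `gamma` are illustrative assumptions, not values from this post.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; P, R, and gamma are illustrative
# assumptions, not values taken from the post.
gamma = 0.9

# P[a, s, s'] = P(S'|S, A); each row over s' sums to 1.
P = np.array([
    [[0.8, 0.2, 0.0],   # action 0
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0],   # action 1
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]],
])

# R[s]: immediate reward of each state S.
R = np.array([0.0, 0.0, 1.0])

# Value iteration: repeat V(S) <- R(S) + max_A [gamma * sum_S' P(S'|S,A) V(S')]
# until the update stops changing V, i.e. the Bellman equality holds.
V = np.zeros(len(R))
while True:
    Q = gamma * (P @ V)          # Q[a, s] = gamma * sum_S' P(S'|S,A) * V(S')
    V_new = R + Q.max(axis=0)    # maximize the discounted future term over A
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

# Policy extraction: pi(S) = argmax_A sum_S' P(S'|S,A) * V(S')
pi = (P @ V).argmax(axis=0)

print("V* =", V)
print("pi* =", pi)
```

In this toy setup the update converges quickly because $\gamma<1$ makes the Bellman operator a contraction; the printed `pi` gives, for each state, the action index that maximizes the expected discounted future reward.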