Linear Programming In Value Iteration
10 May 2019
Prologue To Linear Programming In Value Iteration
$\frac {1}{1-\gamma}$ Isn't Polynomial $\approx$ Isn't Proportional
Value iteration identifies an optimal policy in time polynomial in $\frac {1}{1-\gamma}$, and since $\lim_{\gamma\rightarrow 1}\frac {1}{1-\gamma}$=$\infty$, that bound just explodes as $\gamma$ approaches $1$. That's why we need linear programming to solve MDPs in a reasonable amount of time.
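To see where that $\frac {1}{1-\gamma}$ dependence comes from, here is a minimal value-iteration sketch on a hypothetical tabular MDP (the arrays R[s, a], P[s, a, s'] and the function name are illustrative assumptions, not something given in this post); the number of sweeps needed to reach a fixed accuracy grows roughly like $\frac {1}{1-\gamma}$.

```python
import numpy as np

# Hypothetical tabular MDP: R[s, a] is the reward, P[s, a, s'] the transition probability.
def value_iteration(R, P, gamma, eps=1e-6):
    nS, nA = R.shape
    V = np.zeros(nS)
    sweeps = 0
    while True:
        # Bellman backup: Q(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s')
        Q = R + gamma * (P @ V)        # P @ V contracts the last axis -> shape (nS, nA)
        V_new = Q.max(axis=1)
        sweeps += 1
        if np.max(np.abs(V_new - V)) < eps:
            # Return the value function, the greedy policy, and the sweep count;
            # the sweep count is what blows up as gamma -> 1.
            return V_new, Q.argmax(axis=1), sweeps
        V = V_new
```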
What’s Linear Programming?
[1] Encode MDP solution as a linear program
In my prior post Lagrange Multiplier, you can find some similarity. In this post, we'd like to dive to a deeper level and solve MDPs by means of linear programming:
[2] How to solve an MDP?
➀it's an optimization framework, in which you specify linear constraints and a linear objective function.
➁as long as the number of variables and constraints is polynomial, we can get a solution in polynomial time.
➂we need to encode the MDP solution as a linear program, so that we have linear constraints and a linear objective function.
For the time being, in this series of RL posts, we just need to solve the Bellman equation:
[3] Do we have a way to solve [A]?
$\forall S, V(S)$=$max_{A}(R(S,A)+\gamma\sum_{S'}P(S'\vert S,A)\cdot V(S'))$…[A]
➀for each state $S$, we have a variable $V(S)$ and relate each distinct $V(S)$ to its next $V(S')$.
➁we thus have ➀ as a set of constraints. If we could solve this set of constraints, that would be a good departure point. Unless the $max_{A}$ becomes linear, the answer is no.
The max over actions is not linear; it isn't directly translatable to a set of linear equations and a linear objective function. At the very beginning, we are given a set of non-linear equations.
We should express the max over actions in terms of a set of linear constraints and a linear objective function.
[4] Example of $max(-3,7,2,5)=X$
Given this example of max, below is a set of inequality constraints:
➀$X\geq -3$
➁$X\geq 7$
➂$X\geq 2$
➃$X\geq 5$
Here we ponder: is any solution to this set of inequality constraints, say $7$, exactly the max? The answer is no. Why? Because $9$, $10$, and so on are also greater than or equal to all of these things (the set of constraints).
What we want is to pick the smallest one among the numbers satisfying the above set of inequalities; in this example, you can't get any such number smaller than $7$, that is, $min\;X$=$7$.
Is $min\;X$ a linear operator? No, it is just a linear objective function!!! Next, we are going to use this idea to build a linear objective function for the Bellman equation in a very similar way.
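As a quick sanity check, this max-as-LP trick can be handed to any off-the-shelf LP solver. Below is a minimal sketch using scipy.optimize.linprog (the choice of solver is my assumption, not part of this post):

```python
from scipy.optimize import linprog

# Encode max(-3, 7, 2, 5) as:  minimize X  subject to  X >= -3, X >= 7, X >= 2, X >= 5.
values = [-3, 7, 2, 5]

# linprog minimizes c @ x subject to A_ub @ x <= b_ub, so rewrite each X >= v as -X <= -v.
c = [1.0]                          # objective: minimize X
A_ub = [[-1.0] for _ in values]    # one row (-X) per constraint
b_ub = [-v for v in values]        # ... <= -v
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)])

print(res.x[0])                    # 7.0: the smallest X satisfying all the constraints
```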
The Linear Programming: Primal
[1] Refine the Bellman equation
Following the above idea of picking the smallest value among all candidates for the maximum, we should refine our Bellman equation [A] as below:
[2] What and how do we minimize here?
$\forall S,A\;V(S)\geq R(S,A)+\gamma\sum_{S'}P(S'\vert S,A)\cdot V(S')$…[B]
➀for all states and actions, the value of a state is greater than or equal to the right-hand side of the original expression; we call this new inequality expression [B].
➁the whole right-hand side of [B] is just the Q-value.
Caution must be taken that we are given a set of sampled data from an MDP model:
➀since inequality [B] is stated for all states and actions, what we want to minimize is not the value of a single state.
➁inequality [B] ranges over all states, so what we should minimize is a vector.
➂a single $V(S)$ on its own is unbounded; we should evaluate all states as a whole in one minimization.
Due to the above concerns in ➀, ➁, ➂, it turns out that minimizing the sum over all states makes it work:
$min\;\sum_{S}V(S)$…[C]
So, $min\;\sum_{S}V(S)$ operates on all the individual $V(S)$ and makes each of them as small as it can be, so that each one actually equals the max over actions for that $S$.
Besides, the minimizing solution isn't a loose upper bound on the max: if any distinct $V(S)$ were a strict upper bound, then you wouldn't have the minimum sum, because you could always move it down a little bit.
[3] The primal
Putting [B] and [C] together actually gives a linear program; some textbooks name it the primal:
$\;\;\forall S,A\;V(S)\geq R(S,A)+\gamma\sum_{S'}P(S'\vert S,A)\cdot V(S')$
$\;\;\;\;min\;\sum_{S}V(S)$
You can regard [C] as the linear objective function. It is believed that the solution of this linear program is equivalent to the solution of the MDP. We can just write down this linear program, hand it to a linear program solver that runs in polynomial time, and finally get $V(S)$ for every state $S$.
How do we get our policy from that? At each distinct state, we just choose the action that returns the best (maximum) value, i.e., the greedy policy with respect to that optimal value function.
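To make this concrete, here is a minimal sketch of the primal under the same hypothetical tabular inputs R[s, a] and P[s, a, s'] used in the value-iteration sketch above, again leaning on scipy.optimize.linprog as an assumed solver; the last lines read off the greedy policy just described.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_primal(R, P, gamma):
    nS, nA = R.shape
    # Objective [C]: minimize sum_S V(S)  ->  cost vector of ones over the V(S) variables.
    c = np.ones(nS)
    # Constraints [B]: V(s) >= R(s,a) + gamma * sum_{s'} P(s'|s,a) * V(s'), for all s, a.
    # In linprog's "A_ub @ x <= b_ub" form: gamma * P[s,a,:] @ V - V(s) <= -R(s,a).
    A_ub, b_ub = [], []
    for s in range(nS):
        for a in range(nA):
            row = gamma * P[s, a, :].copy()
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-R[s, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * nS)       # V(S) is a free variable
    V = res.x
    # Greedy policy: at each state, pick the action with the best Q-value.
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    return V, Q.argmax(axis=1)
```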
Linear Programming Duality
[1] A mechanical process
One more thing about linear programming is switching from the primal to what's called the dual. It has the nice property that by:
➀changing constraints into variables
➁changing variables into constraints
you can get a new linear program that is equivalent to the original one; this is called linear programming duality. Sometimes it is useful when solving an MDP to put bounds and constraints on the solutions.
We treat producing the dual of a linear program as just a mechanical process.
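For reference, a minimal sketch of that mechanical recipe in its standard textbook form, for the case where the primal variables are free (which matches our $V(S)$); here $x$ plays the role of $V(S)$, $c$ is the all-ones vector from [C], each row of $Ax\geq b$ is one instance of [B], and each dual variable $y$ will become a $q_{S,A}$:

$\;\;$primal: $min_{x}\;c^{T}x$ such that $Ax\geq b$, with $x$ free

$\;\;$dual: $max_{y}\;b^{T}y$ such that $A^{T}y$=$c$ and $y\geq 0$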
[2] The dual
$max_{q_{S,A}}\sum_{S}\sum_{A}q_{S,A}\cdot R(S,A)$…[D]
$\;\;$such that $1$+$\gamma\cdot\sum_{S}\sum_{A}q_{S,A}\cdot P(S'\vert S,A)$=$\sum_{A'}q_{S',A'}$, $\forall S'$…[E]
$\;\;\;\forall S,A\;q_{S,A}\geq 0$…[F]
➀[D] is the linear objective function.
➁[E] and [F] are the constraints.
We perform a series of mechanical steps in which each constraint of the primal becomes a variable in the dual, each variable of the primal becomes a constraint in the dual, minimization becomes maximization, and the constraint bounds become the objective function.
In more detail, the $V(S)$ and $R(S,A)$ terms in [B], the constraints of the primal, have been turned into the linear objective function [D] of the dual, with $V(S)$ replaced by $q_{S,A}$, the newly introduced concept of policy flow (agent flow).
[3] Newly introduced policy flow
In this dual, the optimization is to maximize the sum over all state-action pairs of the reward for that state-action pair times the value of $q_{S,A}$, which is the newly introduced policy flow; it's not the same as the Q-value.
The policy flow $q_{S,A}$ could be regarded as how much agent-ness is flowing through each state-action pair. An agent following the policy is going to spend some time running in that MDP environment, passing through each state-action pair. When it transits from the current state-action pair to the next one, the flow is discounted by $\gamma$.
The concept is that each time the policy flow passes through a state-action pair, we get the reward associated with that state-action pair. What we'd like to do is maximize the total reward under this concept by means of $max_{q_{S,A}}\sum_{S}\sum_{A}q_{S,A}\cdot R(S,A)$.
[4] Min in primal vs max in dual
It's more intuitive that we do the maximizing job in the dual, compared with the minimizing we do in the primal. What we don't want in the primal is a loose upper bound; that's why we choose to minimize the sum of all states' values.
We choose [D], $max_{q_{S,A}}\sum_{S}\sum_{A}q_{S,A}\cdot R(S,A)$, in the dual to maximize our reward each time the agent flow passes through a state-action pair. How do we guarantee that the flow through each or some state-action pair doesn't just grow without bound? This is a good question.
It's going to be subject to the following idea, or simply the constraint.
[5] $1$+$\gamma\cdot\sum_{S}\sum_{A}q_{S,A}\cdot P(S'\vert S,A)$=$\sum_{A'}q_{S',A'}$, $\forall S'$
The answer is that we are constraining the maximization in [D]; that's a constraint, and for each state, in this case, it would be better if we think of it as a next state.
For each possible next state $S'$, our design expects the amount of policy flow through that next state $S'$, summed over the actions $A'$ going out of it, to be equal to the number of times the next state $S'$ is visited, which we can get by summing over all states $S$ we might have started from and actions $A$ we might have taken from $S$:
➀the policy flow through that state-action pair times the transition probability that would send us from $S$ to $S'$ by action $A$.
➁we include the $1$, because we can also start at any given state-action pair.
So, there is some policy flow we're injecting into each state-action pair of the system, and we are going to add to this same state-action pair any other policy flow that comes through it, in turn, from other parts of the system.
[6] It is discounted by $\gamma$
It's like we are dumping policy flow onto the MDP, then letting the MDP pump around what's left. That flow is discounted by $\gamma$. We ultimately get the maximum possible reward from $max_{q_{S,A}}\sum_{S}\sum_{A}q_{S,A}\cdot R(S,A)$ if we just follow the greedy policy at each state we have visited.
[7] The additional constraint: $\forall S,A\;q_{S,A}\geq 0$
This additional constraint just guarantees 2 things:
➀there is no negative policy flow.
➁the policy flow in one place won't be overestimated, so that it won't then have to be drained out in some other place to balance it.
This is in effect a flow balance: the amount of flow going into $S'$ has to be equal to the amount of flow coming out of it.
Subject to this constraint, $1$+$\gamma\cdot\sum_{S}\sum_{A}q_{S,A}\cdot P(S'\vert S,A)$=$\sum_{A'}q_{S',A'}$, $\forall S'$, we maximize the reward by $max_{q_{S,A}}\sum_{S}\sum_{A}q_{S,A}\cdot R(S,A)$.
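To close the loop, here is a minimal sketch of this dual under the same hypothetical R[s, a] and P[s, a, s'] arrays as the earlier sketches, again assuming scipy.optimize.linprog as the solver; the greedy policy can then be read off from where the recovered flow $q_{S,A}$ concentrates.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_dual(R, P, gamma):
    nS, nA = R.shape
    # Objective [D]: maximize sum_{S,A} q_{S,A} * R(S,A)  ->  minimize its negation.
    c = -R.reshape(nS * nA)
    # Flow-balance constraints [E], one per next state S':
    #   sum_{A'} q_{S',A'} - gamma * sum_{S,A} q_{S,A} * P(S'|S,A) = 1
    A_eq = np.zeros((nS, nS * nA))
    for s_next in range(nS):
        for s in range(nS):
            for a in range(nA):
                A_eq[s_next, s * nA + a] -= gamma * P[s, a, s_next]
        for a_next in range(nA):
            A_eq[s_next, s_next * nA + a_next] += 1.0
    b_eq = np.ones(nS)
    # Constraint [F]: q_{S,A} >= 0.
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (nS * nA))
    q = res.x.reshape(nS, nA)
    # An optimal flow concentrates on the greedy action at each state; read the policy off q.
    return q, q.argmax(axis=1)
```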