SMO Framework And Algorithm
13 Dec 2017
SMO Framework
Let me briefly summarize the major points, from the initial idea through the derivation of the objective function.
[1]Suppose we’d like to find a hyperplane with a safeguard (margin) that classifies the given data samples.
➀we formulate our problem as:
$\underset w{min}\frac12w^t\cdot w$, subject to $y_i\cdot(w^t\cdot x_i-b)\geq1,\forall i$.
The first part is the objective, the second part is the constraint.
➁then, introduce $\alpha_1,\alpha_2,\dots,\alpha_n$ as the Lagrange multipliers and form the Lagrangian below as our objective function:
\(L(w,b,\alpha)=\frac12w^t\cdot w-\sum_{i=1}^n\alpha_i\cdot(y_i\cdot(w^t\cdot x_i-b)-1)\)
➂next, to get the optimal $\alpha$, we take the partial derivatives of $L$ with respect to $w$ and $b$ and set them to zero. Finally, we get:
\(\begin{array}{l}L(w,b,\alpha)\\=-\frac12\sum_{i,j=1}^n\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot(x_i^t\cdot x_j)\\+\sum_{i=1}^n\alpha_i\end{array}\)
where $w=\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i$. This design guarantees $\alpha_i\geq0$ for all $i$, a basic condition that must be satisfied in SMO.
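As a small numeric sanity check, the dual objective and the expression for $w$ above translate directly to code; a minimal sketch with a linear kernel (the function names are my own, not from any library):

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Dual objective: sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    ay = alpha * y          # element-wise alpha_i * y_i
    K = X @ X.T             # linear-kernel Gram matrix
    return alpha.sum() - 0.5 * ay @ K @ ay

def weight_vector(alpha, X, y):
    """w = sum_i alpha_i * y_i * x_i."""
    return (alpha * y) @ X
```

For example, with $X=[[1,0],[0,1]]$, $y=[1,-1]$, $\alpha=[1,1]$, the objective evaluates to $1$ and $w=[1,-1]$.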
[2]Next, consider imperfect separation with noise. There are cases where we don’t strictly enforce that no data points lie between $H_1$ and $H_2$. We can extend the SVM to allow some noisy data points inside the safeguard zone; thus, we want to penalize the data points that cross the boundaries ($H_1$, $H_2$).
➀we formulate our problem as:
$\underset{w,\xi_i}{min}\left[\frac12w^t\cdot w+C\cdot\sum_{i=1}^n\xi_i\right]$, subject to $y_i\cdot(w^t\cdot x_i-b)+\xi_i-1\geq0$ and $\xi_i\geq0$, $\forall i$.
Such design shrinks the distance between $H_1$ and $H_2$, allowing some noise within the original margin.
➁next, build the Lagrangian by introducing $\alpha_1$, $\alpha_2$,…, $\alpha_n$ and $\mu_1$, $\mu_2$,…, $\mu_n$, then:
\(\begin{array}{l}L(w,b,\xi,\alpha,\mu)\\=\frac12\cdot w^t\cdot w+C\cdot\sum_{i=1}^n\xi_i\\\;\;\;\;-\sum_{i=1}^n\alpha_i\cdot(y_i\cdot(w^t\cdot x_i-b)+\xi_i-1)\\\;\;\;\;-\sum_{i=1}^n\mu_i\cdot\xi_i\end{array}\)
The Lagrangian expression turns out to be:
\(\begin{array}{l}L(w,b,\xi,\alpha,\mu)\\=\frac12\cdot w^t\cdot w+\sum_{i=1}^n(C-\alpha_i-\mu_i)\xi_i\\\;\;\;\;-\sum_{i=1}^n(\alpha_i\cdot y_i\cdot x_i^t)\cdot w\\\;\;\;\;+\sum_{i=1}^n\alpha_i\cdot y_i\cdot b\\\;\;\;\;+\sum_{i=1}^n\alpha_i\end{array}\)
➂at the optimum of $L$ with respect to $w$, $b$ and $\xi$, the conditions below must be satisfied for all $i$:
$\frac{\partial L}{\partial w}=0$, $\frac{\partial L}{\partial b}=0$, $\frac{\partial L}{\partial \xi}=0$.
We have $w=\sum_{i=1}^n(\alpha_i\cdot y_i\cdot x_i)$, $\sum_{i=1}^n\alpha_i\cdot y_i=0$,
and $C-\alpha_i-\mu_i=0$ holds for each $i$, from $\frac{\partial L}{\partial\xi_i}=0$.
➃maximizing $L(w,b,\xi,\alpha,\mu)$ over $\alpha$ for the optimal $\alpha_i$ values, we formulate the dual as:
\(\begin{array}{l}\underset\alpha{max}\,L(\alpha)\\=\sum_{i=1}^n\alpha_i-\frac12\cdot\sum_{i,j=1}^n\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j\end{array}\)
, subject to $\sum_{i=1}^n\alpha_i\cdot y_i=0$ and $0\leq\alpha_i\leq C$ for all $i$
➄Note that $\alpha_i\geq0$ and $\mu_i\geq0$; together with $C-\alpha_i-\mu_i=0$, we therefore have $0\leq\alpha_i\leq C$,
$\alpha_i$ is now upper bounded by $C$.
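The dual constraints ($\sum_{i=1}^n\alpha_i\cdot y_i=0$ and $0\leq\alpha_i\leq C$) are easy to verify programmatically; a minimal sketch (the function name and tolerance are my own choices):

```python
import numpy as np

def is_dual_feasible(alpha, y, C, tol=1e-8):
    """Check the dual constraints: sum_i alpha_i * y_i == 0
    and 0 <= alpha_i <= C for all i, within a small tolerance."""
    return (abs(np.dot(alpha, y)) <= tol
            and np.all(alpha >= -tol)
            and np.all(alpha <= C + tol))
```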
[3]Then, we introduce the KKT conditions, where $R_i=y_i\cdot(w^t\cdot x_i-b)-1$ denotes the KKT residual of example $i$:
➀$\alpha_i=0$, $R_i\geq0$
➁$0<\alpha_i<C$, $R_i\approx0$
➂$\alpha_i=C$, $R_i\leq0$
The KKT cases above are evaluated under $0\leq\alpha_i\leq C$ together with the constraints below:
➀$\frac{\partial L}{\partial \xi}=C-\alpha_i-\mu_i=0$
➁$\alpha_i\cdot(y_i\cdot(w^t\cdot x_i-b)+\xi_i-1)=0$
Also, we give the KKT-violating cases:
➀$\alpha_i<C$ and $R_i<0$
➁$\alpha_i>0$ and $R_i>0$
➂$\alpha_i=0$ and $R_i<0$, a special case of ➀
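The first two violating cases can be folded into a single predicate, evaluated with a small numerical tolerance; a sketch (the name and tolerance value are my own):

```python
def violates_kkt(alpha_i, R_i, C, tol=1e-3):
    """True if example i violates the KKT conditions:
    alpha_i < C requires R_i >= 0, and alpha_i > 0 requires R_i <= 0,
    both checked within the tolerance tol."""
    return (alpha_i < C and R_i < -tol) or (alpha_i > 0 and R_i > tol)
```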
[4]Departure from noise
So far, we have our objective function as:
\(\begin{array}{l}L(w,b,\xi,\alpha,\mu)\\=\sum_{i=1}^n\alpha_i-\frac12\cdot\sum_{i,j=1}^n\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j\end{array}\)
for all i, $0<\alpha_i<C$, the non-boundary case.
[5]Works on 2 $\alpha$’s at a time
The SMO algorithm works by manipulating 2 $\alpha$'s at a time (with the others fixed), an approach a little like hill climbing. The 2 $\alpha$’s at each step are chosen by heuristics.
To optimize $\alpha_1$, $\alpha_2$ with other $\alpha$'s fixed, the objective function could be rewritten as:
\(\begin{array}{l}L(w,b,\xi,\alpha,\mu)\\=\alpha_1+\alpha_2+const\\-\frac12\cdot(part\;1+part\;2)\end{array}\)
All deduction is based on the equality below:
$\alpha_1+S\cdot\alpha_2=const$, i.e. $\alpha_1=const-S\cdot\alpha_2$,
where $S=y_1\cdot y_2$, the constant is $r=\alpha_1^{old}+S\cdot\alpha_2^{old}$, and thus $\alpha_1+S\cdot\alpha_2=\alpha_1^{old}+S\cdot\alpha_2^{old}$.
[6]Express $\alpha_1$ in terms of $\alpha_2$
The major purpose is to express the objective function in terms of the single variable $\alpha_2^{new}$; this is quite a painful process.
We’d like to toss out almost everything unrelated to $\alpha_2^{new}$ and put it in the $const$ term. The objective function becomes:
\(\begin{array}{l}L(w,b,\xi,\alpha,\mu)\\=\frac12\cdot(2K_{12}-K_{11}-K_{22})\cdot(\alpha_2^{new})^2\\\;\;\;\;+(1-S+S\cdot K_{11}\cdot r-S\cdot K_{12}\cdot r\\\;\;\;\;+y_2\cdot V_1-y_2\cdot V_2)\cdot\alpha_2^{new}\end{array}\)
[7]Transform from $L(w,b,\xi,\alpha,\mu)$ to $L(\alpha_2^{new})$
The optimization is now much simpler: only $\alpha_2^{new}$ is left to be optimized.
The major spirit of SMO is to optimize 2 $\alpha$’s at a time, transiting from the old $\alpha$’s to the new $\alpha$’s; more precisely, from $\alpha_1^{old}$ to $\alpha_1^{new}$, and from $\alpha_2^{old}$ to $\alpha_2^{new}$. First, we focus on $\alpha_2^{new}$, and later get $\alpha_1^{new}$ once we have $\alpha_2^{new}$.
➀relate $\alpha_2^{new}$ back to $\alpha_2^{old}$
➁introduce $\eta$ by taking $\eta=2K_{12}-K_{11}-K_{22}$; validity requires $\eta\leq0$.
Now, we formulate our problem as an objective function of $\eta$, $\alpha_2^{new}$, $\alpha_2^{old}$, $E_1^{old}$, $E_2^{old}$:
\(\begin{array}{l}L(w,b,\xi,\alpha,\mu)\\=L(\alpha_2^{new})\\=\frac12\cdot\eta\cdot(\alpha_2^{new})^2\\\;\;\;\;+(y_2\cdot(E_1^{old}-E_2^{old})-\eta\cdot\alpha_2^{old})\cdot\alpha_2^{new}\\\;\;\;\;+const\end{array}\)
[8]Examine the feasible range of the new $\alpha$ value
For the computation, $\eta$ should be less than $0$, although it might approach $0$. We express the new $\alpha_2$ below:
\(\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta\)
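The update above maps directly to code; a sketch (names are mine) that also guards against the degenerate $\eta\geq0$ case mentioned later:

```python
def alpha2_step(alpha2_old, y2, E1_old, E2_old, K11, K12, K22):
    """Unclipped update: alpha2_old + y2 * (E2 - E1) / eta,
    with eta = 2*K12 - K11 - K22 (eta <= 0 for a valid kernel matrix)."""
    eta = 2.0 * K12 - K11 - K22
    if eta >= 0.0:
        return None    # degenerate pair (e.g. x_i == x_j); caller should skip it
    return alpha2_old + y2 * (E2_old - E1_old) / eta
```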
Then, the $\alpha_2^{new}$ thus obtained might be the unconstrained maximum. The feasible range of $\alpha_2^{new}$ must be checked.
Take the minimum feasible of $\alpha_2^{new}$ be L, the maximum feasible of $\alpha_2^{new}$ be H, then:
\(\alpha_2^{new,clipped}=\left\{\begin{array}{l}L,\;if\;\alpha_2^{new}<L\\\alpha_2^{new},\;if\;L\leq\alpha_2^{new}\leq H\\H,\;if\;\alpha_2^{new}>H\end{array}\right.\)
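The clipping rule is a plain three-way comparison; in the sketch below (names are mine), the feasible bounds $L$ and $H$ are taken as given inputs:

```python
def clip_alpha2(a2_new, L, H):
    """Clip the unconstrained optimum into the feasible segment [L, H]."""
    if a2_new < L:
        return L
    if a2_new > H:
        return H
    return a2_new
```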
We can finally clip the new $\alpha_1$, $\alpha_2$ values, with sequential updating of $b$ and $w$. Take $\triangle\alpha_2=\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta$.
Recall that $\alpha_1=r-S\cdot\alpha_2$, which holds for both the new and the old values. Therefore,
\(\triangle\alpha_1=\alpha_1^{new}-\alpha_1^{old}=(r-S\cdot\alpha_2^{new})-(r-S\cdot\alpha_2^{old})\)
\(\Rightarrow\triangle\alpha_1=-S\cdot(\alpha_2^{new}-\alpha_2^{old})\)
\(\Rightarrow\triangle\alpha_1=-S\cdot\triangle\alpha_2\)
, where $\triangle\alpha_1$, $\triangle\alpha_2$ are the changes in $\alpha_1$, $\alpha_2$. This is quite a tedious, complicated process, but a beautiful framework!!
SMO Algorithm
We have all been given the concept that SMO works by picking 2 $\alpha$’s at a time; the way the pair is picked is called the SMO heuristics. A lot of variety can be found among implementations and even algorithm flows. In this section, I’d like to review the algorithm in SMO’s original design. As usual, you can find one outer loop and one inner loop in the whole flow.
[First heuristic]
It is the outer loop for $\alpha_1$.
➀iterate over the entire training set; if an example is found to violate the KKT conditions, it is eligible for optimization.
➁search the entire training set for non-boundary examples, those with $0<\alpha<C$ (i.e. $\alpha\neq0$ and $\alpha\neq C$); if such an example is found to violate the KKT conditions, it is eligible for optimization.
➂keep searching non-boundary examples, until all non-boundary examples obey the KKT conditions within $\varepsilon=0.001$, where the non-boundary case guarantees:
$\left|(w^{new})^t\cdot x_i-y_i\right|\leq\varepsilon$, for $i=1,2$, and $\varepsilon$ is a rather tiny quantity.
By ➀, ➁, ➂ above, examples at the boundary are likely to stay at the boundary, while examples at the non-boundary are likely to move toward the boundary as they are optimized. Why?
Because the iteration focuses on non-boundary cases, and $\alpha_1^{new}=\alpha_1-S\cdot\triangle\alpha_2$ moves such an example ($\alpha_1^{new}$) toward the boundary.
➃after ➀, ➁, ➂ are completed, the SMO algorithm starts to search the boundary cases and optimizes any that have become KKT violated due to the non-boundary subset optimization.
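The first heuristic can be sketched as the outer-loop control flow below, where `examine(i)` stands in for a hypothetical routine that tries to optimize a pair containing example `i` and returns 1 on success, 0 otherwise:

```python
def smo_outer_loop(n, alpha, C, examine):
    """First heuristic: alternate one full pass over the training set with
    passes over the non-boundary subset (0 < alpha_i < C), stopping when a
    full pass makes no progress."""
    examine_all = True
    num_changed = 0
    while num_changed > 0 or examine_all:
        num_changed = 0
        if examine_all:
            candidates = range(n)
        else:
            candidates = [i for i in range(n) if 0.0 < alpha[i] < C]
        for i in candidates:
            num_changed += examine(i)
        if examine_all:
            examine_all = False     # next, sweep only the non-boundary subset
        elif num_changed == 0:
            examine_all = True      # subset converged; re-check the whole set
    return alpha
```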
[Second heuristic]
It is the inner loop for $\alpha_2$.
Once $\alpha_1$ has been chosen, $\alpha_2$ is chosen by the criterion of maximizing the step size in each iteration:
$\underset{i,j}{max}\left|E_i-E_j\right|$, where $i$, $j$ would be $1$, $2$. Why?
Because it could speed up the convergence from the non-boundary location to the boundary side, where:
➀$\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta$;
➁$\alpha_1^{new}=\alpha_1-S\cdot \triangle\alpha_2$
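The second heuristic can be sketched as a search over the cached errors of the non-boundary examples (names are mine):

```python
def choose_second_alpha(i1, E, non_boundary):
    """Second heuristic: pick i2 maximizing |E[i1] - E[i2]| to take the
    largest step; returns None if no other candidate exists."""
    candidates = [j for j in non_boundary if j != i1]
    if not candidates:
        return None
    return max(candidates, key=lambda j: abs(E[i1] - E[j]))
```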
Before the SMO algorithm ends, positive progress means moving from $0<\alpha_i<C$ toward $\alpha_i\rightarrow0$ or $\alpha_i\rightarrow C$. From the non-boundary cases, we then go back to the whole training set for optimization by random pick.
At the end, this article would like to remind you that not every iteration makes positive progress: if $x_i=x_j$, then $\eta=0$!!! Skipping to another pair is the recommendation found in some code snippets.