SMO Framework And Algorithm
13 Dec 2017
SMO Framework
Let me briefly summarize the major points, from the initial idea to the final derivation of the objective function.
[1]Suppose we’d like to have a hyperplane with a safeguard that can classify the given data sample.
➀we formulate our problem as:
\min_w\frac{1}{2}w^t\cdot w, subject to y_i\cdot(w^t\cdot x_i-b)\geq1, \forall i.
The first part is the target, the second part is the constraint.
➁then, introduce \alpha_1,\alpha_2,\ldots,\alpha_n as the Lagrange multipliers and express the Lagrangian below as our objective function:
L(w,b,\alpha)=\frac{1}{2}w^t\cdot w-\sum_{i=1}^n\alpha_i\cdot(y_i\cdot(w^t\cdot x_i-b)-1)
➂next, to optimize the objective function for the optimal \alpha, we take the partial derivatives of L with respect to w and b respectively and equate them to zero. Finally, we get:
L(w,b,\alpha)=-\frac{1}{2}\sum_{i,j=1}^n\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot(x_i^t\cdot x_j)+\sum_{i=1}^n\alpha_i
where w=\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i. This design guarantees \alpha_i\geq0 for all i, a basic condition that must be satisfied in SMO.
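The recovered weight formula can be sanity-checked numerically. Below is a minimal sketch using a hypothetical 2-point data set whose optimal multipliers \alpha=(0.25, 0.25) were derived by hand for this toy data; the numbers are illustrative assumptions, not from the text:

```python
# Recover w = sum_i alpha_i * y_i * x_i for a hypothetical toy set,
# then verify both support vectors sit exactly on the margin.
X = [(1.0, 1.0), (-1.0, -1.0)]   # two training points (assumed)
y = [1.0, -1.0]                  # their labels
alpha = [0.25, 0.25]             # hand-derived optimal multipliers for this set
b = 0.0

# w = sum_i alpha_i * y_i * x_i
w = [sum(a * t * x[d] for a, t, x in zip(alpha, y, X)) for d in range(2)]

# each support vector satisfies y_i * (w.x_i - b) = 1 exactly
margins = [t * (sum(wd * xd for wd, xd in zip(w, x)) - b) for t, x in zip(y, X)]
print(w, margins)   # [0.5, 0.5] [1.0, 1.0]
```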
[2]Next, consider imperfect separation with noise. There exist conditions under which we don't strictly enforce that no data points lie between H1 and H2. We can extend SVM to allow some noise (data points) inside the safeguard zone. Thus, we want to penalize the data points that cross the boundaries (H1, H2).
➀we formulate our problem as:
\min_{w,\xi_i}[\frac{1}{2}w^t\cdot w+C\cdot\sum_{i=1}^n\xi_i], subject to y_i\cdot(w^t\cdot x_i-b)+\xi_i-1\geq0, \xi_i\geq0, \forall i
Such a design shrinks the effective distance between H1 and H2, thus allowing some noise within the original margin.
➁next, build the Lagrangian by introducing \alpha_1,\alpha_2,\ldots,\alpha_n and \mu_1,\mu_2,\ldots,\mu_n, then:
L(w,b,\xi,\alpha,\mu)=\frac{1}{2}\cdot w^t\cdot w+C\cdot\sum_{i=1}^n\xi_i-\sum_{i=1}^n\alpha_i\cdot(y_i\cdot(w^t\cdot x_i-b)+\xi_i-1)-\sum_{i=1}^n\mu_i\cdot\xi_i
The lagrangian expression turns out to be:
L(w,b,\xi,\alpha,\mu)=\frac{1}{2}\cdot w^t\cdot w+\sum_{i=1}^n(C-\alpha_i-\mu_i)\cdot\xi_i-\sum_{i=1}^n(\alpha_i\cdot y_i\cdot x_i^t)\cdot w+\sum_{i=1}^n\alpha_i\cdot y_i\cdot b+\sum_{i=1}^n\alpha_i
➂to get the extremum of L with respect to w, b and \xi, the constraints below must be satisfied for all i:
\frac{\partial L}{\partial w}=0, \frac{\partial L}{\partial b}=0, \frac{\partial L}{\partial\xi_i}=0.
We have w=\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i, \sum_{i=1}^n\alpha_i\cdot y_i=0,
and C-\alpha_i-\mu_i=0 holds for each i.
➃to maximize L(w,b,\xi,\alpha,\mu) for the optimal \alpha_i values, we formulate the dual problem as:
\max_\alpha L(w,b,\xi,\alpha,\mu)=\sum_{i=1}^n\alpha_i-\frac{1}{2}\cdot\sum_{i,j=1}^n\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j
, subject to \sum_{i=1}^n\alpha_i\cdot y_i=0 and 0\leq\alpha_i\leq C for all i
➄Note that \alpha_i\geq0 and \mu_i\geq0; together with C-\alpha_i-\mu_i=0, we have 0\leq\alpha_i\leq C:
\alpha_i is now upper bounded by C.
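To make the boxed dual concrete, here is a small numeric check of the dual objective on a hypothetical 2-point set; the data, \alpha values and C are illustrative assumptions:

```python
# Evaluate the dual objective
#   L(alpha) = sum_i alpha_i - 1/2 * sum_{i,j} alpha_i*alpha_j*y_i*y_j*(x_i.x_j)
# subject to sum_i alpha_i*y_i = 0 and 0 <= alpha_i <= C.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1.0, -1.0]
alpha = [0.25, 0.25]            # assumed optimal for this toy data
C = 1.0

dot = lambda u, v: sum(a * b for a, b in zip(u, v))
dual = sum(alpha) - 0.5 * sum(
    alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
    for i in range(2) for j in range(2))

# both constraints hold, and here the dual value equals 1/2*||w||^2
assert abs(sum(a * t for a, t in zip(alpha, y))) < 1e-12
assert all(0.0 <= a <= C for a in alpha)
print(dual)   # 0.25
```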
[3]Then, we introduce the KKT conditions:
➀\alpha_i=0, R_i\geq0
➁0<\alpha_i<C, R_i\approx0
➂\alpha_i=C, R_i\leq0
The above KKT cases are evaluated by the constraints below, under 0\leq\alpha_i\leq C:
➀\frac{\partial L}{\partial\xi_i}=C-\alpha_i-\mu_i=0
➁\alpha_i\cdot(y_i\cdot(w^t\cdot x_i-b)+\xi_i-1)=0
Also, we give the KKT violating cases:
➀\alpha_i<C and R_i<0
➁\alpha_i>0 and R_i>0
➂\alpha_i=0 and R_i<0
…by mjtsai
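The violating cases can be folded into one small predicate. This is a sketch assuming R_i denotes y_i\cdot E_i as in Platt's formulation, with a tolerance eps standing in for the \approx0 test:

```python
# KKT check for one example: alpha=0 wants R>=0, 0<alpha<C wants R~0,
# alpha=C wants R<=0; the violating cases are the complements below.
def violates_kkt(alpha_i, r_i, C, eps=1e-3):
    return (alpha_i < C and r_i < -eps) or (alpha_i > 0 and r_i > eps)

print(violates_kkt(0.0, -0.5, C=1.0))   # True:  alpha=0 but R<0
print(violates_kkt(0.5, 0.0,  C=1.0))   # False: non-boundary with R~0
print(violates_kkt(1.0, 0.5,  C=1.0))   # True:  alpha=C but R>0
```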
[4]Departure from noise
So far, we have our objective function as:
L(w,b,\xi,\alpha,\mu)=\sum_{i=1}^n\alpha_i-\frac{1}{2}\cdot\sum_{i,j=1}^n\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j
for all i, 0<\alpha_i<C, the non-boundary case.
[5]Works on 2 α’s at a time
The SMO algorithm works by manipulating 2 \alpha's at a time (with the others fixed), an approach a little like hill climbing. The 2 \alpha's are chosen at each step by heuristics.
To optimize α1, α2 with other α's fixed, the objective function could be rewritten as:
L(w,b,\xi,\alpha,\mu)=\alpha_1+\alpha_2+const-\frac{1}{2}\cdot(part_1+part_2)
All deduction is based on the equality below:
\alpha_1+S\cdot\alpha_2=const, i.e., \alpha_1=const-S\cdot\alpha_2,
where S=y_1\cdot y_2 and \alpha_1+S\cdot\alpha_2=\alpha_1^{old}+S\cdot\alpha_2^{old}; we denote this constant by r.
[6]Express \alpha_1 in terms of \alpha_2
The major purpose is to express the objective function in terms of the single variable \alpha_2^{new}; this is quite a painful process.
We'd like to toss out almost everything unrelated to \alpha_2^{new} and put it in the const term. The objective function becomes:
L(w,b,\xi,\alpha,\mu)=\frac{1}{2}\cdot(2K_{12}-K_{11}-K_{22})\cdot(\alpha_2^{new})^2+(1-S+S\cdot K_{11}\cdot r-S\cdot K_{12}\cdot r+y_2\cdot V_1-y_2\cdot V_2)\cdot\alpha_2^{new}+const
[7]Transform from L(w,b,\xi,\alpha,\mu) to L(\alpha_2^{new})
The optimization is much simpler, since only \alpha_2^{new} is left to be optimized.
The major spirit of SMO is to optimize 2 \alpha's at a time, transiting from the old \alpha's to the new \alpha's, more precisely, from \alpha_1^{old} to \alpha_1^{new}, and from \alpha_2^{old} to \alpha_2^{new}. First, we focus on \alpha_2^{new}, and later get \alpha_1^{new} after we have \alpha_2^{new}.
➀relate \alpha_2^{new} back to \alpha_2^{old}
➁introduce \eta by taking \eta=2K_{12}-K_{11}-K_{22}; \eta\leq0 must hold for \eta to be valid.
Now, we formulate our problem as an objective function of \eta, \alpha_2^{new}, \alpha_2^{old}, E_1^{old}, E_2^{old}:
L(w,b,\xi,\alpha,\mu)=L(\alpha_2^{new})=\frac{1}{2}\cdot\eta\cdot(\alpha_2^{new})^2+(y_2\cdot(E_1^{old}-E_2^{old})-\eta\cdot\alpha_2^{old})\cdot\alpha_2^{new}+const
[8]Examine the feasible range of the new \alpha value
For computation, \eta should be less than 0, although it might approach 0. We have the new \alpha_2 expressed below:
\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}{\eta}
The \alpha_2^{new} thus obtained might be an unconstrained maximum value, so the feasible range of \alpha_2^{new} must be checked.
Take the minimum feasible value of \alpha_2^{new} to be L and the maximum feasible value to be H, then:
\alpha_2^{new,clipped}=\begin{cases}L,&\text{if }\alpha_2^{new}<L\\\alpha_2^{new},&\text{if }L\leq\alpha_2^{new}\leq H\\H,&\text{if }\alpha_2^{new}>H\end{cases}
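The clipping step is easy to sketch in code. The bound formulas below follow Platt's standard derivation from 0\leq\alpha\leq C together with \alpha_1+S\cdot\alpha_2=const; they are assumed here rather than derived in the text, and the numeric inputs are illustrative:

```python
# Clip the unconstrained alpha_2 into its feasible box [L, H].
def feasible_bounds(a1, a2, y1, y2, C):
    if y1 != y2:                                   # S = -1
        return max(0.0, a2 - a1), min(C, C + a2 - a1)
    return max(0.0, a1 + a2 - C), min(C, a1 + a2)  # S = +1

def clip(a2_new, lo, hi):
    return lo if a2_new < lo else hi if a2_new > hi else a2_new

lo, hi = feasible_bounds(0.25, 0.5, 1, -1, C=1.0)
print(lo, hi, clip(1.7, lo, hi))   # 0.25 1.0 1.0
```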
We can finally clip the new \alpha_1, \alpha_2 values, with sequential updating of b and w. Take \triangle\alpha_2=\frac{y_2\cdot(E_2^{old}-E_1^{old})}{\eta}.
Recall that \alpha_1=r-S\cdot\alpha_2; it holds for both the new and old values. Therefore,
\triangle\alpha_1=(r-S\cdot(\alpha_2+\triangle\alpha_2))-(r-S\cdot\alpha_2)
\Rightarrow\triangle\alpha_1=-S\cdot\triangle\alpha_2
, where \triangle\alpha_1, \triangle\alpha_2 are the changes in \alpha_1, \alpha_2. This is quite a tedious, complicated process, but a beautiful framework!!
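The whole two-\alpha analytic step above condenses into a few lines. This is a sketch; the kernel values and error inputs in the example call are illustrative assumptions:

```python
# One unclipped analytic step: eta = 2*K12 - K11 - K22 (<= 0 for a valid
# kernel), alpha_2 moves by y2*(E2 - E1)/eta, and alpha_1 by -S times that.
def two_alpha_step(a1, a2, y1, y2, E1, E2, K11, K12, K22):
    eta = 2.0 * K12 - K11 - K22
    d2 = y2 * (E2 - E1) / eta   # change in alpha_2 (before clipping)
    d1 = -(y1 * y2) * d2        # S = y1*y2, so delta_alpha_1 = -S*delta_alpha_2
    return a1 + d1, a2 + d2

print(two_alpha_step(0.0, 0.0, 1.0, -1.0, -1.0, 1.0, 4.0, -4.0, 4.0))
# (0.125, 0.125)
```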
SMO Algorithm
We have all been given the concept that SMO is a process that picks 2 \alpha's at a time; such behavior is driven by the SMO heuristics. The implementation, and even the algorithm flow, exists in many varieties. In this section, I'd like to review the algorithm in SMO's original design. As usual, you can find one outer loop and one inner loop in the whole flow.
[First heuristic]
It is the outer loop for \alpha_1.
➀iterate over the entire training set; if an example is found to violate the KKT conditions, it is eligible for optimization.
➁search the entire training set for non-boundary examples, those with 0<\alpha<C (i.e., \alpha\neq0 and \alpha\neq C); if such an example is found to violate the KKT conditions, it is eligible for optimization.
➂keep searching non-boundary examples, until all non-boundary examples obey the KKT conditions within \varepsilon=0.001. The non-boundary case guarantees:
\left|(w^{new})^t\cdot x_i-y_i\right|\leq\varepsilon, for i=1,2, and \varepsilon is a rather tiny quantity.
By the above ➀, ➁, ➂, examples at the boundary are likely to stay at the boundary; examples at the non-boundary are likely to move toward the boundary as they are optimized. Why?
Because the iteration focuses on non-boundary cases, and \alpha_1^{new}=\alpha_1-S\cdot\triangle\alpha_2 will move such an example (\alpha_1^{new}) toward the boundary.
➃after ➀, ➁, ➂ are completed, the SMO algorithm will start to search the boundary cases and optimize any that have become KKT violated due to the non-boundary subset optimization.
[Second heuristic]
It is the inner loop for \alpha_2.
Once \alpha_1 has been chosen, \alpha_2 is chosen by the criterion of maximizing the step size in each iteration:
\underset{i,j}{max}\left|E_i-E_j\right|, where i, j would be 1, 2. Why?
Because it speeds up the convergence from the non-boundary location to the boundary side, where:
➀\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta;
➁\alpha_1^{new}=\alpha_1-S\cdot \triangle\alpha_2
Before the SMO algorithm ends, positive progress means moving from 0<\alpha_i<C to \alpha_i\rightarrow0 or \alpha_i\rightarrow C. After the non-boundary cases, it then goes back to the whole training set for optimization by random pick.
At the end, this article would like to remind you that not all iterations make positive progress: if x_i=x_j, then \eta=0!!! Skipping to another pair is the recommendation found in some code snippets.
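Putting the two heuristics and the analytic step together, the sketch below follows the simplified SMO variant often used for teaching rather than Platt's full implementation; the toy data, C, tolerance and pass limit are assumptions, and note it uses the f(x)=\sum_k\alpha_k y_k K(x_k,x)+b sign convention common in simplified write-ups, whereas the text above writes w^t\cdot x-b:

```python
# Simplified SMO on a tiny linearly separable set: the outer loop scans all
# examples for KKT violators; the inner choice maximizes |E_i - E_j|.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def smo_train(X, y, C=10.0, tol=1e-3, max_passes=5):
    n = len(X)
    alpha, b = [0.0] * n, 0.0

    def f(x):  # decision value with the current alpha, b
        return sum(alpha[k] * y[k] * dot(X[k], x) for k in range(n)) + b

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E = [f(X[k]) - y[k] for k in range(n)]
            # first heuristic: skip examples that already satisfy KKT
            if not ((y[i] * E[i] < -tol and alpha[i] < C) or
                    (y[i] * E[i] > tol and alpha[i] > 0)):
                continue
            # second heuristic: pick j maximizing the step size |E_i - E_j|
            j = max((k for k in range(n) if k != i), key=lambda k: abs(E[i] - E[k]))
            if y[i] != y[j]:
                lo, hi = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
            else:
                lo, hi = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
            if lo == hi:
                continue
            Kii, Kjj, Kij = dot(X[i], X[i]), dot(X[j], X[j]), dot(X[i], X[j])
            eta = 2.0 * Kij - Kii - Kjj
            if eta >= 0:            # degenerate case (e.g. x_i == x_j): skip
                continue
            ai_old, aj_old = alpha[i], alpha[j]
            aj = aj_old - y[j] * (E[i] - E[j]) / eta    # analytic step
            alpha[j] = min(hi, max(lo, aj))             # clip into [lo, hi]
            if abs(alpha[j] - aj_old) < 1e-8:
                continue
            alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])  # -S * delta
            # threshold update keeps KKT consistent for the pair just changed
            b1 = b - E[i] - y[i] * (alpha[i] - ai_old) * Kii - y[j] * (alpha[j] - aj_old) * Kij
            b2 = b - E[j] - y[i] * (alpha[i] - ai_old) * Kij - y[j] * (alpha[j] - aj_old) * Kjj
            b = b1 if 0 < alpha[i] < C else b2 if 0 < alpha[j] < C else (b1 + b2) / 2.0
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

X = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (0.0, -2.0)]
y = [1.0, 1.0, -1.0, -1.0]
alpha, b = smo_train(X, y)
preds = [1.0 if sum(alpha[k] * y[k] * dot(X[k], x) for k in range(4)) + b > 0 else -1.0
         for x in X]
print(preds)   # [1.0, 1.0, -1.0, -1.0]
```

On this toy set the solver converges in one effective pass, leaving every example as a support vector with equal multipliers, which matches the symmetry of the data.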