mjtsai1974's Dev Blog Welcome to mjt's AI world

SMO Framework And Algorithm

The SMO algorithm is the realization of sequentially updating the mandatory coefficients in the objective function. It builds on the long and tedious deduction of the dual problem, optimizing in accordance with the KKT conditions, which in turn construct the boundaries of the safeguard.

SMO Framework

Let me briefly summarize the major points, from the initial idea to the final deduction of the objective function.

[1]Suppose we’d like to have a hyperplane with a safeguard that can classify the given data samples.
➀we formulate our problem as:
$\min_w\frac{1}{2}w^t w$, subject to $y_i\cdot(w^t x_i-b)\geq1,\forall i$.
The first part is the target, the second part is the constraint.
➁then, introduce $\alpha_1,\alpha_2,\ldots,\alpha_n$ to be the Lagrange multipliers and express the Lagrangian below to be our objective function:
$L(w,b,\alpha)=\frac{1}{2}w^t w-\sum_{i=1}^n\alpha_i\cdot(y_i\cdot(w^t x_i-b)-1)$
➂next, to optimize the objective function and get the optimal $\alpha$, we should take the partial derivatives of $L$ on $w$, $b$ respectively and equate them to zero. Finally, we get:
$L(w,b,\alpha)=-\frac{1}{2}\sum_{i,j=1}^n\alpha_i\alpha_j y_i y_j(x_i^t x_j)+\sum_{i=1}^n\alpha_i$
where $w=\sum_{i=1}^n\alpha_i y_i x_i$.

Such a design guarantees that we could have $\alpha_i>0$; it is one basic condition that must be satisfied in SMO.
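The dual objective and the recovered weight vector above can be checked numerically. Below is a minimal NumPy sketch; the helper names `dual_objective` and `weight_vector` are mine, not from any library:

```python
import numpy as np

def dual_objective(alpha, X, y):
    # L(alpha) = sum_i alpha_i - 1/2 * sum_ij alpha_i*alpha_j*y_i*y_j*(x_i^t x_j)
    K = X @ X.T          # Gram matrix of inner products x_i^t x_j
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v

def weight_vector(alpha, X, y):
    # w = sum_i alpha_i * y_i * x_i, from the stationarity condition on w
    return (alpha * y) @ X

# tiny 2-point example: one positive, one negative sample
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
print(dual_objective(alpha, X, y))  # 1.0 - 0.5*0.5 = 0.75
print(weight_vector(alpha, X, y))   # [ 0.5 -0.5]
```

With orthogonal unit points the Gram matrix is the identity, so the quadratic term is easy to verify by hand.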

[2]Next, the imperfect separation with noise. There exist conditions where we don’t strictly enforce that no data points lie between $H_1$ and $H_2$. We can extend SVM to allow some noise (data points) in between the safeguard zone. Thus, we want to penalize the data points that cross the boundaries ($H_1$, $H_2$).
➀we formulate our problem as:
$\min_{w,\xi_i}\left[\frac{1}{2}w^t w+C\sum_{i=1}^n\xi_i\right]$,

subject to

$y_i\cdot(w^t x_i-b)+\xi_i-1\geq0$, $\xi_i\geq0$
Such a design shrinks down the distance between $H_1$ and $H_2$, thus allowing some noise within the original margin.
➁next, build the Lagrangian by introducing $\alpha_1,\alpha_2,\ldots,\alpha_n$ and $\mu_1,\mu_2,\ldots,\mu_n$, then:
$L(w,b,\xi,\alpha,\mu)=\frac{1}{2}w^t w+C\sum_{i=1}^n\xi_i-\sum_{i=1}^n\alpha_i\cdot(y_i\cdot(w^t x_i-b)+\xi_i-1)-\sum_{i=1}^n\mu_i\xi_i$
The Lagrangian expression turns out to be:
$L(w,b,\xi,\alpha,\mu)=\frac{1}{2}w^t w+\sum_{i=1}^n(C-\alpha_i-\mu_i)\xi_i-\sum_{i=1}^n(\alpha_i y_i x_i^t)\cdot w+\sum_{i=1}^n\alpha_i y_i b+\sum_{i=1}^n\alpha_i$
➂to get the maximum of $L$ at $\alpha$ and $\xi$, the constraints below must be satisfied for all $i$:
$\frac{\partial L}{\partial w}=0$, $\frac{\partial L}{\partial b}=0$, $\frac{\partial L}{\partial\xi}=0$.
We have $w=\sum_{i=1}^n\alpha_i y_i x_i$, $\sum_{i=1}^n\alpha_i y_i=0$,
and $\sum_{i=1}^n(C-\alpha_i-\mu_i)=0$; for all $i$, $C-\alpha_i-\mu_i=0$ just holds.
➃to maximize $L(w,b,\xi,\alpha,\mu)$ for the optimal $\alpha_i$ values, we then formulate the Lagrangian dual as:
$\max_\alpha L(w,b,\xi,\alpha,\mu)=\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n\alpha_i\alpha_j y_i y_j x_i^t x_j$,
subject to $\sum_{i=1}^n\alpha_i y_i=0$ and $0\leq\alpha_i\leq C$ for all $i$.
➄note that $\alpha_i\geq0$, $\mu_i\geq0$; therefore, we have $0\leq\alpha_i\leq C$:
$\alpha_i$ is now upper bounded by $C$.

[3]Then, we introduce the KKT conditions:
$\alpha_i=0$, $R_i\geq0$
$0<\alpha_i<C$, $R_i=0$
$\alpha_i=C$, $R_i\leq0$
The above KKT cases are evaluated by the constraints below, under $0\leq\alpha_i\leq C$:
$\frac{\partial L}{\partial\xi}=C-\alpha_i-\mu_i=0$
$\alpha_i\cdot(y_i\cdot(w^t x_i-b)+\xi_i-1)=0$

Also, we give the KKT violating cases:
$\alpha_i<C$ and $R_i<0$
$\alpha_i>0$ and $R_i>0$
$\alpha_i=0$ and $R_i>0$…by mjtsai
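The first two violating cases above translate directly into the candidate-selection test found in Platt-style SMO code. A minimal sketch, assuming $R_i$ is the precomputed quantity $y_i E_i$ and using a small tolerance as Platt's pseudocode does (the function name `violates_kkt` is mine):

```python
def violates_kkt(alpha_i, R_i, C, tol=1e-3):
    # KKT is violated when alpha_i could still usefully move:
    #   alpha_i < C while R_i < 0  (the multiplier should grow)
    #   alpha_i > 0 while R_i > 0  (the multiplier should shrink)
    return (alpha_i < C and R_i < -tol) or (alpha_i > 0 and R_i > tol)

print(violates_kkt(0.5, -0.5, C=1.0))  # True : alpha < C and R < 0
print(violates_kkt(1.0, -0.5, C=1.0))  # False: alpha at C, R <= 0 is allowed
print(violates_kkt(0.2,  0.0, C=1.0))  # False: non-boundary with R = 0 obeys KKT
```

An example that passes this test is eligible for optimization in the outer loop.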

[4]Departure from noise
So far, we have our objective function as:
$L(w,b,\xi,\alpha,\mu)=\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n\alpha_i\alpha_j y_i y_j x_i^t x_j$
for all $i$, $0<\alpha_i<C$, the non-boundary case.

[5]Works on 2 $\alpha$’s at a time
The SMO algorithm works by manipulating 2 $\alpha$’s at a time (with the others fixed), a little hill-climbing-like approach; heuristics are used to choose the 2 $\alpha$’s at a time.
To optimize $\alpha_1$, $\alpha_2$ with the other $\alpha$’s fixed, the objective function could be rewritten as:
$L(w,b,\xi,\alpha,\mu)=\alpha_1+\alpha_2+const-\frac{1}{2}(part_1+part_2)$
All deduction is based on the equality below:
$\alpha_1+S\cdot\alpha_2=const$, $\alpha_1=const-S\cdot\alpha_2$, where $S=y_1 y_2$,
and $\alpha_1+S\cdot\alpha_2=\alpha_1^{old}+S\cdot\alpha_2^{old}$.

[6]Express $\alpha_1$ in terms of $\alpha_2$
The major purpose is to express the objective function in terms of the single $\alpha_2^{new}$; this is quite a painful process.
We’d like to toss out almost everything non-related to $\alpha_2^{new}$ and put it in the $const$ term. The objective function becomes:
$L(w,b,\xi,\alpha,\mu)=\frac{1}{2}(2K_{12}-K_{11}-K_{22})(\alpha_2^{new})^2+(1-S+S\cdot K_{11}\cdot r-S\cdot K_{12}\cdot r+y_2 V_1-y_2 V_2)\cdot\alpha_2^{new}+const$

[7]Transform from $L(w,b,\xi,\alpha,\mu)$ to $L(\alpha_2^{new})$
The major spirit of SMO is to optimize 2 $\alpha$’s at a time, by transiting from the old $\alpha$’s to the new $\alpha$’s; more precisely, from $\alpha_1^{old}$ to $\alpha_1^{new}$, and from $\alpha_2^{old}$ to $\alpha_2^{new}$. First, we focus on $\alpha_2^{new}$, and later get $\alpha_1^{new}$ after we have $\alpha_2^{new}$.
➀relate $\alpha_2^{new}$ back to $\alpha_2^{old}$
➁introduce $\eta$ by taking $\eta=2K_{12}-K_{11}-K_{22}$; $\eta\leq0$ establishes the validity of $\eta$.
Now, we formulate our problem as an objective function of $\eta$, $\alpha_2^{new}$, $\alpha_2^{old}$, $E_1^{old}$, $E_2^{old}$:
$L(w,b,\xi,\alpha,\mu)=L(\alpha_2^{new})=\frac{1}{2}\eta\cdot(\alpha_2^{new})^2+(y_2\cdot(E_1^{old}-E_2^{old})-\eta\cdot\alpha_2^{old})\cdot\alpha_2^{new}+const$

It is much simpler in its optimization; only $\alpha_2^{new}$ is left to be optimized.
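Setting the derivative of $L(\alpha_2^{new})$ to zero gives the unconstrained update of the next section. A small sketch of exactly that step (function names are mine):

```python
def compute_eta(K11, K22, K12):
    # eta = 2*K12 - K11 - K22; for distinct points under a valid kernel this is negative
    return 2.0 * K12 - K11 - K22

def alpha2_unconstrained(a2_old, y2, E1_old, E2_old, eta):
    # stationary point of (1/2)*eta*a^2 + (y2*(E1-E2) - eta*a2_old)*a:
    #   eta*a + y2*(E1-E2) - eta*a2_old = 0
    #   a = a2_old + y2*(E2-E1)/eta
    return a2_old + y2 * (E2_old - E1_old) / eta

eta = compute_eta(K11=1.0, K22=1.0, K12=0.0)   # -2.0
print(alpha2_unconstrained(0.5, y2=1.0, E1_old=0.2, E2_old=-0.2, eta=eta))  # 0.7
```

Since $\eta<0$, the quadratic is concave and the stationary point is indeed a maximum.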

[8]Examine the feasible range of the new $\alpha$ value

For computation, $\eta$ should be less than 0, although it might approach 0. We have the new $\alpha_2$ expressed below:
$\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}{\eta}$
The $\alpha_2^{new}$ thus obtained might be an unconstrained maximum value, so the feasible range of $\alpha_2^{new}$ must be checked.
Take the minimum feasible value of $\alpha_2^{new}$ to be $L$, and the maximum feasible value of $\alpha_2^{new}$ to be $H$; then:
$\alpha_2^{new,clipped}=\begin{cases}L, & \text{if }\alpha_2^{new}<L\\\alpha_2^{new}, & \text{if }L<\alpha_2^{new}<H\\H, & \text{if }\alpha_2^{new}>H\end{cases}$
We can finally go on to clip the new $\alpha_1$, $\alpha_2$ values, with sequential updating of $b$ and $w$.
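The bounds $L$ and $H$ follow from the box constraint $0\leq\alpha\leq C$ combined with $\alpha_1+S\cdot\alpha_2=const$; the case split on the labels is the standard one from Platt's pseudocode. A minimal sketch (function names are mine):

```python
def feasible_bounds(a1_old, a2_old, y1, y2, C):
    # alpha_1 + S*alpha_2 is conserved, with S = y1*y2
    if y1 != y2:   # S = -1: alpha_2 - alpha_1 is constant
        L = max(0.0, a2_old - a1_old)
        H = min(C, C + a2_old - a1_old)
    else:          # S = +1: alpha_1 + alpha_2 is constant
        L = max(0.0, a1_old + a2_old - C)
        H = min(C, a1_old + a2_old)
    return L, H

def clip(a2_new, L, H):
    # alpha_2^{new,clipped}
    return max(L, min(H, a2_new))

L, H = feasible_bounds(0.3, 0.5, y1=1, y2=-1, C=1.0)
print(L, H)            # 0.2 1.0
print(clip(1.4, L, H)) # 1.0 -> clipped at H
print(clip(0.1, L, H)) # 0.2 -> clipped at L
```

If `L == H`, the pair admits no progress and a different pair should be chosen.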

Take $\triangle\alpha_2=\frac{y_2\cdot(E_2^{old}-E_1^{old})}{\eta}$.
Recall that $\alpha_1=r-S\cdot\alpha_2$; it holds for both new and old. Therefore,
$\triangle\alpha_1=\lim_{\triangle\rightarrow0}\frac{(r-S\cdot(\alpha_2+\triangle))-(r-S\cdot\alpha_2)}{\triangle}\cdot\triangle\alpha_2$
$\Rightarrow\triangle\alpha_1=-S\cdot\lim_{\triangle\rightarrow0}\frac{(\alpha_2+\triangle)-\alpha_2}{\triangle}\cdot\triangle\alpha_2$
$\Rightarrow\triangle\alpha_1=-S\cdot\triangle\alpha_2$,
where $\triangle\alpha_1$, $\triangle\alpha_2$ are the differentials of $\alpha_1$, $\alpha_2$.
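The relation $\triangle\alpha_1=-S\cdot\triangle\alpha_2$ is exactly what keeps $\alpha_1+S\cdot\alpha_2$ constant across an update; a tiny numeric check (the helper name `update_pair` is mine):

```python
def update_pair(a1, a2, d2, S):
    # apply a step d2 to alpha_2 and the induced step -S*d2 to alpha_1
    return a1 - S * d2, a2 + d2

a1, a2, S = 0.4, 0.6, -1.0
a1_new, a2_new = update_pair(a1, a2, 0.25, S)
# the linear constraint alpha_1 + S*alpha_2 is preserved by the paired update
print(abs((a1_new + S * a2_new) - (a1 + S * a2)) < 1e-12)  # True
```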

This is quite a tedious, complicated process, but a beautiful framework!!

SMO Algorithm

We have all been given the concept that SMO is the process of picking 2 $\alpha$’s at a time; such behavior is called the SMO heuristics. A lot of variety can be found in the implementations and even in the algorithm flow. In this section, I’d like to review the algorithm in SMO’s original design. As usual, you can find one outer loop and one inner loop in the whole flow.

[First heuristic]
It is the outer loop for \alpha_1.
➀iterate over the whole training set; if an example is examined as KKT-violated, it is eligible for optimization.
➁search the entire training set for non-boundary examples, whose $\alpha\neq0$ and $\alpha\neq C$, i.e., $0<\alpha<C$; if such an example is found and is KKT-violated, it is eligible for optimization.
➂keep searching non-boundary examples, until all non-boundary examples obey the KKT conditions within $\varepsilon=0.001$.

Where the non-boundary case guarantees:

$\left|(w^{new})^t\cdot x_i-y_i\right|\leq\varepsilon$,

for $i=1,2$, and $\varepsilon$ is a rather tiny quantity.

By ➀, ➁, ➂ above, examples at the boundary are likely to stay at the boundary; examples at the non-boundary are likely to move toward the boundary as they are optimized.

Why?
Because the iteration focuses on non-boundary cases, and $\alpha_1^{new}=\alpha_1-S\cdot\triangle\alpha_2$ will move this example ($\alpha_1^{new}$) toward the boundary.

➃after ➀, ➁, ➂ are completed, the SMO algorithm will start to search the boundary cases and optimize each one if it has become KKT-violated due to the non-boundary subset optimization.

[Second heuristic]
It is the inner loop for \alpha_2.
Once $\alpha_1$ has been chosen, $\alpha_2$ is chosen based on the criterion of maximizing the step size in each iteration:
$\underset{i,j}{\max}\left|E_i-E_j\right|$, where $i$, $j$ would be 1, 2. Why?
Because it speeds up the convergence from the non-boundary location to the boundary side, where:
$\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}{\eta}$;
$\alpha_1^{new}=\alpha_1-S\cdot\triangle\alpha_2$
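The second-choice heuristic above amounts to an argmax over cached errors; a minimal sketch (the function name and the `errors` cache are mine, not from the original pseudocode):

```python
def choose_second(i1, errors, non_boundary):
    # pick i2 maximizing |E_i1 - E_i2| over the non-boundary candidates
    E1 = errors[i1]
    best_j, best_gap = None, -1.0
    for j in non_boundary:
        if j == i1:
            continue
        gap = abs(E1 - errors[j])
        if gap > best_gap:
            best_j, best_gap = j, gap
    return best_j

errors = [0.5, -0.3, 0.9, 0.1]
print(choose_second(0, errors, [1, 2, 3]))  # 1, since |0.5 - (-0.3)| = 0.8 is largest
```

The intuition: a large $|E_1-E_2|$ gives a large $\triangle\alpha_2$, hence a bigger step per iteration.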

Before the ending of the SMO algorithm, positive progress means moving from $0<\alpha_i<C$ to $\alpha_i\rightarrow0$ or $\alpha_i\rightarrow C$. From the non-boundary case, we then go back to the whole training set for optimization by random pick.
At the end, this article would like to remind you that not every iteration makes positive progress: if $x_i=x_j$, then $\eta=0$!!! Skipping to another case is the recommendation found in some code snippets.
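Putting the pieces together, here is a compact training loop in the spirit of the algorithm above. It is a sketch of the well-known simplified SMO variant, which picks the second $\alpha$ at random rather than by the $|E_1-E_2|$ heuristic; all names are mine, the bias convention is $f(x)=w\cdot x+b$ (the text uses $w^t x-b$), and note the `eta >= 0` guard for the degenerate $x_i=x_j$ case:

```python
import random
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10, seed=7):
    rng = random.Random(seed)
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    K = X @ X.T                                   # linear-kernel Gram matrix
    f = lambda i: (alpha * y) @ K[:, i] + b       # current decision value
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = f(i) - y[i]
            # KKT violation check (the two main cases from the text)
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(n) if k != i])
                Ej = f(j) - y[j]
                ai, aj = alpha[i], alpha[j]
                if y[i] != y[j]:                  # S = -1
                    L, H = max(0.0, aj - ai), min(C, C + aj - ai)
                else:                             # S = +1
                    L, H = max(0.0, ai + aj - C), min(C, ai + aj)
                if L == H:
                    continue
                eta = 2.0 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:                      # degenerate pair (e.g. x_i == x_j): skip
                    continue
                alpha[j] = np.clip(aj - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj) < 1e-5:     # no real progress on this pair
                    continue
                alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])   # delta_a1 = -S*delta_a2
                # update b so KKT holds on whichever multiplier stays interior
                b1 = b - Ei - y[i]*(alpha[i]-ai)*K[i, i] - y[j]*(alpha[j]-aj)*K[i, j]
                b2 = b - Ej - y[i]*(alpha[i]-ai)*K[i, j] - y[j]*(alpha[j]-aj)*K[j, j]
                b = b1 if 0 < alpha[i] < C else b2 if 0 < alpha[j] < C else (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X
    return w, b

# toy linearly separable set: three positives, three negatives
X = np.array([[2.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, b = simplified_smo(X, y)
print(np.sign(X @ w + b))  # should classify the training set correctly
```

For a separable toy set like this, the optimal multipliers stay strictly below $C=1$, so the box constraint never binds and the loop converges to the hard-margin solution.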