
Inside Sequential Minimal Optimization


SMO (Sequential Minimal Optimization) is an optimization procedure that sequentially minimizes the cost of the objective function, working on the smallest possible subset of variables at each step.

Departure With Noise

We’d like to begin with imperfect separation, since no sample set is fully clean and perfectly separable. Given the support-vector-formulated problem expressed in terms of \alpha_i, y_i, x_i, the objective function is:
L(w,b,\xi,\alpha,\mu)=\sum_{i=1}^n\alpha_i-\frac12\cdot\sum_{i,j=1}^n\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j
subject to \sum_{i=1}^n\alpha_i\cdot y_i=0 and 0\leq\alpha_i\leq C for all i, where 0<\alpha_i<C is the non-boundary case.
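To make the dual objective concrete, below is a minimal numpy sketch of it under a linear kernel; the function name dual_objective and the argument names alpha, y, X are my own, not part of the original derivation.

```python
import numpy as np

def dual_objective(alpha, y, X):
    """L(alpha) = sum_i alpha_i - 1/2 * sum_ij alpha_i*alpha_j*y_i*y_j*(x_i^t x_j)."""
    K = X @ X.T                                  # Gram matrix of linear-kernel entries x_i^t x_j
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)
```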

Works On 2 α’s At A Time

The SMO algorithm works by manipulating 2 α's at a time (with all the others fixed), somewhat like a hill-climbing approach. The 2 α's to work on at each step are chosen by heuristics.

At the beginning of the whole flow, initialize \alpha_i=0 for i=1 to n. Suppose we randomly choose \alpha_1, \alpha_2 and denote them as \alpha_1^{old}, \alpha_2^{old}.
Due to \sum_{i=1}^n\alpha_i\cdot y_i=0, we have:
y_1\cdot\alpha_1+y_2\cdot\alpha_2=y_1\cdot\alpha_1^{old}+y_2\cdot\alpha_2^{old}
This confines the optimization of \alpha_1, \alpha_2 to a line.

Next, examine the relation between these 2 \alpha_1, \alpha_2, whether old or new.
➀for y_1\neq y_2, then \alpha_1-\alpha_2=r, where r is a constant.
➁for y_1=y_2, then \alpha_1+\alpha_2=r.
Take S=y_1\cdot y_2 and multiply y_1\cdot\alpha_1+y_2\cdot\alpha_2=const by y_1, then
\alpha_1+S\cdot\alpha_2=const, and we have:
\alpha_1=const-S\cdot\alpha_2,
where \alpha_1+S\cdot\alpha_2=\alpha_1^{old}+S\cdot\alpha_2^{old}.

Next, to optimize \alpha_1, \alpha_2 with the other α's fixed, the objective function could be rewritten as:
L(w,b,\xi,\alpha,\mu)=\alpha_1+\alpha_2+const-\frac12\cdot(part_1+part_2)

➀we further express part_1:
part_1=\sum_{i,j=1,2}\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j=\alpha_1^2\cdot y_1^2\cdot x_1^t\cdot x_1+\alpha_2^2\cdot y_2^2\cdot x_2^t\cdot x_2+2\cdot\alpha_1\cdot\alpha_2\cdot y_1\cdot y_2\cdot x_1^t\cdot x_2

➁we express part_2 as:
part_2=\sum_{i=3}^{n}\sum_{j=1,2}\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j+\sum_{i=1,2}\sum_{j=3}^{n}\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j=2\cdot(\sum_{i=3}^{n}\alpha_i\cdot y_i\cdot x_i^t)\cdot(\alpha_1\cdot y_1\cdot x_1+\alpha_2\cdot y_2\cdot x_2)

Express \alpha_1 In Terms Of \alpha_2

Next, we express \alpha_1 in terms of \alpha_2. This is a quite painful and confusing process; it took me some while, so now I’d like to share my viewpoint on each step of the deduction.

From the current objective function, we can see that x_i always appears paired with x_j.
Thus, we take K_{11}=x_1^t\cdot x_1, K_{22}=x_2^t\cdot x_2, K_{12}=x_1^t\cdot x_2.
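As a tiny illustration with hypothetical sample points of my own choosing, these kernel entries are just dot products:

```python
import numpy as np

x1 = np.array([1.0, 2.0])                    # hypothetical sample points, for illustration only
x2 = np.array([0.5, -1.0])
K11, K22, K12 = x1 @ x1, x2 @ x2, x1 @ x2    # K_11 = x_1^t x_1, K_22 = x_2^t x_2, K_12 = x_1^t x_2
print(K11, K22, K12)                         # 5.0 1.25 -1.5
```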

Recall that V_j is obtained by removing the old \alpha_1, \alpha_2 associated terms from w; here comes the question, can we relate the old w to the newly evolved \alpha_1, \alpha_2?
V_j=\sum_{i=3}^{n}\alpha_i\cdot y_i\cdot x_i^t\cdot x_j
=(w^{old})^t\cdot x_j-\alpha_1^{old}\cdot y_1\cdot x_1^t\cdot x_j-\alpha_2^{old}\cdot y_2\cdot x_2^t\cdot x_j
=(w^{old})^t\cdot x_j-b^{old}+b^{old}-\alpha_1^{old}\cdot y_1\cdot x_1^t\cdot x_j-\alpha_2^{old}\cdot y_2\cdot x_2^t\cdot x_j
=U_j+b^{old}-\alpha_1^{old}\cdot y_1\cdot x_1^t\cdot x_j-\alpha_2^{old}\cdot y_2\cdot x_2^t\cdot x_j

where U_j=(w^{old})^t\cdot x_j-b^{old} is the output of x_j under the old parameters w^{old}, b^{old}. Expressing V_j this way, we hope to refine the original objective function.

To keep the notation clear: the subscript of a term indicates its index (in this proof, usually i, j or 1, 2…), while the superscript is used to identify whether a term is old or newly clipped.

For simplicity of deduction and understanding, if a term carries no superscript like old or new, it is treated as a new term. Why is the default the new version? Because the SMO algorithm by design clips 2 new α's at a time, so the objective function will be refined as an expression of the 2 new α's.

➀deduce by replacing the above terms in the objective function:
L(w,b,\xi,\alpha,\mu)=\alpha_1+\alpha_2+const-\frac12\cdot(K_{11}\cdot\alpha_1^2+K_{22}\cdot\alpha_2^2+2\cdot S\cdot K_{12}\cdot\alpha_1\cdot\alpha_2+2\cdot\alpha_1\cdot y_1\cdot V_1+2\cdot\alpha_2\cdot y_2\cdot V_2)
➁next, replace \alpha_1 with \alpha_2; more precisely, replace the new \alpha_1 with the new \alpha_2:
=(r-S\cdot\alpha_2)+\alpha_2+const-\frac12\cdot(K_{11}\cdot(r-S\cdot\alpha_2)^2+K_{22}\cdot\alpha_2^2+2\cdot S\cdot K_{12}\cdot(r-S\cdot\alpha_2)\cdot\alpha_2+2\cdot y_1\cdot V_1\cdot(r-S\cdot\alpha_2)+2\cdot\alpha_2\cdot y_2\cdot V_2)
➂toss out the r in the leading term (r-S\cdot\alpha_2) (or just absorb it into the const term), since it is a constant with no \alpha_2 attached; the r appearing in the remaining terms should be retained:
=(1-S)\cdot\alpha_2+const-\frac12\cdot(K_{11}\cdot(r-S\cdot\alpha_2)^2+K_{22}\cdot\alpha_2^2+2\cdot S\cdot K_{12}\cdot(r-S\cdot\alpha_2)\cdot\alpha_2+2\cdot y_1\cdot V_1\cdot(r-S\cdot\alpha_2)+2\cdot\alpha_2\cdot y_2\cdot V_2)
➃expand in terms of \alpha_2:
=(1-S)\cdot\alpha_2+const-\frac12\cdot K_{11}\cdot r^2+K_{11}\cdot r\cdot S\cdot\alpha_2-\frac12\cdot K_{11}\cdot S^2\cdot\alpha_2^2-\frac12\cdot K_{22}\cdot\alpha_2^2-S\cdot K_{12}\cdot r\cdot\alpha_2+S^2\cdot K_{12}\cdot\alpha_2^2-y_1\cdot V_1\cdot r+y_1\cdot V_1\cdot S\cdot\alpha_2-y_2\cdot V_2\cdot\alpha_2
➄since we’d like to express \alpha_1 in terms of \alpha_2 in the newly evolved objective function, it is a good idea to absorb the r terms not associated with \alpha_2 into the const term, where S^2=1 and the terms -\frac12\cdot K_{11}\cdot r^2, -y_1\cdot V_1\cdot r are tossed out:
=(1-S)\cdot\alpha_2+const+K_{11}\cdot r\cdot S\cdot\alpha_2-\frac12\cdot K_{11}\cdot S^2\cdot\alpha_2^2-\frac12\cdot K_{22}\cdot\alpha_2^2-S\cdot K_{12}\cdot r\cdot\alpha_2+S^2\cdot K_{12}\cdot\alpha_2^2+y_1\cdot V_1\cdot S\cdot\alpha_2-y_2\cdot V_2\cdot\alpha_2
➅it seems that the objective function has been well refined with only \alpha_2 in it, but take a look at the term y_1\cdot V_1\cdot S\cdot\alpha_2: it consists of y_1 and \alpha_2, whereas y_1 is the sign of \alpha_1 and \alpha_2 should be associated with y_2; here comes the tricky factor S=y_1\cdot y_2:
y_1\cdot V_1\cdot S\cdot\alpha_2=y_1\cdot V_1\cdot y_1\cdot y_2\cdot\alpha_2=V_1\cdot y_2\cdot\alpha_2
where y_1^2=1; thus, the refined objective function in this paragraph becomes:
=(1-S)\cdot\alpha_2+const+K_{11}\cdot r\cdot S\cdot\alpha_2-\frac12\cdot K_{11}\cdot S^2\cdot\alpha_2^2-\frac12\cdot K_{22}\cdot\alpha_2^2-S\cdot K_{12}\cdot r\cdot\alpha_2+S^2\cdot K_{12}\cdot\alpha_2^2+y_2\cdot V_1\cdot\alpha_2-y_2\cdot V_2\cdot\alpha_2
➆now, grouping terms, only \alpha_2^{new} and (\alpha_2^{new})^2 are left:
=\frac12\cdot(2\cdot K_{12}-K_{11}-K_{22})\cdot(\alpha_2^{new})^2+(1-S+S\cdot K_{11}\cdot r-S\cdot K_{12}\cdot r+y_2\cdot V_1-y_2\cdot V_2)\cdot\alpha_2^{new}+const

Relate \alpha_2^{new} Back To \alpha_2^{old}

The major spirit of SMO is to optimize 2 α's at a time, by transiting from the old α's to the new α's, more precisely, from \alpha_1^{old} to \alpha_1^{new} and from \alpha_2^{old} to \alpha_2^{new}. Now we focus on \alpha_2^{new}; later, by the relationship between these 2 α's:
\alpha_1+\alpha_2=r, for y_1=y_2
\alpha_1-\alpha_2=r, for y_1\neq y_2
Since the above equalities hold for both the old and the new α's, we can get \alpha_1^{new} after we get \alpha_2^{new}.

To link \alpha_2^{new} back to \alpha_2^{old}, we start the procedure below.
➀now, we focus on the coefficient of \alpha_2^{new} and expand r, V_1 and V_2:
1-S+S\cdot K_{11}\cdot r-S\cdot K_{12}\cdot r+y_2\cdot V_1-y_2\cdot V_2
=1-S+S\cdot K_{11}\cdot(\alpha_1^{old}+S\cdot\alpha_2^{old})-S\cdot K_{12}\cdot(\alpha_1^{old}+S\cdot\alpha_2^{old})+y_2\cdot(U_1^{old}+b^{old}-\alpha_1^{old}\cdot y_1\cdot K_{11}-\alpha_2^{old}\cdot y_2\cdot K_{12})-y_2\cdot(U_2^{old}+b^{old}-\alpha_1^{old}\cdot y_1\cdot K_{12}-\alpha_2^{old}\cdot y_2\cdot K_{22})
➁further expand:
=1-S+S\cdot K_{11}\cdot\alpha_1^{old}+S\cdot K_{11}\cdot S\cdot\alpha_2^{old}-S\cdot K_{12}\cdot\alpha_1^{old}-S\cdot K_{12}\cdot S\cdot\alpha_2^{old}+y_2\cdot U_1^{old}+y_2\cdot b^{old}-y_2\cdot\alpha_1^{old}\cdot y_1\cdot K_{11}-y_2\cdot\alpha_2^{old}\cdot y_2\cdot K_{12}-y_2\cdot U_2^{old}-y_2\cdot b^{old}+y_2\cdot\alpha_1^{old}\cdot y_1\cdot K_{12}+y_2\cdot\alpha_2^{old}\cdot y_2\cdot K_{22}
➂eliminate the cancelling pairs involving S\cdot K_{11}\cdot\alpha_1^{old}, S\cdot K_{12}\cdot\alpha_1^{old} and y_2\cdot b^{old} (note y_2\cdot y_1=S, so each appears once with a plus and once with a minus sign).
➃we obtain the equality below:
=1-S+(K_{11}-2\cdot K_{12}+K_{22})\cdot\alpha_2^{old}+y_2\cdot(U_1^{old}-U_2^{old})
, where U_j^{old}=(w^{old})^t\cdot x_j-b^{old}.
➄rewriting 1-S with terms related to y_2 might be helpful:
=(y_2)^2-y_1\cdot y_2+(K_{11}-2\cdot K_{12}+K_{22})\cdot\alpha_2^{old}+y_2\cdot(U_1^{old}-U_2^{old})
, where (y_2)^2=1 and y_1\cdot y_2=S.

Introduction Of η

Take a look at ➄ above: there exists a term (K_{11}-2\cdot K_{12}+K_{22}) quite similar to the term (2\cdot K_{12}-K_{11}-K_{22}), which is the coefficient of the term (\alpha_2^{new})^2.

Next, we further refine the objective function by introducing \eta.

Take \eta=2\cdot K_{12}-K_{11}-K_{22}; then ➄ above becomes:
=(y_2)^2-y_1\cdot y_2-\eta\cdot\alpha_2^{old}+y_2\cdot(U_1^{old}-U_2^{old})
=y_2\cdot(y_2-y_1+U_1^{old}-U_2^{old})-\eta\cdot\alpha_2^{old}
=y_2\cdot((U_1^{old}-y_1)-(U_2^{old}-y_2))-\eta\cdot\alpha_2^{old}
=y_2\cdot(E_1^{old}-E_2^{old})-\eta\cdot\alpha_2^{old}
where E_1^{old}=Err^{old}(x_1,y_1)=(w^{old})^t\cdot x_1-b^{old}-y_1,
and E_2^{old}=Err^{old}(x_2,y_2)=(w^{old})^t\cdot x_2-b^{old}-y_2.
They are just the errors produced under the old parameters w^{old}, b^{old}, with \alpha_1^{old} and \alpha_2^{old} in them.
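A one-line sketch of this error term, assuming numpy vectors w and x_i (the function name error is mine):

```python
def error(w, b, x_i, y_i):
    """E_i = (w^t x_i) - b - y_i, the prediction error under the current parameters."""
    return w @ x_i - b - y_i
```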

Now, we formulate our problem as an objective function of \eta, \alpha_2^{new}, \alpha_2^{old}, E_1^{old}, E_2^{old}:
L(w,b,\xi,\alpha,\mu)=L(\alpha_2^{new})=\frac12\cdot\eta\cdot(\alpha_2^{new})^2+(y_2\cdot(E_1^{old}-E_2^{old})-\eta\cdot\alpha_2^{old})\cdot\alpha_2^{new}+const

Transforming from L(w,b,\xi,\alpha,\mu) to L(\alpha_2^{new}) makes the optimization much simpler

, since only \alpha_2^{new} is left to be optimized, the computational cost is greatly reduced!!!

Now we come to the validity of \eta. To get the optimal value of \alpha_2^{new}:
➀\frac{\partial L}{\partial\alpha_2^{new}}=\eta\cdot\alpha_2^{new}+y_2\cdot(E_1^{old}-E_2^{old})-\eta\cdot\alpha_2^{old}=0
➁\frac{\partial^2 L}{\partial(\alpha_2^{new})^2}=\eta, where \eta=2\cdot K_{12}-K_{11}-K_{22}.

By ➀, which coincides with our departure point in the section "Relate \alpha_2^{new} Back To \alpha_2^{old}", we have:
\alpha_2^{new}=\alpha_2^{old}-\frac{y_2\cdot(E_1^{old}-E_2^{old})}\eta
Refining further, we have:
\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta
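As a minimal sketch of this update rule (the function and argument names are mine, and \eta<0 is assumed, as discussed in the next section):

```python
def unclipped_alpha2(alpha2_old, y2, E1_old, E2_old, eta):
    """alpha_2^{new} = alpha_2^{old} + y2 * (E2_old - E1_old) / eta, valid only when eta < 0."""
    return alpha2_old + y2 * (E2_old - E1_old) / eta
```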

\eta\leq0 is the validity of \eta

and it must hold that:
\eta=2\cdot K_{12}-K_{11}-K_{22}=-(x_1-x_2)^t\cdot(x_1-x_2)=-\left\|x_1-x_2\right\|^2\leq0

For computation, \eta should be less than 0

, although it might approach 0.
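A small sketch of how \eta could be computed for the linear kernel, which also shows why it can never be positive (the function name compute_eta is mine):

```python
import numpy as np

def compute_eta(x1, x2):
    """eta = 2*K12 - K11 - K22 = -(x1 - x2)^t (x1 - x2) <= 0 for the linear kernel."""
    return 2.0 * (x1 @ x2) - x1 @ x1 - x2 @ x2
```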

Feasible Range Of New \alpha Value

Suppose we are working under \eta<0, and this equality is correct:
\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta
Then the \alpha_2^{new} thus obtained might be the unconstrained maximum. The feasible range of \alpha_2^{new} must be checked.

We’d like to examine 4 cases, beginning from the 3 relations below:
S=y_1\cdot y_2.
\alpha_1^{old}+S\cdot \alpha_2^{old}=r=\alpha_1^{new}+S\cdot \alpha_2^{new}.
0\leq\alpha_i\leq C; this one is an inequality.

[Case 1]S=1, \alpha_1+\alpha_2=r, r>C, then, max(\alpha_2)=C, min(\alpha_2)=r-C
[Case 2]S=1, \alpha_1+\alpha_2=r, r<C, then, max(\alpha_2)=r, min(\alpha_2)=0
Case 1 and case 2 have y_1=y_2; we take the low threshold as max(0, r-C) and denote it L, and the high threshold as min(r, C) and denote it H.
[Case 3]S=-1, \alpha_1-\alpha_2=r, r>0, then, max(\alpha_2)=C-r, min(\alpha_2)=0
[Case 4]S=-1, \alpha_1-\alpha_2=r, r<0, then, max(\alpha_2)=C, min(\alpha_2)=-r
Case 3 and case 4 have y_1\neq y_2; we take the low threshold as max(0, -r) and denote it L, and the high threshold as min(C, C-r) and denote it H.

Take the minimum feasible value of \alpha_2^{new} to be L and the maximum feasible value of \alpha_2^{new} to be H, then:
\alpha_2^{new,clipped}=\left\{\begin{array}{l}L,\;if\;\alpha_2^{new}<L\\\alpha_2^{new},\;if\;L\leq\alpha_2^{new}\leq H\\H,\;if\;\alpha_2^{new}>H\end{array}\right.
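The 4 cases and the clipping rule above could be coded roughly as follows; this is a sketch, and the helper names feasible_bounds and clip are my own:

```python
def feasible_bounds(alpha1_old, alpha2_old, y1, y2, C):
    """Low/high thresholds L, H for alpha_2^{new} on the constraint line (cases 1-4)."""
    if y1 == y2:                      # S = +1, alpha_1 + alpha_2 = r
        r = alpha1_old + alpha2_old
        return max(0.0, r - C), min(r, C)
    else:                             # S = -1, alpha_1 - alpha_2 = r
        r = alpha1_old - alpha2_old
        return max(0.0, -r), min(C, C - r)

def clip(alpha2_new, L, H):
    """Clip the unconstrained optimum into the feasible interval [L, H]."""
    return max(L, min(alpha2_new, H))
```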

Clip New \alpha

Following the above section, in this paragraph I’d like to summarize the process to clip \alpha_2^{new}, and then \alpha_1^{new}.

Begin with the given y_1, y_2, K_{11}, K_{22}, K_{12}, E_1^{old}, E_2^{old}, where E_i^{old}=(w^{old})^t\cdot x_i-b^{old}-y_i.
➀take \eta=2\cdot K_{12}-K_{11}-K_{22}, then \eta\leq 0 must hold.

➁if \eta<0, by prior induction, we have:
\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta
Take \triangle\alpha_2=\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta
Please recall that \alpha_1=r-S\cdot \alpha_2 holds for both new and old. Therefore,
\lim_{\triangle\rightarrow0}\frac{(\alpha_1+\triangle\alpha_1)-\alpha_1}\triangle=\lim_{\triangle\rightarrow0}\frac{r-S\cdot(\alpha_2+\triangle)-(r-S\cdot\alpha_2)}\triangle
\Rightarrow\triangle\alpha_1=-S\cdot\lim_{\triangle\rightarrow0}\frac{(\alpha_2+\triangle)-\alpha_2}\triangle
\Rightarrow\triangle\alpha_1=-S\cdot\triangle\alpha_2
, where \triangle\alpha_1, \triangle\alpha_2 are the changes (differentials) of \alpha_1, \alpha_2.
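In code, once \triangle\alpha_2 is known, \alpha_1^{new} follows immediately (a sketch; the function name update_alpha1 is mine):

```python
def update_alpha1(alpha1_old, S, delta_alpha2):
    """alpha_1^{new} = alpha_1^{old} - S * delta_alpha2, since alpha_1 = r - S*alpha_2 for both old and new."""
    return alpha1_old - S * delta_alpha2
```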

➂if \eta=0, most SMO documents just recommend evaluating the objective function at the 2 end points L and H and setting \alpha_2^{new} to the one that makes the objective function larger; thus, \alpha_2^{new} would be closer to the boundary!!!

To recap, the objective function is now:
\begin{array}{l}L(\alpha_2^{new})\\=\frac12\cdot\eta\cdot(\alpha_2^{new})^2\\\;\;\;\;+(y_2\cdot(E_1^{old}-E_2^{old})-\eta\cdot\alpha_2^{old})\cdot\alpha_2^{new}\\\;\;\;\;+const\end{array}
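For the \eta=0 end-point evaluation recommended in ➂, a sketch could look like this; the const term is dropped since it does not affect the comparison, and the names are mine:

```python
def objective_at(a2, eta, y2, E1_old, E2_old, alpha2_old):
    """L(alpha_2) = 0.5*eta*a2^2 + (y2*(E1_old - E2_old) - eta*alpha2_old)*a2, with const dropped."""
    return 0.5 * eta * a2 ** 2 + (y2 * (E1_old - E2_old) - eta * alpha2_old) * a2

# when eta == 0, evaluate at both end points L and H and keep the larger one, e.g.:
# alpha2_new = L if objective_at(L, 0.0, y2, E1, E2, a2_old) > objective_at(H, 0.0, y2, E1, E2, a2_old) else H
```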

Where mjtsai thinks, for \eta=0, an alternative would be to abandon this pair and switch to the next set of 2 points, since the currently evaluated 2 points might just be duplicated or overlapped cases (Coursera M.L. professor Ng's SMO sample code does this).

As to the mirrored case, we should keep such pairs in a mutually exclusive list for later manipulation; once a new iteration would like to choose the next 2 candidates, 2 points that have already been paired should not be paired again (also suggested by mjtsai).

SMO Updating

We now have the concept that \alpha_1^{new}, \alpha_2^{new} are updated by \triangle\alpha_1, \triangle\alpha_2; this section would like to guide you through updating E_i, F_i, w and b.

[Updating b]
Take E(x,y)=\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i^t\cdot x-y-b to be the prediction error; then:
\triangle E(x,y)=\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x-\triangle b holds as the change in E.

If 0<\alpha_1^{new}<C, then we can treat the new value of E_1 as zero, that is E_1^{new}=0, since it is not the boundary case. Recall that in KKT case 2, which is not at the boundary, we have R_i=y_i\cdot E_i=0. Thus, we can reinforce the new value of E to be 0.

E(x,y)
=E(x,y)^{old}+\triangle E(x,y)
=E(x,y)^{old}+\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x-\triangle b
=0

Then, we have \triangle b in the expression below:
\triangle b=E(x,y)^{old}+\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x

Take \triangle b=b^{new}-b^{old}, then:
b^{new}=E(x,y)^{old}+\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x+b^{old}

We can further deduce that:
b_1^{new}=E(x_1,y_1)^{old}+\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x_1+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x_1+b^{old}
b_2^{new}=E(x_2,y_2)^{old}+\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x_2+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x_2+b^{old}

where
\triangle b_1=b_1^{new}-b^{old}
\triangle b_2=b_2^{new}-b^{old}
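Both b_1^{new} and b_2^{new} above follow the same pattern, so a single hypothetical helper could compute either (a sketch; the name new_bias is my own):

```python
def new_bias(E_old, d_a1, y1, d_a2, y2, k_1i, k_2i, b_old):
    """b_i^{new} = E(x_i, y_i)^{old} + d_a1*y1*K(x1, x_i) + d_a2*y2*K(x2, x_i) + b^{old}."""
    return E_old + d_a1 * y1 * k_1i + d_a2 * y2 * k_2i + b_old

# b1_new = new_bias(E1_old, d_a1, y1, d_a2, y2, K11, K12, b_old)
# b2_new = new_bias(E2_old, d_a1, y1, d_a2, y2, K12, K22, b_old)
```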
➀when \alpha_1^{new} (and/or \alpha_2^{new}) is not at the boundary, 0<\alpha_1^{new},\alpha_2^{new}<C, by the above deduction we have E(x_i,y_i)^{new}=0 for i=1,2, that is to say:
R_i=y_i\cdot E_i^{new}=y_i\cdot ((w^{new})^t\cdot x_i-b-y_i)\approx0
This holds for y_i=\pm1; therefore, there should be no change in b, and we can treat \triangle b=0.
\triangle b=0
\;\;\;\;=b_1^{new}-b^{old}
\;\;\;\;=b_2^{new}-b^{old}, further proof of this:
let b_1^{new}, b_2^{new} be the bias terms after the iteration
(w^{new})^t\cdot x_1-b_1^{new}-y_1\approx0
(w^{new})^t\cdot x_2-b_2^{new}-y_2\approx0, where the non-boundary case guarantees:

\left|(w^{new})^t\cdot x_i-y_i\right|\leq\varepsilon

, for i=1,2, and \varepsilon is a rather tiny quantity.
therefore, \varepsilon-b_1^{new}\approx\varepsilon-b_2^{new}
hence, we have b_1^{new}\approx b_2^{new} as the result.

➁when \alpha_1^{new}, \alpha_2^{new} are both at boundaries, but different ones, one at 0 and one at C, where L\neq H, then the interval between b_1^{new} and b_2^{new} is constrained by KKT case 1 and case 3, respectively.

Recall that KKT case 1 has \alpha_i=0, R_i\geq0, and KKT case 3 has \alpha_i=C, R_i\leq0; their intersection with KKT case 2 is the equality R_i\approx0. To reinforce these 2 points entering into the support vector, we can just set E(x,y)^{new}=0. Therefore, we take b^{new}=\frac{b_1^{new}+b_2^{new}}{2}, so that the next b^{new} would be stable; the very next time, when the currently evaluated or other points are iterated over, it would be much easier for them to move toward the boundary than now.

While updating b, we also update E_i along the way.

[Updating F_i]
We have it that:
F(x,y)
=w^t\cdot x-y
=\sum_{i=1}^{n}\alpha_i\cdot y_i\cdot x_i^t\cdot x-y

Trivially, we also have it that:
\triangle F(x,y)
=\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x
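A sketch of this \triangle F update, assuming numpy vectors x_1, x_2, x (the function name delta_F is mine):

```python
def delta_F(d_a1, y1, x1, d_a2, y2, x2, x):
    """delta F(x, y) = d_a1*y1*(x1^t x) + d_a2*y2*(x2^t x)."""
    return d_a1 * y1 * (x1 @ x) + d_a2 * y2 * (x2 @ x)
```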

[Updating w]
Under the assumption that we are using a linear kernel,
begin from w=\sum_{i=1}^{n}\alpha_i\cdot y_i\cdot x_i,
we can have it that:
\triangle w=\triangle\alpha_1\cdot y_1\cdot x_1+\triangle\alpha_2\cdot y_2\cdot x_2
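Putting the pieces above together, a hedged end-to-end sketch of one SMO pair update with a linear kernel might look like the following; the function name smo_pair_update and the order of the b-selection branches are my own choices (kept consistent with ➀ and ➁ in the updating section), not a verbatim transcription of the post:

```python
import numpy as np

def smo_pair_update(i, j, alpha, X, y, b, C):
    """One SMO pair update (linear kernel), combining the formulas derived above."""
    x1, x2, y1, y2 = X[i], X[j], y[i], y[j]
    S = y1 * y2
    w = (alpha * y) @ X                               # w = sum_i alpha_i * y_i * x_i
    E1 = w @ x1 - b - y1                              # E_1^{old}
    E2 = w @ x2 - b - y2                              # E_2^{old}
    K11, K22, K12 = x1 @ x1, x2 @ x2, x1 @ x2
    eta = 2.0 * K12 - K11 - K22                       # eta <= 0 must hold
    if eta >= 0:                                      # degenerate pair: skip it here
        return alpha, b                               # (the end-point evaluation from ➂ could be used instead)

    # feasible range of alpha_2^{new} (cases 1-4)
    if y1 == y2:
        r = alpha[i] + alpha[j]
        L, H = max(0.0, r - C), min(r, C)
    else:
        r = alpha[i] - alpha[j]
        L, H = max(0.0, -r), min(C, C - r)
    if L >= H:
        return alpha, b

    a2_new = np.clip(alpha[j] + y2 * (E2 - E1) / eta, L, H)
    d_a2 = a2_new - alpha[j]
    d_a1 = -S * d_a2                                  # delta_alpha1 = -S * delta_alpha2
    a1_new = alpha[i] + d_a1

    # bias candidates b_1^{new}, b_2^{new}; average them when both alphas sit at bounds
    b1_new = E1 + d_a1 * y1 * K11 + d_a2 * y2 * K12 + b
    b2_new = E2 + d_a1 * y1 * K12 + d_a2 * y2 * K22 + b
    if 0.0 < a1_new < C:
        b_new = b1_new
    elif 0.0 < a2_new < C:
        b_new = b2_new
    else:
        b_new = 0.5 * (b1_new + b2_new)

    alpha = alpha.copy()
    alpha[i], alpha[j] = a1_new, a2_new
    return alpha, b_new
```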