Inside Sequential Minimal Optimization
07 Dec 2017
Departure With Noise
We’d like to begin from the imperfect separation, since no sample set is fully clean and perfectly separable. Given the support-vector problem formulated in terms of \alpha_i, y_i, x_i, the objective function is:
L(w,b,\xi,\alpha,\mu)=\sum_{i=1}^n\alpha_i-\frac12\cdot\sum_{i,j=1}^n\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j
for all i, 0<\alpha_i<C, the non-boundary case.
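As a sanity check of the notation, below is a minimal numpy sketch (the function name and toy data are mine, assuming a linear kernel) that evaluates this dual objective for a given \alpha:

```python
import numpy as np

def dual_objective(alpha, y, X):
    """L(alpha) = sum_i alpha_i - 1/2 * sum_{i,j} alpha_i*alpha_j*y_i*y_j*x_i^t*x_j."""
    K = X @ X.T                                   # Gram matrix, K[i, j] = x_i^t x_j
    v = alpha * y                                 # element-wise alpha_i * y_i
    return alpha.sum() - 0.5 * v @ K @ v

# toy example: 4 samples, labels +/-1, alpha kept inside the box [0, C]
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.3, 0.2, 0.4, 0.1])
print(dual_objective(alpha, y, X))
```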
Works On 2 \alpha’s At A Time
The SMO algorithm works by manipulating 2 \alpha's at a time (with all the others fixed), a hill-climbing-like approach. The 2 \alpha's at each step are chosen by heuristics.
At the beginning of the whole flow, initialize \alpha_i=0 for i=1 to n. Suppose we randomly choose \alpha_1, \alpha_2 and denote them as \alpha_1^{old}, \alpha_2^{old}.
Due to \sum_{i=1}^n\alpha_i\cdot y_i=0, we therefore have:
y_1\cdot\alpha_1+y_2\cdot\alpha_2=y_1\cdot\alpha_1^{old}+y_2\cdot\alpha_2^{old}
This confines the optimization of \alpha_1, \alpha_2 to a line. Next, we examine the relation between the old and new values of these 2 \alpha's.
➀for y_1\neq y_2, \alpha_1-\alpha_2=r, where r is a constant.
➁for y_1=y_2, \alpha_1+\alpha_2=r.
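As a quick worked instance (the labels here are chosen by me purely for illustration): with y_1=+1 and y_2=-1 we are in case ➀, so

\alpha_1-\alpha_2=\alpha_1^{old}-\alpha_2^{old}=r

and increasing \alpha_2 by some \delta forces \alpha_1 to increase by the same \delta to stay on that line; with y_1=y_2 the two \alpha's move in opposite directions instead.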
Take S=y_1\cdot y_2 and multiply y_1\cdot\alpha_1+y_2\cdot\alpha_2=const by y_1; then
\alpha_1+S\cdot\alpha_2=const, and we have:
\alpha_1=const-S\cdot\alpha_2,
where \alpha_1+S\cdot\alpha_2=\alpha_1^{old}+S\cdot\alpha_2^{old}. Next, to optimize \alpha_1, \alpha_2 with all the other \alpha's fixed, the objective function can be rewritten as:
L(w,b,\xi,\alpha,\mu)=\alpha_1+\alpha_2+const-\frac12\cdot(part1+part2)
➀we further express part1:
part1=\sum_{i,j=1,2}\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j=\alpha_1^2\cdot y_1^2\cdot x_1^t\cdot x_1+\alpha_2^2\cdot y_2^2\cdot x_2^t\cdot x_2+2\cdot\alpha_1\cdot\alpha_2\cdot y_1\cdot y_2\cdot x_1^t\cdot x_2
➁we express part2 as:
part2=\sum_{i=3,\dots,n;\;j=1,2}\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j+\sum_{i=1,2;\;j=3,\dots,n}\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot x_i^t\cdot x_j=2\cdot(\sum_{i=3}^n\alpha_i\cdot y_i\cdot x_i^t)\cdot(\alpha_1\cdot y_1\cdot x_1+\alpha_2\cdot y_2\cdot x_2)
Express \alpha_1 In Terms Of \alpha_2
Next, we express \alpha_1 in terms of \alpha_2. This is a quite painful and confusing process; it took me some while, so now I’d like to share my viewpoint on each step of the deduction.
From the current objective function, one symptom can be found: x_i is always associated with x_j.
Thus, we take K_{11}=x_1^t\cdot x_1, K_{22}=x_2^t\cdot x_2, K_{12}=x_1^t\cdot x_2. Recall that the term V_j below is deduced by removing the old \alpha_1, \alpha_2 associated terms from w; here comes the question: can we relate the old w to the newly evolved \alpha_1, \alpha_2?
V_j=\sum_{i=3}^n\alpha_i\cdot y_i\cdot x_i^t\cdot x_j
=(w^{old})^t\cdot x_j-\alpha_1^{old}\cdot y_1\cdot x_1^t\cdot x_j-\alpha_2^{old}\cdot y_2\cdot x_2^t\cdot x_j
=(w^{old})^t\cdot x_j-b^{old}+b^{old}-\alpha_1^{old}\cdot y_1\cdot x_1^t\cdot x_j-\alpha_2^{old}\cdot y_2\cdot x_2^t\cdot x_j
=U_j+b^{old}-\alpha_1^{old}\cdot y_1\cdot x_1^t\cdot x_j-\alpha_2^{old}\cdot y_2\cdot x_2^t\cdot x_j
where U_j=(w^{old})^t\cdot x_j-b^{old} is the output of x_j under the old parameters w^{old}, b^{old}. Expressing V_j in this way gives us hope to refine the original objective function.
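To make the identity above concrete, here is a small numpy sketch (the function names and the index convention, with \alpha_1, \alpha_2 stored at positions 0 and 1, are my own assumptions for a linear kernel): it computes V_j both directly and through the old output U_j, and the two results coincide.

```python
import numpy as np

def V_direct(alpha_old, y, X, j):
    """V_j = sum over i >= 3 of alpha_i * y_i * x_i^t x_j (indices 2..n-1 here)."""
    return sum(alpha_old[i] * y[i] * (X[i] @ X[j]) for i in range(2, len(y)))

def V_via_old_output(alpha_old, y, X, j, b_old):
    """Same V_j, but recovered from U_j = (w_old)^t x_j - b_old."""
    w_old = (alpha_old * y) @ X                   # w under the old alphas (linear kernel)
    U_j = w_old @ X[j] - b_old                    # old output on x_j
    return (U_j + b_old
            - alpha_old[0] * y[0] * (X[0] @ X[j])
            - alpha_old[1] * y[1] * (X[1] @ X[j]))
```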
To make the notation clear: the subscript of a term indicates its index, usually i, j or 1, 2… in this proof, while the superscript identifies whether the term is the old or the new (clipped) one.
For simplicity of the deduction and its understanding, if there is no word like old or new in the superscript, the term is treated as a new term. Why is the default the new version? Because SMO by design clips 2 new \alpha's at a time, so the objective function will be refined into an expression of the 2 new \alpha's.
➀deduce by substituting the above terms into the objective function:
L(w,b,\xi,\alpha,\mu)=\alpha_1+\alpha_2+const-\frac12\cdot(K_{11}\cdot\alpha_1^2+K_{22}\cdot\alpha_2^2+2\cdot S\cdot K_{12}\cdot\alpha_1\cdot\alpha_2+2\cdot\alpha_1\cdot y_1\cdot V_1+2\cdot\alpha_2\cdot y_2\cdot V_2)
➁next, replace \alpha_1 with \alpha_2; more precisely, substitute the new \alpha_1=r-S\cdot\alpha_2:
=(r-S\cdot\alpha_2)+\alpha_2+const-\frac12\cdot(K_{11}\cdot(r-S\cdot\alpha_2)^2+K_{22}\cdot\alpha_2^2+2\cdot S\cdot K_{12}\cdot(r-S\cdot\alpha_2)\cdot\alpha_2+2\cdot y_1\cdot V_1\cdot(r-S\cdot\alpha_2)+2\cdot\alpha_2\cdot y_2\cdot V_2)
➂toss out the leading r in the first term (r-S\cdot\alpha_2) (or rather, fold it into const), since it is just a constant with no \alpha_2 attached to it; the r's in the remaining terms are retained:
=(1-S)\cdot\alpha_2+const-\frac12\cdot(K_{11}\cdot(r-S\cdot\alpha_2)^2+K_{22}\cdot\alpha_2^2+2\cdot S\cdot K_{12}\cdot(r-S\cdot\alpha_2)\cdot\alpha_2+2\cdot y_1\cdot V_1\cdot(r-S\cdot\alpha_2)+2\cdot\alpha_2\cdot y_2\cdot V_2)
➃expand in terms of \alpha_2:
=(1-S)\cdot\alpha_2+const-\frac12\cdot K_{11}\cdot r^2+K_{11}\cdot r\cdot S\cdot\alpha_2-\frac12\cdot K_{11}\cdot S^2\cdot\alpha_2^2-\frac12\cdot K_{22}\cdot\alpha_2^2-S\cdot K_{12}\cdot r\cdot\alpha_2+S^2\cdot K_{12}\cdot\alpha_2^2-y_1\cdot V_1\cdot r+y_1\cdot V_1\cdot S\cdot\alpha_2-y_2\cdot V_2\cdot\alpha_2
➄since we’d like to express the objective function purely in terms of \alpha_2, it is a good idea to fold the \alpha_2-free terms into const, noting S^2=1; the terms -\frac12\cdot K_{11}\cdot r^2 and -y_1\cdot V_1\cdot r are thus tossed out (absorbed into const):
=(1-S)\cdot\alpha_2+const+K_{11}\cdot r\cdot S\cdot\alpha_2-\frac12\cdot K_{11}\cdot S^2\cdot\alpha_2^2-\frac12\cdot K_{22}\cdot\alpha_2^2-S\cdot K_{12}\cdot r\cdot\alpha_2+S^2\cdot K_{12}\cdot\alpha_2^2+y_1\cdot V_1\cdot S\cdot\alpha_2-y_2\cdot V_2\cdot\alpha_2
➅it seems that the objective function has been well refined with only \alpha_2 in it; but take a look at the term y_1\cdot V_1\cdot S\cdot\alpha_2: it contains both y_1 and \alpha_2, whereas y_1 is the label paired with \alpha_1, and \alpha_2 should be associated with y_2. Here the factor S=y_1\cdot y_2 does the trick:
y_1\cdot V_1\cdot S\cdot\alpha_2=y_1\cdot V_1\cdot y_1\cdot y_2\cdot\alpha_2=V_1\cdot y_2\cdot\alpha_2
where y_1^2=1; thus, the refined objective function at this point would be:
=(1-S)\cdot\alpha_2+const+K_{11}\cdot r\cdot S\cdot\alpha_2-\frac12\cdot K_{11}\cdot S^2\cdot\alpha_2^2-\frac12\cdot K_{22}\cdot\alpha_2^2-S\cdot K_{12}\cdot r\cdot\alpha_2+S^2\cdot K_{12}\cdot\alpha_2^2+y_2\cdot V_1\cdot\alpha_2-y_2\cdot V_2\cdot\alpha_2
➆now, collecting terms, only \alpha_2^{new} and (\alpha_2^{new})^2 are left:
=\frac12\cdot(2K_{12}-K_{11}-K_{22})\cdot(\alpha_2^{new})^2+(1-S+S\cdot K_{11}\cdot r-S\cdot K_{12}\cdot r+y_2\cdot V_1-y_2\cdot V_2)\cdot\alpha_2^{new}
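For readers who want to see the quadratic coefficient collected explicitly (a small worked step of my own, using S^2=1):

-\frac12\cdot K_{11}\cdot S^2-\frac12\cdot K_{22}+S^2\cdot K_{12}=-\frac12\cdot K_{11}-\frac12\cdot K_{22}+K_{12}=\frac12\cdot(2K_{12}-K_{11}-K_{22})

which is exactly the factor multiplying (\alpha_2^{new})^2 above.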
Relate \alpha_2^{new} Back To \alpha_2^{old}
The major spirit of SMO is to optimize 2 \alpha's at a time, by transiting from the old \alpha's to the new \alpha's; more precisely, from \alpha_1^{old} to \alpha_1^{new}, and from \alpha_2^{old} to \alpha_2^{new}. Now we focus on \alpha_2^{new}; later, by the relationship between these 2 \alpha's:
➀\alpha_1+\alpha_2=r
➁\alpha_1-\alpha_2=r
the above equalities hold for both the old and the new \alpha's, so we can get \alpha_1^{new} after we get \alpha_2^{new}. To link \alpha_2^{new} back to \alpha_2^{old}, we start the procedure below.
➀now, we focus on the coefficient of \alpha_2^{new}, expanding r, V_1 and V_2:
1-S+S\cdot K_{11}\cdot r-S\cdot K_{12}\cdot r+y_2\cdot V_1-y_2\cdot V_2
=1-S+S\cdot K_{11}\cdot(\alpha_1^{old}+S\cdot\alpha_2^{old})-S\cdot K_{12}\cdot(\alpha_1^{old}+S\cdot\alpha_2^{old})+y_2\cdot(U_1^{old}+b^{old}-\alpha_1^{old}\cdot y_1\cdot K_{11}-\alpha_2^{old}\cdot y_2\cdot K_{12})-y_2\cdot(U_2^{old}+b^{old}-\alpha_1^{old}\cdot y_1\cdot K_{12}-\alpha_2^{old}\cdot y_2\cdot K_{22})
➁further expand:
=1-S+S\cdot K_{11}\cdot\alpha_1^{old}+S\cdot K_{11}\cdot S\cdot\alpha_2^{old}-S\cdot K_{12}\cdot\alpha_1^{old}-S\cdot K_{12}\cdot S\cdot\alpha_2^{old}+y_2\cdot U_1^{old}+y_2\cdot b^{old}-y_2\cdot\alpha_1^{old}\cdot y_1\cdot K_{11}-y_2\cdot\alpha_2^{old}\cdot y_2\cdot K_{12}-y_2\cdot U_2^{old}-y_2\cdot b^{old}+y_2\cdot\alpha_1^{old}\cdot y_1\cdot K_{12}+y_2\cdot\alpha_2^{old}\cdot y_2\cdot K_{22}
➂the terms S\cdot K_{11}\cdot\alpha_1^{old} and -y_2\cdot\alpha_1^{old}\cdot y_1\cdot K_{11} cancel (since y_2\cdot y_1=S), likewise -S\cdot K_{12}\cdot\alpha_1^{old} and +y_2\cdot\alpha_1^{old}\cdot y_1\cdot K_{12}, as do y_2\cdot b^{old} and -y_2\cdot b^{old}.
➃we obtain the equality below:
=1-S+(K_{11}-2\cdot K_{12}+K_{22})\cdot\alpha_2^{old}+y_2\cdot(U_1^{old}-U_2^{old})
, where U_j^{old}=(w^{old})^t\cdot x_j-b^{old}.
➄rewriting the leading constants in terms related to y_2 might be helpful:
=(y_2)^2-y_1\cdot y_2+(K_{11}-2\cdot K_{12}+K_{22})\cdot\alpha_2^{old}+y_2\cdot(U_1^{old}-U_2^{old})
, where (y_2)^2=1 and y_1\cdot y_2=S.
Introduction Of \eta
Take a look at ➄ above: there exists a term (K_{11}-2\cdot K_{12}+K_{22}), quite similar to the term (2K_{12}-K_{11}-K_{22}) that appears in the coefficient of (\alpha_2^{new})^2.
Next, we further refine the objective function by introducing \eta.
Take \eta=2K_{12}-K_{11}-K_{22}; then, ➄ above becomes:
=(y_2)^2-y_1\cdot y_2-\eta\cdot\alpha_2^{old}+y_2\cdot(U_1^{old}-U_2^{old})
=y_2\cdot(y_2-y_1+U_1^{old}-U_2^{old})-\eta\cdot\alpha_2^{old}
=y_2\cdot(U_1^{old}-y_1-(U_2^{old}-y_2))-\eta\cdot\alpha_2^{old}
=y_2\cdot(E_1^{old}-E_2^{old})-\eta\cdot\alpha_2^{old}
where E_1^{old}=Err^{old}(x_1,y_1)=(w^{old})^t\cdot x_1-b^{old}-y_1,
and E_2^{old}=Err^{old}(x_2,y_2)=(w^{old})^t\cdot x_2-b^{old}-y_2.
They are just the prediction errors under w^{old}, with \alpha_1^{old} and \alpha_2^{old} still in it. Now, we formulate our problem as an objective function of \eta, \alpha_2^{new}, \alpha_2^{old}, E_1^{old}, E_2^{old}:
Transforming from L(w,b,\xi,\alpha,\mu) to L(\alpha_2^{new}) makes the optimization much simpler:
L(w,b,\xi,\alpha,\mu)=L(\alpha_2^{new})=\frac12\cdot\eta\cdot(\alpha_2^{new})^2+(y_2\cdot(E_1^{old}-E_2^{old})-\eta\cdot\alpha_2^{old})\cdot\alpha_2^{new}+const
Since only \alpha_2^{new} is left to be optimized, the computational cost is greatly reduced!!!
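A minimal Python sketch of this reduced one-variable objective (the function and argument names are my own; the constant term is dropped since it does not affect the optimum):

```python
def reduced_objective(a2_new, eta, alpha2_old, y2, E1_old, E2_old):
    """L(alpha_2^new) = 1/2*eta*a2^2 + (y2*(E1-E2) - eta*alpha2_old)*a2, up to a constant."""
    coef = y2 * (E1_old - E2_old) - eta * alpha2_old
    return 0.5 * eta * a2_new ** 2 + coef * a2_new
```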
Now we come to the validity of \eta, to get the optimal value of \alpha_2^{new}:
➀\frac{\partial L}{\partial\alpha_2^{new}}=\eta\cdot\alpha_2^{new}+y_2\cdot(E_1^{old}-E_2^{old})-\eta\cdot\alpha_2^{old}=0
➁\frac{\partial^2 L}{\partial(\alpha_2^{new})^2}=\eta, where \eta=2K_{12}-K_{11}-K_{22}. Equation ➀ coincides with our departure point in the section Relate \alpha_2^{new} Back To \alpha_2^{old}; we just make the point. We have:
\eta\leq0 as the validity condition of \eta (so that the stationary point is a maximum), and
\alpha_2^{new}=\alpha_2^{old}-\frac{y_2\cdot(E_1^{old}-E_2^{old})}\eta
Refining further, we can have:
\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta
and the following must hold: for the computation, \eta should be less than 0, where
\eta=2\cdot K_{12}-K_{11}-K_{22}=-\left\|x_1-x_2\right\|^2\leq0, although it might approach 0.
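A small numpy check (variable names and toy vectors are mine) that \eta computed from the kernel entries indeed equals -\left\|x_1-x_2\right\|^2, together with the unclipped update of \alpha_2:

```python
import numpy as np

def compute_eta(x1, x2):
    """eta = 2*K12 - K11 - K22, which equals -||x1 - x2||^2 for a linear kernel."""
    K11, K22, K12 = x1 @ x1, x2 @ x2, x1 @ x2
    return 2.0 * K12 - K11 - K22

def alpha2_unclipped(alpha2_old, y2, E1_old, E2_old, eta):
    """alpha_2^new = alpha_2^old + y2*(E2_old - E1_old)/eta, valid only when eta < 0."""
    return alpha2_old + y2 * (E2_old - E1_old) / eta

x1, x2 = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(compute_eta(x1, x2), -np.linalg.norm(x1 - x2) ** 2)   # the two values agree
```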
Feasible Range Of New \alpha Value
Suppose we are working under \eta<0, so that the equality below is correct:
\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta
Then, the \alpha_2^{new} thus obtained might be an unconstrained maximum. The feasible range of \alpha_2^{new} must be checked. We’d like to examine 4 cases, beginning from the 3 relations below:
➀S=y_1\cdot y_2.
➁\alpha_1^{old}+S\cdot \alpha_2^{old}=r=\alpha_1^{new}+S\cdot \alpha_2^{new}.
➂0\leq\alpha_i\leq C; note this one is an inequality.
[Case 1]S=1, \alpha_1+\alpha_2=r, r>C, then, max(\alpha_2)=C, min(\alpha_2)=r-C
[Case 2]S=1, \alpha_1+\alpha_2=r, r<C, then, max(\alpha_2)=r, min(\alpha_2)=0
Since case 1 and case 2 both have y_1=y_2, we take the low threshold to be max(0, r-C), denoted L, and the high threshold to be min(r, C), denoted H.
[Case 3]S=-1, \alpha_1-\alpha_2=r, r>0, then, max(\alpha_2)=C-r, min(\alpha_2)=0
[Case 4]S=-1, \alpha_1-\alpha_2=r, r<0, then, max(\alpha_2)=C, min(\alpha_2)=-r
Since case 3 and case 4 both have y_1\neq y_2, we take the low threshold to be max(0, -r), denoted L, and the high threshold to be min(C, C-r), denoted H.
Take the minimum feasible value of \alpha_2^{new} to be L and the maximum feasible value to be H; then:
\alpha_2^{new,clipped}=\left\{\begin{array}{l}L,if\;\alpha_2^{new}<L\\\alpha_2^{new},if\;L<\alpha_2^{new}<H\\H,if\;\alpha_2^{new}>H\end{array}\right.
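Below is a minimal Python sketch of the 4 cases and the clipping rule (the function names are mine; C is the box constant):

```python
def feasible_range(alpha1_old, alpha2_old, y1, y2, C):
    """Return (L, H), the feasible range of alpha_2^new, following the 4 cases above."""
    if y1 == y2:                              # cases 1 and 2: alpha_1 + alpha_2 = r
        r = alpha1_old + alpha2_old
        return max(0.0, r - C), min(r, C)
    else:                                     # cases 3 and 4: alpha_1 - alpha_2 = r
        r = alpha1_old - alpha2_old
        return max(0.0, -r), min(C, C - r)

def clip(alpha2_new, L, H):
    """alpha_2^{new,clipped}: push the unconstrained optimum back into [L, H]."""
    return max(L, min(H, alpha2_new))
```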
Clip New \alpha
Following the above section, in this paragraph I’d like to summarize the process of clipping \alpha_2^{new}, and then \alpha_1^{new}.
Begin with the given y_1, y_2, K_{11}, K_{22}, K_{12}, E_1^{old}, E_2^{old}, where E_i^{old}=(w^{old})^t\cdot x_i-b^{old}-y_i.
➀take \eta=2\cdot K_{12}-K_{11}-K_{22}; then \eta\leq 0 must hold.
➁if \eta<0, by the prior deduction, we have:
\alpha_2^{new}=\alpha_2^{old}+\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta
Take \triangle\alpha_2=\frac{y_2\cdot(E_2^{old}-E_1^{old})}\eta
Please recall that \alpha_1=r-S\cdot\alpha_2, which holds for both the new and the old values. Therefore,
\triangle\alpha_1=\alpha_1^{new}-\alpha_1^{old}=(r-S\cdot\alpha_2^{new})-(r-S\cdot\alpha_2^{old})
\Rightarrow\triangle\alpha_1=-S\cdot(\alpha_2^{new}-\alpha_2^{old})
\Rightarrow\triangle\alpha_1=-S\cdot\triangle\alpha_2
, where \triangle\alpha_1, \triangle\alpha_2 are the changes in \alpha_1, \alpha_2.
➂if \eta=0, most SMO documents just recommend evaluating the objective function at the 2 end points and setting \alpha_2^{new} to the one that gives the larger objective value; thus \alpha_2^{new} would be closer to the boundary (a sketch of this end-point check appears at the end of this section)!!!
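Putting ➁ and the \triangle\alpha relation together, here is a minimal sketch (names mine) of propagating the clipped \alpha_2 back to \alpha_1:

```python
def propagate_to_alpha1(alpha1_old, alpha2_old, alpha2_new_clipped, y1, y2):
    """alpha_1^new = alpha_1^old + d_alpha1, with d_alpha1 = -S * d_alpha2 and S = y1*y2."""
    S = y1 * y2
    d_alpha2 = alpha2_new_clipped - alpha2_old
    d_alpha1 = -S * d_alpha2
    return alpha1_old + d_alpha1
```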
Recap that the objective function is now:
\begin{array}{l}L(\alpha_2^{new})\\=\frac12\cdot\eta\cdot(\alpha_2^{new})^2\\\;\;\;\;+(y_2\cdot(E_1^{old}-E_2^{old})-\eta\cdot\alpha_2^{old})\cdot\alpha_2^{new}\\\;\;\;\;+const\end{array}
Where mjtsai thinks that, for \eta=0, an alternative would be to abandon this pair and switch to the next set of 2 points, since the 2 points currently evaluated might just be duplicated or overlapping cases (Coursera M.L. professor Ng’s SMO sample code does this).
As to the mirrored case, we should keep them in a mutually exclusive list for later manipulation: once a new iteration chooses its 2 new candidates, 2 points that have already been paired should not be paired again (also by mjtsai).
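A minimal sketch of the end-point check for the degenerate \eta=0 case mentioned in ➂ (names mine; the constant term of the objective is dropped):

```python
def alpha2_when_eta_is_zero(L, H, eta, alpha2_old, y2, E1_old, E2_old):
    """Evaluate L(alpha_2^new) at both ends of [L, H] and keep the end with the larger value."""
    coef = y2 * (E1_old - E2_old) - eta * alpha2_old     # linear coefficient from the recap above
    objective = lambda a2: 0.5 * eta * a2 ** 2 + coef * a2
    return L if objective(L) > objective(H) else H
```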
SMO Updating
We now have the concept that \alpha_1^{new}, \alpha_2^{new} are updated by \triangle\alpha_1, \triangle\alpha_2; this section guides you through updating E_i, F_i, w and b.
[Updating b]
Take E(x,y)=\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i^t\cdot x-y-b to be the prediction error:
\triangle E(x,y)=\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x-\triangle b
holds as the change in E. If 0<\alpha_1^{new}<C, then we can treat the new value of E_1 as zero, that is E_1^{new}=0, since it is not a boundary case. Recall that KKT case 2 is the non-boundary case, where we have R_i=y_i\cdot E_i=0. Thus, we can reinforce the new value of E to be 0.
E(x,y)
=E(x,y)^{old}+\triangle E(x,y)
=E(x,y)^{old}+\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x-\triangle b
=0
Then, we have \triangle b in the expression below:
\triangle b=E(x,y)^{old}+\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x
Take \triangle b=b^{new}-b^{old}; then:
b^{new}=E(x,y)^{old}+\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x+b^{old}
We can further deduce that:
b_1^{new}=E(x_1,y_1)^{old}+\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x_1+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x_1+b^{old}
b_2^{new}=E(x_2,y_2)^{old}+\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x_2+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x_2+b^{old}
where
\triangle b_1=b_1^{new}-b^{old}
\triangle b_2=b_2^{new}-b^{old}
➀when \alpha_1^{new} (and/or \alpha_2^{new}) is not at the boundary, 0<\alpha_1^{new},\alpha_2^{new}<C, by the above deduction we have E(x_i,y_i)^{new}=0, for i=1,2, that is to say:
R_i=y_i\cdot E_i^{new}=y_i\cdot ((w^{new})^t\cdot x_i-b-y_i)\approx0
holds for y_i=\pm1; therefore, there should be no change in b, and we can treat \triangle b=0.
\triangle b=0
\;\;\;\;=b_1^{new}-b^{old}
\;\;\;\;=b_2^{new}-b^{old}
A further proof of this: let b_1^{new}, b_2^{new} be the bias terms after this iteration; then
(w^{new})^t\cdot x_1-b_1^{new}-y_1\approx0
(w^{new})^t\cdot x_2-b_2^{new}-y_2\approx0
where the non-boundary case guarantees \left|(w^{new})^t\cdot x_i-y_i\right|\leq\varepsilon for i=1,2, and \varepsilon is a rather tiny quantity.
therefore, \varepsilon-b_1^{new}\approx\varepsilon-b_2^{new}
hence, we have b_1^{new}\approx b_2^{new} as the result.
➁when \alpha_1^{new}, \alpha_2^{new} are both at the boundary but at different ends, one at 0 and one at C, where L\neq H, then the interval between b_1^{new} and b_2^{new} is constrained by KKT cases 1 and 3, respectively.
Recall that KKT case 1 has \alpha_i=0, R_i\geq0, and KKT case 3 has \alpha_i=C, R_i\leq0; their intersection with KKT case 2 is the equality R_i\approx0. To reinforce these 2 points entering the support vector set, we can just take E(x,y)^{new}=0. Therefore, we take b^{new}=\frac{b_1^{new}+b_2^{new}}{2}, so that the next b^{new} is stable; the very next time, for the currently evaluated or other points being iterated over, it would be much easier to move toward the boundary than now.
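A minimal sketch of this bias update (names mine; x1, x2 are numpy vectors under a linear kernel, C is the box constant):

```python
def new_bias(b_old, E1_old, E2_old, d_alpha1, d_alpha2,
             x1, x2, y1, y2, alpha1_new, alpha2_new, C):
    """Build b1_new, b2_new as above, then pick b_new according to the boundary cases."""
    b1_new = E1_old + d_alpha1 * y1 * (x1 @ x1) + d_alpha2 * y2 * (x2 @ x1) + b_old
    b2_new = E2_old + d_alpha1 * y1 * (x1 @ x2) + d_alpha2 * y2 * (x2 @ x2) + b_old
    if 0 < alpha1_new < C:
        return b1_new                 # alpha_1 non-boundary: forces E_1^new = 0
    if 0 < alpha2_new < C:
        return b2_new                 # alpha_2 non-boundary: forces E_2^new = 0
    return 0.5 * (b1_new + b2_new)    # both at (different) boundaries: take the average
```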
While updating b, we update E_i at the same time.
[Updating F_i]
We have it that:
F(x,y)
=w^t\cdot x-y
=\sum_{i=1}^{n}\alpha_i\cdot y_i\cdot x_i^t\cdot x-y
Trivially, we also have it that:
\triangle F(x,y)
=\triangle\alpha_1\cdot y_1\cdot x_1^t\cdot x+\triangle\alpha_2\cdot y_2\cdot x_2^t\cdot x
[Updating w]
Under the assumption that we are using a linear kernel,
begin from w=\sum_{i=1}^{n}\alpha_i\cdot y_i\cdot x_i,
we can have it that:
\triangle w=\triangle\alpha_1\cdot y_1\cdot x_1+\triangle\alpha_2\cdot y_2\cdot x_2
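Tying the last pieces together, a minimal numpy sketch (names mine) that applies the incremental updates of w and of the cached errors E_i after one SMO step, assuming a linear kernel:

```python
import numpy as np

def apply_step(w, E, X, x1, x2, y1, y2, d_alpha1, d_alpha2, d_b):
    """w += d_alpha1*y1*x1 + d_alpha2*y2*x2;  E_i += d_alpha1*y1*x1^t*x_i + d_alpha2*y2*x2^t*x_i - d_b."""
    w = w + d_alpha1 * y1 * x1 + d_alpha2 * y2 * x2
    E = E + d_alpha1 * y1 * (X @ x1) + d_alpha2 * y2 * (X @ x2) - d_b
    return w, E
```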