
Support Vector Machine Initial Idea

Initial Idea

Now, we know the distance from H to H_1 is \frac{\left|w^t\cdot x-b\right|}{\left\|w\right\|}=\frac1{\left\|w\right\|} for any x on H_1; where H:w^t\cdot x-b=0 and H_1:w^t\cdot x-b=1. As to the distance between H_1 and H_2 (H_2:w^t\cdot x-b=-1), we must minimize the term \left\|w\right\|=(w^t\cdot w)^{\textstyle\frac12} to maximize it, subject to the condition that no points fall between H_1 and H_2.
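As a quick sketch of the width claim (using only the definitions of H_1 and H_2 above): pick any x_1 on H_1 and any x_2 on H_2, and project their difference onto the unit normal \frac w{\left\|w\right\|}:
\begin{array}{l}w^t\cdot x_1-b=1,\;w^t\cdot x_2-b=-1\\\Rightarrow w^t\cdot(x_1-x_2)=2\\\Rightarrow\frac{w^t}{\left\|w\right\|}\cdot(x_1-x_2)=\frac2{\left\|w\right\|}\end{array}
So the gap between H_1 and H_2 is \frac2{\left\|w\right\|}, and maximizing this gap is exactly minimizing \left\|w\right\|.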

Formulate Our Problem By Lagrange Multiplier

More precisely, while minimizing w^t\cdot w, we require:
➀for positive examples, w^t\cdot x-b\geq1, where y=1.
➁for negative examples, w^t\cdot x-b\leq-1, where y=-1.
Then, combining ➀ and ➁, we have:
y_i\cdot(w^t\cdot x_i-b)\geq1, \forall i\in\{1,2,...,n\}
We can formulate our problem as:
\underset{w,b}{min}\frac12w^t\cdot w, subject to y_i\cdot(w^t\cdot x_i-b)\geq1,\forall i.
The first part is the target, the second part is the constraint.
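For a concrete feel of this constrained minimization, below is a minimal sketch that solves the primal problem on a tiny toy data set. It assumes the cvxpy library is available (any QP solver would do), and the names X, y, w, b are mine for illustration only:

```python
import numpy as np
import cvxpy as cp  # assumed available; any QP solver would do

# toy, linearly separable data: labels y in {+1, -1}
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# target: (1/2) w^t w, subject to y_i (w^t x_i - b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w - b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, ", b =", b.value)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w.value))
```

The solver also reports dual values for the constraint, which correspond to the Lagrange multipliers \alpha_i introduced next.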

We introduce \alpha_1,\alpha_2,\dots,\alpha_n as the Lagrange multipliers and take the Lagrangian below as our objective function:
L(w,b,\alpha)=\frac12w^t\cdot w-\sum_{i=1}^n\alpha_i\cdot(y_i\cdot(w^t\cdot x_i-b)-1)
;where L is the function from which the best optimal values of w, \alpha and b are to be estimated, much like a likelihood to be optimized.
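As a small numerical illustration (a sketch only; the function and argument names are mine, not from the article), the Lagrangian above can be evaluated directly:

```python
import numpy as np

def lagrangian(w, b, alpha, X, y):
    """L(w, b, alpha) = 1/2 w^t w - sum_i alpha_i * (y_i * (w^t x_i - b) - 1).

    X is the (n, d) matrix of training points, y holds the +1/-1 labels,
    and alpha holds the n multipliers.
    """
    margins = y * (X @ w - b)          # y_i * (w^t x_i - b), one entry per sample
    return 0.5 * w @ w - np.sum(alpha * (margins - 1.0))
```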

Regularize The Objective Function

Next, we inspect L(w,b,\alpha) to get the optimal values of w, \alpha and b:
\begin{array}{l}L(w,b,\alpha)\\=\frac12w^t\cdot w-\sum_{i=1}^n\alpha_i\cdot(y_i\cdot(w^t\cdot x_i-b)-1)\\=\frac12w^t\cdot w-\sum_{i=1}^n\alpha_i\cdot y_i\cdot(w^t\cdot x_i-b)+\sum_{i=1}^n\alpha_i\end{array}
To get the optimal \alpha, we take the partial derivatives of L with respect to w and b respectively and equate them to zero:
➀\frac{\partial L}{\partial w}=w-\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i=0, then, we have w=\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i
➁\frac{\partial L}{\partial b}=\sum_{i=1}^n\alpha_i\cdot y_i=0, then, we have \sum_{i=1}^n\alpha_i\cdot y_i=0
➂substituting the results of ➀ and ➁ into L(w,b,\alpha), we obtain:
\begin{array}{l}L(w,b,\alpha)\\=\frac12(\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i)^t\cdot(\sum_{j=1}^n\alpha_j\cdot y_j\cdot x_j)\\-\sum_{i=1}^n\alpha_i\cdot y_i\cdot((\sum_{j=1}^n\alpha_j\cdot y_j\cdot x_j)^t\cdot x_i-b)\\+\sum_{i=1}^n\alpha_i\end{array}
➃reducing the second term, we have the deduction below:
\begin{array}{l}\sum_{i=1}^n\alpha_i\cdot y_i\cdot((\sum_{j=1}^n\alpha_j\cdot y_j\cdot x_j)^t\cdot x_i-b)\\=\sum_{i=1}^n\alpha_i\cdot y_i\cdot(\sum_{j=1}^n\alpha_j\cdot y_j\cdot x_j)^t\cdot x_i-\sum_{i=1}^n\alpha_i\cdot y_i\cdot b\\=(\sum_{j=1}^n\alpha_j\cdot y_j\cdot x_j)^t\cdot(\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i)\end{array}
, since \sum_{i=1}^n\alpha_i\cdot y_i\cdot b=b\cdot\sum_{i=1}^n\alpha_i\cdot y_i=0 by ➁.
➄taking this back into L(w,b,\alpha), we get:
\begin{array}{l}L(w,b,\alpha)\\=-\frac12(\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i)^t\cdot(\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i)\\+\sum_{i=1}^n\alpha_i\\=-\frac12\sum_{i,j=1}^n\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot(x_i^t\cdot x_j)\\+\sum_{i=1}^n\alpha_i\end{array}
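The reduced form depends only on the \alpha_i and the pairwise inner products x_i^t\cdot x_j, so it can be written down with one Gram matrix. A minimal sketch (the names are mine for illustration); the \alpha_i are then found by maximizing this expression subject to \alpha_i\geq0 and \sum_{i=1}^n\alpha_i\cdot y_i=0 from ➁:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """L(alpha) = sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j (x_i^t x_j)."""
    K = X @ X.T                             # Gram matrix of inner products x_i^t x_j
    Q = (y[:, None] * y[None, :]) * K       # y_i * y_j * (x_i^t x_j)
    return np.sum(alpha) - 0.5 * alpha @ Q @ alpha
```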

Why L(w,b,\alpha) = the target plus the constraint?

But why do we design L(w,b,\alpha) as the target combined with the constraint term? Recall that in the Lagrange multiplier article, I have shown that taking L(x,\lambda)=f(x)+\lambda\cdot g(x) finally leads to \lambda<0.

➀if we choose L(x,\lambda)=f(x)-\lambda\cdot g(x), then \lambda>0. This is by artificial design; whichever sign is used in the expression, we can reach a similar result.
➁now, back to the formulation of our problem:
\underset{w,b}{min}\frac12w^t\cdot w, subject to y_i\cdot(w^t\cdot x_i-b)\geq1
➂suppose we take the constraint function to be g(x)=y_i\cdot(w^t\cdot x_i-b)-1\geq0; since \lambda would be negative by the prior Lagrangian proof, to get a positive \lambda we should use the minus operator in L(x,\lambda).
➃therefore, almost all SVM articles design L(w,b,\alpha), where the \alpha_i are the Lagrange multipliers, using the minus operator:
L(w,b,\alpha)=\frac12w^t\cdot w-\sum_{i=1}^n\alpha_i\cdot(y_i\cdot(w^t\cdot x_i-b)-1)
➄such a design guarantees that we have \alpha_i\geq0,\;\forall i (with \alpha_i>0 only for the support vectors), which is one major condition that must be satisfied in the following SVM article.

After solving for the \alpha_i, we get w=\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i; then we can classify a new object x by:
\begin{array}{l}f(x)\\=sign(w^t\cdot x-b)\\=sign((\sum_{i=1}^n\alpha_i\cdot y_i\cdot x_i)^t\cdot x-b)\\=sign(\sum_{i=1}^n\alpha_i\cdot y_i\cdot(x_i^t\cdot x)-b)\end{array}
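A sketch of this decision rule in code (assuming the \alpha_i, the training set and b have already been obtained; the helper name is mine):

```python
import numpy as np

def svm_classify(x_new, alpha, X, y, b):
    """f(x) = sign(sum_i alpha_i y_i (x_i^t x_new) - b).

    Non-support vectors have alpha_i = 0, so only the support vectors
    actually contribute to the sum.
    """
    return np.sign(np.sum(alpha * y * (X @ x_new)) - b)
```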

Non-Linear Training Set

Here we have a question: what if the separating surface between the two classes is not linear?
This might imply that the training set is distributed in a non-linear pattern. The usual suggestion is to transform the training set into another (possibly higher-dimensional) space, such that the transformed points become linearly separable.
➀let \Phi(\cdot) be the transformation function into some high-dimensional space; then:
\begin{array}{l}L(\alpha)\\=-\frac12\sum_{i,j=1}^n\alpha_i\cdot\alpha_j\cdot y_i\cdot y_j\cdot(\Phi(x_i)^t\cdot\Phi(x_j))\\+\sum_{i=1}^n\alpha_i\end{array}
, where k(x_i,x_j)=\Phi(x_i)^t\cdot\Phi(x_j) is the kernel function; the inner product \Phi(x_i)^t\cdot\Phi(x_j) taken in the high-dimensional space can be computed by evaluating the kernel function directly on the original input vectors, without ever computing \Phi(\cdot) explicitly.
➁we can take k(x_i,x_j)=e^{-(\frac{\left|x_i-x_j\right|^2}{2\cdot\delta^2})},
where x_i,x_j\in R^{n_1}, \Phi(x_i),\Phi(x_j)\in R^{n_2} with n_1\leq n_2, and k(x_i,x_j)\in R is a scalar.
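To make the kernel substitution concrete, below is a small sketch of the Gaussian kernel above and the kernelized decision rule (again, the function names and the bandwidth argument delta are mine for illustration):

```python
import numpy as np

def rbf_kernel(xi, xj, delta=1.0):
    """k(x_i, x_j) = exp(-|x_i - x_j|^2 / (2 * delta^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * delta ** 2))

def kernel_classify(x_new, alpha, X, y, b, delta=1.0):
    """sign(sum_i alpha_i y_i k(x_i, x_new) - b): the linear inner product
    x_i^t x is replaced by the kernel, so Phi(.) is never computed explicitly."""
    k = np.array([rbf_kernel(xi, x_new, delta) for xi in X])
    return np.sign(np.sum(alpha * y * k) - b)
```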