
Binary and Multiple Classification

Classification

Classification can be categorized into two major varieties: binary and multiple classification. The major purpose of binary classification is to identify an object as one of two mutually exclusive identities by means of logistic regression. Multiple classification comes with a wider range of identities, say 0, 1, 2, 3, ..., n; nevertheless, it uses one-versus-all, which is built on the same algorithm as binary classification.

Binary Classification

Given input data of size n, binary classification classifies each record as true or false, or as one of two alternative identities, 0 and 1, and so forth.
Before we step into multiple classification, a full illustration of the logistic regression model is mandatory.

Why Do We Need The Logistic Regression Model?

Suppose we’d like to classify the given input data as 0 or 1, for example, identifying a tumor by its size as malignant or benign.
We are given m records of tumor history and come out with the fitted linear regression line in the graph labeled ➀, where the (m-1)-th record is identified as non-malignant (Y=0).

After records m+1 and m+2 have been added to the given sample, we get the new fitted regression line in the graph labeled ➁; watch out, the (m-1)-th record is now identified as malignant (Y=1)!

Caution must be taken that, by tradition, the fitted linear regression line could be biased by the input sample, and $h_\theta(x)>1$ or $h_\theta(x)<0$ could happen.

What we want for binary classification in this example is to classify
$$\begin{cases}Y=1, & h_\theta(x)\geq 0.5\\ Y=0, & h_\theta(x)<0.5\end{cases}$$
The linear regression model is likely to have $h_\theta(x)>1$ or $h_\theta(x)<0$, while the logistic regression model guarantees $0\leq h_\theta(x)\leq 1$!

The Logistic Regression Function

It is also named the sigmoid function, and defined as:
$$h_\theta(x)=\frac{1}{1+e^{-\theta^{t}x}},\ \text{where}\ \begin{cases}\theta\in R^{n},\ M_{n\times 1}\\ x\in R^{n},\ M_{n\times 1}\end{cases}$$

We can then have $h_\theta(x)=P(Y=1|x;\theta)$, the probability of $Y=1$ given $x$ and $\theta$, with
$$P(Y=1|x;\theta)+P(Y=0|x;\theta)=1$$
$$P(Y=1|x;\theta)=1-P(Y=0|x;\theta)$$
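To make this concrete, below is a minimal NumPy sketch of the sigmoid hypothesis and its probability interpretation; the function names (`sigmoid`, `h_theta`) and the sample numbers are my own illustration, not anything prescribed by the text.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    """Hypothesis h_theta(x) = 1 / (1 + exp(-theta^t x)) = P(Y=1 | x; theta)."""
    return sigmoid(np.dot(theta, x))

theta = np.array([-1.0, 0.8])    # illustrative parameters (bias, weight)
x = np.array([1.0, 3.0])         # illustrative input, x[0] = 1 is the bias feature
p_y1 = h_theta(theta, x)         # P(Y=1 | x; theta)
p_y0 = 1.0 - p_y1                # P(Y=0 | x; theta) = 1 - P(Y=1 | x; theta)
print(p_y1, p_y0, p_y1 >= 0.5)   # predict Y=1 whenever h_theta(x) >= 0.5
```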

The Logistic Regression Cost Function

Can we reuse the cost function of the linear regression model, $J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^{2}$, as the logistic version of the cost function? The answer is no! Once $h_\theta(x)$ is the sigmoid, this squared error is no longer a smooth convex curve, and gradient descent may get trapped in a local minimum.

The major purpose of the cost function is to measure the error of the model, so that gradient descent can then be applied to find the $\theta$ that really yields the smallest error.
Two key points must be clarified:
➀ when $P(Y=1)\rightarrow 1$, the error from $1$ for $P(Y=1)$ should $\rightarrow 0$;
➁ when $P(Y=0)\rightarrow 0$, the error from $0$ for $P(Y=0)$ should $\rightarrow 0$.
We thus define the cost function as:
$$\begin{cases}-\log(h_\theta(x)), & \text{for } Y=1\\ -\log(1-h_\theta(x)), & \text{for } Y=0\end{cases}$$

Why do we use log? Why do we have a minus sign?

Please recall that $\log 1=0$ and that $\log$ of a value in $(0,1)$ is negative; the minus sign therefore keeps the cost non-negative, and the cost approaches $0$ whenever $h_\theta(x)\rightarrow 1$ for $Y=1$ or $h_\theta(x)\rightarrow 0$ for $Y=0$.
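To see these two facts numerically, here is a tiny sketch (the sample probabilities are arbitrary illustrations):

```python
import numpy as np

# Cost for a single example with label y = 1: -log(h).
for h in (0.99, 0.5, 0.01):
    print(f"y=1, h={h:4.2f}, cost={-np.log(h):.4f}")
# h -> 1 gives cost -> 0; h -> 0 blows the cost up toward infinity.

# Cost for a single example with label y = 0: -log(1 - h).
for h in (0.01, 0.5, 0.99):
    print(f"y=0, h={h:4.2f}, cost={-np.log(1.0 - h):.4f}")
# h -> 0 gives cost -> 0; h -> 1 blows the cost up toward infinity.
```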

Next, to formalize it,
$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\text{cost}(h_\theta(x^{(i)}),y^{(i)}),\ \text{where}\ \text{cost}(h_\theta(x^{(i)}),y^{(i)})=\begin{cases}-\log(h_\theta(x^{(i)})), & \text{for } y^{(i)}=1\\ -\log(1-h_\theta(x^{(i)})), & \text{for } y^{(i)}=0\end{cases}$$

The graph exhibits some findings:
$$\begin{cases}\text{cost}(h_\theta(x^{(i)}),y^{(i)})=0, & \text{if } h_\theta(x^{(i)})=y^{(i)}\\ \text{cost}(h_\theta(x^{(i)}),y^{(i)})\rightarrow\infty, & \text{if } h_\theta(x^{(i)})\rightarrow 0,\ y^{(i)}=1\\ \text{cost}(h_\theta(x^{(i)}),y^{(i)})\rightarrow\infty, & \text{if } h_\theta(x^{(i)})\rightarrow 1,\ y^{(i)}=0\end{cases}$$

Now, we combine the two cases into one formula:
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\Big],\ \text{where}\ \begin{cases}y^{(i)}=1, & \text{we have } 1-y^{(i)}=0\\ y^{(i)}=0, & \text{we have } 1-y^{(i)}=1\end{cases}$$
Thus, we could have the two cases operated in a mutually exclusive flow.
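A minimal vectorized sketch of this combined cost, assuming a design matrix `X` of shape (m, n) whose rows are the $x^{(i)}$ and a label vector `y` of 0s and 1s (the function and variable names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Unregularized logistic cost: -(1/m) * sum[y*log(h) + (1-y)*log(1-h)]."""
    m = len(y)
    h = sigmoid(X @ theta)          # h_theta(x^(i)) for every row of X at once
    eps = 1e-12                     # guard against log(0)
    return -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Toy data: 4 samples, 2 features (the first column is the bias feature).
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(cost(np.zeros(2), X, y))      # all-zero theta gives h = 0.5, so cost = log(2) ~ 0.693
```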

Regularization of The Logistic Regression Cost Function

Please keep in mind the need for regularization of the cost function, or you might struggle with an embarrassing overfitting situation. We put an extra term at the end of the existing cost function:
$$J(\theta)_{REG}=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\Big]+\frac{\lambda}{2m}\sum_{j=2}^{n}\theta_{j}^{2};\ \text{where}\ \theta\in R^{n},\ M_{n\times 1}$$
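Following this formula, a sketch of the regularized cost; here the bias sits at index 0 of `theta` (the author's j = 1), and the helper names are assumptions of this illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    """J(theta)_REG = cross-entropy term + (lambda / 2m) * sum of squared non-bias thetas."""
    m = len(y)
    h = sigmoid(X @ theta)
    eps = 1e-12                     # guard against log(0)
    cross_entropy = -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    penalty = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)   # theta[0] is the bias, not penalized
    return cross_entropy + penalty
```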

For ease of understanding, we first illustrate by intuition:
$$\frac{\partial J(\theta)}{\partial\theta_{j}}=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_{j}^{(i)},\ \text{for } j=1,\ \text{the bias term, no need for regularization!}$$
$$\frac{\partial J(\theta)}{\partial\theta_{j}}=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_{j}^{(i)}+\frac{\lambda}{m}\theta_{j},\ \text{for } j>1$$
where the term $h_\theta(x^{(i)})-y^{(i)}$ is just the intuitive concept; the proof should make the derivation on the regularized version of the cost function. Next, we take action.
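The two gradient cases above translate directly into a short vectorized sketch (again, the bias lives at index 0 of `theta` in this code, corresponding to the author's j = 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_reg(theta, X, y, lam):
    """dJ/dtheta_j = (1/m) * sum (h - y) * x_j, plus (lambda/m) * theta_j for non-bias j."""
    m = len(y)
    h = sigmoid(X @ theta)
    grad = (1.0 / m) * (X.T @ (h - y))     # the intuitive (h - y) * x_j form, for every j
    grad[1:] += (lam / m) * theta[1:]      # add the regularization term, skipping the bias
    return grad
```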

Differentiating the first part of the cost function, we have the deduction below:
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\Big];\ \text{where}\ h_\theta(x)=\frac{1}{1+e^{-\theta^{t}x}},\ \theta\in R^{n},\ x\in R^{n}$$
$$\frac{\partial J(\theta)}{\partial\theta_{j}}=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}(h_\theta(x^{(i)}))^{-1}\frac{\partial h_\theta(x^{(i)})}{\partial\theta_{j}}+(1-y^{(i)})(1-h_\theta(x^{(i)}))^{-1}\frac{\partial(1-h_\theta(x^{(i)}))}{\partial\theta_{j}}\Big]$$

We expand each piece further:
➀ the first-order partial derivative of $h_\theta(x^{(i)})$:
$$\frac{\partial h_\theta(x^{(i)})}{\partial\theta_{j}}=\frac{\partial}{\partial\theta_{j}}\Big(\frac{1}{1+e^{-\theta^{t}x^{(i)}}}\Big)=\frac{\partial}{\partial\theta_{j}}\big(1+e^{-\theta^{t}x^{(i)}}\big)^{-1}=(-1)\big[1+e^{-\theta^{t}x^{(i)}}\big]^{-2}\big(-x_{j}^{(i)}e^{-\theta^{t}x^{(i)}}\big)=\big[1+e^{-\theta^{t}x^{(i)}}\big]^{-2}\big(x_{j}^{(i)}e^{-\theta^{t}x^{(i)}}\big)$$

➁ extend $1-h_\theta(x^{(i)})$ as:
$$1-h_\theta(x^{(i)})=1-\frac{1}{1+e^{-\theta^{t}x^{(i)}}}=\frac{e^{-\theta^{t}x^{(i)}}}{1+e^{-\theta^{t}x^{(i)}}}$$

Take ➀ and ➁ into the first-order partial derivative of $J(\theta)$:
$$\begin{aligned}\frac{\partial J(\theta)}{\partial\theta_{j}}&=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\big(1+e^{-\theta^{t}x^{(i)}}\big)\cdot\big(1+e^{-\theta^{t}x^{(i)}}\big)^{-2}\big(x_{j}^{(i)}e^{-\theta^{t}x^{(i)}}\big)+(1-y^{(i)})\cdot\frac{1+e^{-\theta^{t}x^{(i)}}}{e^{-\theta^{t}x^{(i)}}}\cdot(-1)\big(1+e^{-\theta^{t}x^{(i)}}\big)^{-2}\big(x_{j}^{(i)}e^{-\theta^{t}x^{(i)}}\big)\Big]\\&=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\big(1+e^{-\theta^{t}x^{(i)}}\big)^{-1}\big(x_{j}^{(i)}e^{-\theta^{t}x^{(i)}}\big)-(1-y^{(i)})\big(1+e^{-\theta^{t}x^{(i)}}\big)^{-1}x_{j}^{(i)}\Big]\\&=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\big(e^{-\theta^{t}x^{(i)}}x_{j}^{(i)}\big)-(1-y^{(i)})x_{j}^{(i)}\Big]\big(1+e^{-\theta^{t}x^{(i)}}\big)^{-1}\\&=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\big(e^{-\theta^{t}x^{(i)}}x_{j}^{(i)}\big)-(1-y^{(i)})x_{j}^{(i)}\Big]\cdot h_\theta(x^{(i)})\end{aligned}$$

Finally, taking the second part (the regularization term) into the derivative together, we get the total solution:
$$\frac{\partial J(\theta)_{REG}}{\partial\theta}=-\frac{1}{m}\sum_{i=1}^{m}\Big[x^{(i)}y^{(i)}e^{-\theta^{t}x^{(i)}}-x^{(i)}(1-y^{(i)})\Big]\cdot h_\theta(x^{(i)})+\frac{\lambda}{m}\theta$$
Since $\big[y^{(i)}e^{-\theta^{t}x^{(i)}}-(1-y^{(i)})\big]\cdot h_\theta(x^{(i)})=y^{(i)}-h_\theta(x^{(i)})$, this collapses to $\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}+\frac{\lambda}{m}\theta$, which is exactly the intuitive form given earlier (with the bias component of $\theta$ excluded from the $\frac{\lambda}{m}\theta$ term).
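One way to convince ourselves that this closed-form gradient really matches the regularized cost is a quick finite-difference check; the sketch below reuses the illustrative `cost_reg` and `gradient_reg` helpers defined earlier (both my own code, not part of the original post):

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference approximation of df/dtheta, one component at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

# Toy data and parameters (same shape conventions as the earlier sketches).
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta, lam = np.array([0.1, -0.2]), 1.0

analytic = gradient_reg(theta, X, y, lam)
numeric = numerical_gradient(lambda t: cost_reg(t, X, y, lam), theta)
print(np.max(np.abs(analytic - numeric)))   # should be tiny, e.g. below 1e-8
```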

Multiple Classification

We now migrate to multiple classification. There might exist more than two classes in the given samples; how do you plan to classify an object to the correct identity? We introduce the most often used method, one-versus-all.
Take $h_\theta^{(k)}(x^{(i)})=P(y^{(i)}=k|\theta^{(k)},x^{(i)})$; $k=1,2,3$, $i=1$ to $m$, the count of the dataset. Next, we illustrate the one-versus-all method:

➀ train with $h_\theta^{(1)}(x^{(i)})$ to get the desired $\theta^{(1)}$ vector for class 1.

➁ train with $h_\theta^{(2)}(x^{(i)})$ to get the desired $\theta^{(2)}$ vector for class 2.

➂ train with $h_\theta^{(3)}(x^{(i)})$ to get the desired $\theta^{(3)}$ vector for class 3.

After the $\theta^{(k)}$ have been figured out, for any new input $x^{(i)}$ (where $i>m$, i.e., outside the training set), use $h_\theta^{(k)}(x^{(i)})$ to predict the output for each $k$, and choose the maximum one as the object identity.

That is to say, take $\max_{k}h_\theta^{(k)}(x^{(i)})$ and assign the object the identity of the class $k$ that attains this maximum.
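To close, a compact one-versus-all sketch built on the gradient from the binary case; the training loop, the learning rate, the iteration count, and the toy data are arbitrary choices of this illustration, not anything prescribed by the post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, lam=0.1, alpha=0.5, iters=5000):
    """Learn one theta^(k) per class by treating 'class k' vs 'everything else' as binary."""
    m, n = X.shape
    thetas = np.zeros((num_classes, n))
    for k in range(num_classes):
        y_k = (y == k).astype(float)              # 1 for class k, 0 for every other class
        theta = np.zeros(n)
        for _ in range(iters):                    # plain batch gradient descent
            h = sigmoid(X @ theta)
            grad = (1.0 / m) * (X.T @ (h - y_k))
            grad[1:] += (lam / m) * theta[1:]     # skip the bias term, as before
            theta -= alpha * grad
        thetas[k] = theta
    return thetas

def predict_one_vs_all(thetas, X):
    """Pick the class k whose h_theta^(k)(x) is the largest."""
    return np.argmax(sigmoid(X @ thetas.T), axis=1)

# Toy usage: 3 well-separated classes in 2 features (column 0 is the bias feature).
X = np.array([[1.0, 0.0, 0.0], [1.0, 0.1, 0.1],    # class 0
              [1.0, 1.0, 0.0], [1.0, 1.1, 0.1],    # class 1
              [1.0, 0.0, 1.0], [1.0, 0.1, 1.1]])   # class 2
y = np.array([0, 0, 1, 1, 2, 2])
thetas = train_one_vs_all(X, y, num_classes=3)
print(predict_one_vs_all(thetas, X))               # expected: [0 0 1 1 2 2]
```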