Binary and Multiple Classification
13 Nov 2017 · Classification
Binary Classification
Given input data of size $n$, binary classification is the task of classifying it as true or false, or, equivalently, as one of two alternatives such as 0 and 1.
Before we step into multiple classification, a full illustration of the logistic regression model is necessary.
Why Do We Need The Logistic Regression Model?
Suppose we’d like to classify the given input data as 0 or 1, for example, identifying a tumor as malignant or benign from its size.
We are given $m$ records of tumor history and obtain the fitted linear regression line in the graph labeled ➀, where the $(m-1)$-th record is identified as non-malignant ($Y=0$).
After records $m+1$ and $m+2$ have been added to the sample, we get the new fitted regression line in the graph labeled ➁; watch out that the $(m-1)$-th record is now identified as malignant ($Y=1$)!
Caution must be taken: the fitted linear regression line is biased by the input sample, and $h_\theta(x) > 1$ or $h_\theta(x) < 0$ can happen. What we want for binary classification in this example is to classify
\[
\begin{cases}
Y = 1, & h_\theta(x) \geq 0.5 \\
Y = 0, & h_\theta(x) < 0.5
\end{cases}
\]
The linear regression model is likely to produce $h_\theta(x) > 1$ or $h_\theta(x) < 0$, while the logistic regression model guarantees $0 \leq h_\theta(x) \leq 1$!
The Logistic Regression Function
It is also named the sigmoid function, and is defined as:
\[
h_\theta(x) = \frac{1}{1 + e^{-\theta^T \cdot x}},\quad\text{where}\;\theta \in \mathbb{R}^n,\; x \in \mathbb{R}^n\;\text{(both $n \times 1$ vectors)}
\]
We can then interpret $h_\theta(x) = P(Y=1 \mid x;\theta)$, the probability of $Y=1$ given $x$ and $\theta$:
➀ $P(Y=1 \mid x;\theta) + P(Y=0 \mid x;\theta) = 1$
➁ $P(Y=1 \mid x;\theta) = 1 - P(Y=0 \mid x;\theta)$
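As a minimal sketch of the definitions above (assuming Python with NumPy, which the post itself does not use; the names `sigmoid` and `h` are mine for illustration), the hypothesis can be coded directly:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = P(Y = 1 | x; theta) for n-dimensional theta and x."""
    return sigmoid(theta @ x)

# P(Y = 0 | x; theta) then follows as 1 - h(theta, x).
```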
The Logistic Regression Cost Function
Can we reuse the cost function from the linear regression model, $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, as the logistic version of the cost function? The answer is no, because with the sigmoid hypothesis this cost is not a smooth convex surface, so gradient descent could get trapped in a local minimum.
The major purpose of the cost function is to measure the error of the model; gradient descent can then be applied to find the $\theta$ that truly yields the smallest error.
Why do we use the log? Why the minus sign?
Two key points must be clarified:
➀ when $y=1$ and $h_\theta(x) = P(Y=1) \approx 1$, the cost (the error from 1) should be $\approx 0$
➁ when $y=0$ and $h_\theta(x) = P(Y=1) \approx 0$, the cost (the error from 0) should be $\approx 0$
We thus define the cost function as:
\[
\begin{cases}
-\log\left(h_\theta(x)\right), & \text{for } Y = 1 \\
-\log\left(1 - h_\theta(x)\right), & \text{for } Y = 0
\end{cases}
\]
Please recall that $\log 1 = 0$, which is approached whenever $h_\theta(x) \approx 1$ (for $Y=1$) or $h_\theta(x) \approx 0$ (for $Y=0$).
Next, we formalize it:
\[
J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{cost}\left(h_\theta(x^{(i)}), y^{(i)}\right),\quad\text{where}\;
\mathrm{cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) =
\begin{cases}
-\log\left(h_\theta(x^{(i)})\right), & \text{for } y^{(i)} = 1 \\
-\log\left(1 - h_\theta(x^{(i)})\right), & \text{for } y^{(i)} = 0
\end{cases}
\]
The graph exhibits some findings:
\[
\begin{cases}
\mathrm{cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) = 0, & \text{if } h_\theta(x^{(i)}) = y^{(i)} \\
\mathrm{cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) \approx \infty, & \text{if } h_\theta(x^{(i)}) \approx 0,\; y^{(i)} = 1 \\
\mathrm{cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) \approx \infty, & \text{if } h_\theta(x^{(i)}) \approx 1,\; y^{(i)} = 0
\end{cases}
\]
Now, we combine the two cases into one formula:
\[
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \cdot \log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \cdot \log\left(1 - h_\theta(x^{(i)})\right)\right],\quad\text{where}\;
\begin{cases}
y^{(i)} = 1 & \Rightarrow 1 - y^{(i)} = 0 \\
y^{(i)} = 0 & \Rightarrow 1 - y^{(i)} = 1
\end{cases}
\]
Thus, the two cases are handled in a mutually exclusive manner within a single formula.
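As a hedged sketch of how this combined formula might be computed (again in Python/NumPy; `logistic_cost`, `X`, and `y` are illustrative names, not from the original post):

```python
import numpy as np

def logistic_cost(theta, X, y):
    """
    Cross-entropy cost J(theta) for logistic regression.
    X: (m, n) design matrix, y: (m,) vector of 0/1 labels.
    """
    m = len(y)
    p = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for every record
    # y and (1 - y) switch the two log terms on and off, exactly as in the formula.
    return -(1.0 / m) * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```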
Regularization of The Logistic Regression Cost Function
Recall the need for regularization of the cost function; without it, you might struggle with overfitting. We append an extra term to the existing cost function:
\[
J(\theta)_{REG} = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \cdot \log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \cdot \log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2;\quad\text{where}\;\theta \in \mathbb{R}^n\;(n \times 1)
\]
For ease of understanding, we first illustrate by intuition:
➀ $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot x_j^{(i)}$, for $j = 1$, the bias term, which needs no regularization!
➁ $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j$, for $j > 1$
Here the term $h_\theta(x^{(i)}) - y^{(i)}$ is stated only as intuition; the proof requires differentiating the regularized version of the cost function, which we do next. Differentiating the first part of the cost function, we have the deduction below:
\[
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \cdot \log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \cdot \log\left(1 - h_\theta(x^{(i)})\right)\right];\quad\text{where}\; h_\theta(x) = \frac{1}{1 + e^{-\theta^T \cdot x}},\;\theta \in \mathbb{R}^n,\; x \in \mathbb{R}^n
\]
\[
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)} \cdot \left(h_\theta(x^{(i)})\right)^{-1} \cdot \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j} - (1 - y^{(i)}) \cdot \left(1 - h_\theta(x^{(i)})\right)^{-1} \cdot \frac{\partial \left(1 - h_\theta(x^{(i)})\right)}{\partial \theta_j}\right]
\]
We expand this further:
➀ the first-order partial derivative of $h_\theta(x^{(i)})$:
\[
\frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}
= \frac{\partial}{\partial \theta_j}\left(\frac{1}{1 + e^{-\theta^T \cdot x^{(i)}}}\right)
= \frac{\partial}{\partial \theta_j}\left(1 + e^{-\theta^T \cdot x^{(i)}}\right)^{-1}
= (-1) \cdot \left[1 + e^{-\theta^T \cdot x^{(i)}}\right]^{-2} \cdot \left(-x_j^{(i)} \cdot e^{-\theta^T \cdot x^{(i)}}\right)
= \left[1 + e^{-\theta^T \cdot x^{(i)}}\right]^{-2} \cdot \left(x_j^{(i)} \cdot e^{-\theta^T \cdot x^{(i)}}\right)
\]
➁ expand $1 - h_\theta(x^{(i)})$ as:
\[
1 - h_\theta(x^{(i)}) = 1 - \frac{1}{1 + e^{-\theta^T \cdot x^{(i)}}} = \frac{e^{-\theta^T \cdot x^{(i)}}}{1 + e^{-\theta^T \cdot x^{(i)}}}
\]
Substituting ➀ and ➁ into the first-order partial derivative of $J(\theta)$, and noting that $\frac{\partial\left(1 - h_\theta(x^{(i)})\right)}{\partial \theta_j} = -\frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}$:
\[
\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j}
&= \frac{1}{m}\sum_{i=1}^{m}\Bigl[-y^{(i)} \cdot \bigl(1 + e^{-\theta^T \cdot x^{(i)}}\bigr) \cdot \bigl(1 + e^{-\theta^T \cdot x^{(i)}}\bigr)^{-2} \cdot \bigl(x_j^{(i)} \cdot e^{-\theta^T \cdot x^{(i)}}\bigr) + (1 - y^{(i)}) \cdot \frac{1 + e^{-\theta^T \cdot x^{(i)}}}{e^{-\theta^T \cdot x^{(i)}}} \cdot \bigl(1 + e^{-\theta^T \cdot x^{(i)}}\bigr)^{-2} \cdot \bigl(x_j^{(i)} \cdot e^{-\theta^T \cdot x^{(i)}}\bigr)\Bigr] \\
&= \frac{1}{m}\sum_{i=1}^{m}\Bigl[-y^{(i)} \cdot \bigl(1 + e^{-\theta^T \cdot x^{(i)}}\bigr)^{-1} \cdot \bigl(x_j^{(i)} \cdot e^{-\theta^T \cdot x^{(i)}}\bigr) + (1 - y^{(i)}) \cdot e^{\theta^T \cdot x^{(i)}} \cdot \bigl(1 + e^{-\theta^T \cdot x^{(i)}}\bigr)^{-1} \cdot \bigl(x_j^{(i)} \cdot e^{-\theta^T \cdot x^{(i)}}\bigr)\Bigr] \\
&= \frac{1}{m}\sum_{i=1}^{m}\Bigl[-y^{(i)} \cdot \bigl(1 + e^{-\theta^T \cdot x^{(i)}}\bigr)^{-1} \cdot \bigl(e^{-\theta^T \cdot x^{(i)}} \cdot x_j^{(i)}\bigr) + (1 - y^{(i)}) \cdot \bigl(1 + e^{-\theta^T \cdot x^{(i)}}\bigr)^{-1} \cdot x_j^{(i)}\Bigr] \\
&= \frac{1}{m}\sum_{i=1}^{m}\Bigl[-y^{(i)} \cdot \bigl(e^{-\theta^T \cdot x^{(i)}} \cdot x_j^{(i)}\bigr) + (1 - y^{(i)}) \cdot x_j^{(i)}\Bigr] \cdot \bigl(1 + e^{-\theta^T \cdot x^{(i)}}\bigr)^{-1} \\
&= \frac{1}{m}\sum_{i=1}^{m}\Bigl[-y^{(i)} \cdot \bigl(e^{-\theta^T \cdot x^{(i)}} \cdot x_j^{(i)}\bigr) + (1 - y^{(i)}) \cdot x_j^{(i)}\Bigr] \cdot h_\theta(x^{(i)})
\end{aligned}
\]
Finally, adding the derivative of the regularization term, we obtain the total solution:
\[
\frac{\partial J(\theta)_{REG}}{\partial \theta} = \frac{1}{m}\sum_{i=1}^{m}\left[-x^{(i)} \cdot y^{(i)} \cdot e^{-\theta^T \cdot x^{(i)}} + x^{(i)} \cdot (1 - y^{(i)})\right] \cdot h_\theta(x^{(i)}) + \frac{\lambda}{m}\theta
\]
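Putting the regularized cost and its gradient together, here is a minimal gradient-descent sketch. It uses the simplified per-component form $\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot x_j^{(i)}$ proved above, assumes the first column of `X` is an all-ones bias feature, and leaves the bias term unregularized; all names (`regularized_cost_and_grad`, `lam`, `alpha`) are mine, not from the original post:

```python
import numpy as np

def regularized_cost_and_grad(theta, X, y, lam):
    """J(theta)_REG and its gradient; theta[0] is the bias term and is not regularized."""
    m = len(y)
    p = 1.0 / (1.0 + np.exp(-X @ theta))              # h_theta(x^(i)) for every record
    reg = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)
    cost = -(1.0 / m) * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + reg
    grad = (1.0 / m) * (X.T @ (p - y))                # (h_theta(x^(i)) - y^(i)) * x_j^(i)
    grad[1:] += (lam / m) * theta[1:]                 # add lambda/m * theta_j for the non-bias terms
    return cost, grad

def gradient_descent(X, y, lam=1.0, alpha=0.1, n_iters=2000):
    """Batch gradient descent on the regularized cost; returns the fitted theta."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        _, grad = regularized_cost_and_grad(theta, X, y, lam)
        theta = theta - alpha * grad
    return theta
```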
Multiple Classification
We now move on to multiple classification. When more than one class may exist in the given sample, how do we classify each object to its correct identity? We introduce the most frequently used method, one-versus-all.
Take $h_\theta^{(k)}(x^{(i)}) = P\left(y^{(i)} = k \mid \theta^{(k)}, x^{(i)}\right)$, for $k = 1, 2, 3$ and $i = 1$ to $m$, the size of the dataset. Next, we illustrate the one-versus-all method:
➀ train with $h_\theta^{(1)}(x^{(i)})$ to get the desired $\theta^{(1)}$ vector for class 1.
➁ train with $h_\theta^{(2)}(x^{(i)})$ to get the desired $\theta^{(2)}$ vector for class 2.
➂ train with $h_\theta^{(3)}(x^{(i)})$ to get the desired $\theta^{(3)}$ vector for class 3.
After each $\theta^{(k)}$ has been figured out, for any new input $x^{(i)}$ with $i > m$, evaluate every $h_\theta^{(k)}(x^{(i)})$ and choose the maximum one as the object's identity.
That is to say, $\max_k h_\theta^{(k)}(x^{(i)})$ determines the predicted class $k$.
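A short sketch of the one-versus-all procedure, reusing the `sigmoid` and `gradient_descent` helpers from the earlier sketches (the function names here are again illustrative assumptions, not from the original post):

```python
import numpy as np

def one_vs_all_train(X, y, classes, lam=1.0):
    """Train one binary logistic classifier per class k; returns {k: theta^(k)}."""
    thetas = {}
    for k in classes:
        y_k = (y == k).astype(float)        # relabel: class k versus everything else
        thetas[k] = gradient_descent(X, y_k, lam=lam)
    return thetas

def one_vs_all_predict(thetas, x):
    """Pick the class whose classifier reports the highest h_theta^(k)(x)."""
    return max(thetas, key=lambda k: sigmoid(thetas[k] @ x))
```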