
Gradient Descent


Gradient descent is most widely used in machine learning with respect to linear regression and logistic regression for supervised learning. But gradient descent itself actually knows nothing about when to stop!!

Begin from the Cost Function in Simple Linear Regression (S.L.R)

take the cost function $J(\theta)=\frac1{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$, where the purpose of S.L.R is to minimize $J(\theta)$.
given $h_\theta(x)=\theta^t\cdot x$, where \(\theta=\begin{bmatrix}\theta_1\\\theta_2\end{bmatrix}\) is the vector of coefficients
take \(X=\begin{bmatrix}x_1\\x_2\end{bmatrix}\) as the input data
and $x_1=1$; it is the intercept, or the bias term.
as to dividing by 2m, it is an artificial design:
➀$\frac12$ is to eliminate the power of 2 in the square, after differentiation is taken.
➁$\frac1m$ is to average all squared errors of m input rows of data.
➂the superscript (i) of $X$, $Y$ is the index of the input data record, meaning it is the i-th input data $X$, $Y$
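
To make the cost function concrete, below is a minimal Python sketch (my own illustration, not code from this post); the names cost_J, X, y, theta and the toy data are assumptions, with the first column of X fixed to 1 as the bias term:

```python
import numpy as np

def cost_J(theta, X, y):
    # J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2
    m = len(y)
    h = X @ theta                      # h_theta(x) = theta^t . x for every row
    return np.sum((h - y) ** 2) / (2 * m)

# toy data: the first column is x_1 = 1, the intercept/bias term
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.0, 0.0])
print(cost_J(theta, X, y))             # 9.333... for this toy data
```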

Step into Gradient ∇ of $J(\theta)$

suppose we are given m records of input data, namely $X$ and $Y$. ➀take the derivative of $J(\theta)$ with respect to $\theta_j$, then we obtain:

$\frac{\partial J(\theta)}{\partial\theta_j}=\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$

since we are running batch gradient descent, $\frac1m$ is required to average it.
➁if m = 1, then we just have the j-th term as:

$\frac{\partial J(\theta)}{\partial\theta_j}=\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$
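
As a quick sanity check of both forms of the derivative, here is a hedged Python sketch (the function names gradient_batch and gradient_single are mine, not from the post); it uses the fact that $X^t\cdot(X\cdot\theta-Y)$ computes all the per-$\theta_j$ sums at once:

```python
import numpy as np

def gradient_batch(theta, X, y):
    # dJ/dtheta_j = 1/m * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i), for all j at once
    m = len(y)
    return (X.T @ (X @ theta - y)) / m

def gradient_single(theta, x_i, y_i):
    # the m = 1 case: the j-th term is just (h_theta(x^(i)) - y^(i)) * x_j^(i)
    return (x_i @ theta - y_i) * x_i
```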

Learning Rate α

it is used to make the result of gradient descent converge. A smaller value is believed to have less fluctuation, at the cost of slower convergence.
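
The effect of α can be seen on a toy one-dimensional example (my own illustration, assuming f(x) = x² with gradient 2x, which is not part of this post):

```python
# minimize f(x) = x^2, whose gradient is f'(x) = 2x, with two learning rates
def descend(alpha, x=10.0, steps=5):
    path = [x]
    for _ in range(steps):
        x = x - alpha * 2 * x          # x := x - alpha * gradient
        path.append(x)
    return path

print(descend(alpha=0.1))              # smooth, monotone approach toward 0
print(descend(alpha=0.9))              # overshoots and oscillates around 0
```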

Put It Together

put everything together, we will have the j-th term of $\theta$, inclusive of the bias term $\theta_1$:

$\theta_j:=\theta_j-\alpha\cdot\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$, where $j=1$ to $n$, $\theta\in\mathbb R^n$

given $X\in M_{m\times n}$, $\theta\in M_{n\times1}$, $Y\in M_{m\times1}$, $x^{(i)}\in M_{n\times1}$,

$X=\begin{bmatrix}-(x^{(1)})^t-\\-(x^{(2)})^t-\\...\\-(x^{(m)})^t-\end{bmatrix}_{m\times n}$

and we can thus deduce:

$\frac{\partial J(\theta)}{\partial\theta_j}=\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}=\left(\begin{bmatrix}-(x^{(1)})^t-\\-(x^{(2)})^t-\\...\\-(x^{(m)})^t-\end{bmatrix}_{m\times n}\cdot\begin{bmatrix}\theta_1\\\theta_2\\...\\\theta_n\end{bmatrix}_{n\times1}-\begin{bmatrix}y^{(1)}\\y^{(2)}\\...\\y^{(m)}\end{bmatrix}_{m\times1}\right)^t\cdot\begin{bmatrix}x_j^{(1)}\\x_j^{(2)}\\...\\x_j^{(m)}\end{bmatrix}\cdot\frac1m$
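
Putting the update rule and the vectorized gradient into one loop, a minimal batch gradient descent sketch might look like the following (again my own hedged illustration; the fixed-iteration stopping rule and the toy data are assumptions):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    # theta_j := theta_j - alpha * 1/m * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i),
    # applied to every j simultaneously, for a fixed number of iterations
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = (X.T @ (X @ theta - y)) / m   # gradient of J(theta)
        theta = theta - alpha * grad
    return theta

# toy data generated by y = 2 * x_2, with the first column of X as the bias x_1 = 1
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(batch_gradient_descent(X, y, alpha=0.1, iters=2000))  # close to [0, 2]
```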

Why Does Gradient Descent Work?

We will illustrate with a cost function of multiple parameters:

➀given $f(x_1,x_2)$, a multi-parameter cost function, the gradient ∇ of $f$ consists of $\frac{\partial f}{\partial x_1}$, $\frac{\partial f}{\partial x_2}$
➁suppose $y=f(x_1,x_2)$
➂take $\widehat\varepsilon^2=\frac1{2m}\cdot\sum_{i=1}^m(\widehat f(x_1,x_2)^{(i)}-y^{(i)})^2$
➃to get the minimum value of $\widehat\varepsilon^2$, we have to figure it out by the function below:

$\widehat\varepsilon^2=\underset{x_1,x_2}{\arg\min}\,\frac1{2m}\cdot\sum_{i=1}^m(\widehat f(x_1,x_2)^{(i)}-y^{(i)})^2$

, where $\widehat\varepsilon^2$ is a convex function
➄gradient descent is the algorithm that can gradually lead to the final $x_1$, $x_2$ that minimize $\widehat\varepsilon^2$; its gradient components are spelled out right below
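
By the chain rule, each component of the gradient of $\widehat\varepsilon^2$ takes the same shape as the derivative of $J(\theta)$ above:

$\frac{\partial\widehat\varepsilon^2}{\partial x_1}=\frac1m\cdot\sum_{i=1}^m(\widehat f(x_1,x_2)^{(i)}-y^{(i)})\cdot\frac{\partial\widehat f(x_1,x_2)^{(i)}}{\partial x_1}$, and likewise for $x_2$.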

suppose we are beginning from some base point $x_1^{(0)}$, $x_2^{(0)}$, where the superscript (0) is the iteration index, indicating that it is the 0-th iteration.

for $i=1$ to some large value:
$x_1^{(i)}=x_1^{(i-1)}-\alpha\cdot\frac{\partial f(x_1^{(i-1)},x_2^{(i-1)})}{\partial x_1}$
$x_2^{(i)}=x_2^{(i-1)}-\alpha\cdot\frac{\partial f(x_1^{(i-1)},x_2^{(i-1)})}{\partial x_2}$
repeat the above 2 steps in each iteration, and loop until $\widehat\varepsilon^2$ could be believed to be at its minimum or the iteration index i is large enough.
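
A minimal two-parameter sketch of this loop follows, assuming (my assumption, not the post's) that $\widehat f(x_1,x_2)=x_1+x_2\cdot t$ for known inputs t, so that the two partial derivatives are easy to write down:

```python
import numpy as np

# assumed model: f_hat(x1, x2) = x1 + x2 * t, evaluated at the inputs t below
t = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])           # generated by x1 = 1, x2 = 2
m, alpha = len(y), 0.05

x1, x2 = 0.0, 0.0                            # base point x1^(0), x2^(0)
for i in range(1, 5000):                     # "some large value"
    err = (x1 + x2 * t) - y                  # f_hat(x1, x2) - y for every record
    g1 = err.sum() / m                       # d(eps^2)/d(x1)
    g2 = (err * t).sum() / m                 # d(eps^2)/d(x2)
    x1, x2 = x1 - alpha * g1, x2 - alpha * g2
print(x1, x2)                                # approaches 1.0 and 2.0
```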

next, we explain why we choose to subtract the gradient; actually, there exist 2 possible gradients, +∇ and −∇:
➀for +∇, the equation subtracts the positive gradient at learning rate α > 0, then $x_1^{(i)}=x_1^{(i-1)}-\alpha\cdot\frac{\partial f}{\partial x_1}(x_1^{(i-1)},x_2^{(i-1)})$
the $x_1^{(i)}$ would step back toward the optimal minimum.

➁for −∇, the equation subtracts the negative gradient at learning rate α > 0, which is equivalent to adding its magnitude, then $x_1^{(i)}=x_1^{(i-1)}+\alpha\cdot\left|\frac{\partial f}{\partial x_1}(x_1^{(i-1)},x_2^{(i-1)})\right|$
the $x_1^{(i)}$ would step forward toward the optimal minimum.

both ➀ and ➁ work for $x_2$ in this example as well, we thus prove it!!
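
A one-dimensional numeric check of ➀ and ➁ (my own toy example, assuming f(x) = (x − 3)² with its minimum at x = 3):

```python
# f(x) = (x - 3)^2 has its minimum at x = 3, and f'(x) = 2 * (x - 3)
alpha = 0.25

x = 5.0                          # right of the minimum: gradient = +4 (case ➀)
x = x - alpha * 2 * (x - 3)      # minus a positive gradient steps back to 4.0
print(x)

x = 1.0                          # left of the minimum: gradient = -4 (case ➁)
x = x - alpha * 2 * (x - 3)      # minus a negative gradient steps forward to 2.0
print(x)
```

Both updates move x toward the minimum at x = 3, matching ➀ and ➁ above.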