Gradient Descent
07 Nov 2017
Begin from the Cost Function in Simple Linear Regression (S.L.R)
take the cost function $J(\theta)=\frac1{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$; the purpose of S.L.R is to minimize $J(\theta)$.
given $h_\theta(x)=\theta^T\cdot x$, where \(\theta=\begin{bmatrix}\theta_1\\\theta_2\end{bmatrix}\) is the coefficient vector,
take \(x=\begin{bmatrix}x_1\\x_2\end{bmatrix}\) as the input data,
and $x_1=1$ corresponds to the intercept, or the bias term.
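As a quick check of the bias term, expanding the inner product with $x_1=1$ gives:
$$h_\theta(x)=\theta^T\cdot x=\theta_1\cdot x_1+\theta_2\cdot x_2=\theta_1+\theta_2\cdot x_2$$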
as to the division by 2m, it is an artificial design:
➀$\frac12$ is to eliminate the power of 2 in the square after differentiation is taken.
➁$\frac1m$ is to average the squared errors over the m input rows of data.
➂the superscript (i) on $x$, $y$ is the index of the input data record; $x^{(i)}$, $y^{(i)}$ denote the i-th input data.
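As a minimal sketch, assuming NumPy is available, the cost function can be coded as below; the name `cost_J` and the toy data are illustrative choices, not part of the original derivation:

```python
import numpy as np

def cost_J(theta, X, y):
    """J(theta) = 1/(2m) * sum over i of (h_theta(x_i) - y_i)^2."""
    m = len(y)
    residuals = X @ theta - y            # h_theta(x_i) - y_i for every record i
    return (residuals @ residuals) / (2 * m)

# toy data: y = 1 + 2*x, with the first column of X fixed to 1 (the bias term x_1)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(cost_J(np.array([1.0, 2.0]), X, y))   # 0.0 at the true coefficients
print(cost_J(np.array([0.0, 0.0]), X, y))   # larger cost away from them
```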
Step into Gradient ∇ of $J(\theta)$
suppose we are given m records of input data, $X$ and $Y$.
➀take the partial derivative of $J(\theta)$ with respect to $\theta_j$, then we obtain (the full differentiation step is spelled out right after ➁ below):
$\frac{\partial J(\theta)}{\partial\theta_j}=\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$
since we are running batch gradient descent, the $\frac1m$ term is required to average over all m records.
➁if m = 1, then we just have the j-th term as:
$\frac{\partial J(\theta)}{\partial\theta_j}=(h_\theta(x)-y)\cdot x_j$
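For completeness, the differentiation behind ➀ is just the chain rule applied to the square; since $h_\theta(x)=\theta^T\cdot x$, we have $\frac{\partial h_\theta(x^{(i)})}{\partial\theta_j}=x_j^{(i)}$, and the $\frac12$ cancels the 2 coming down from the power:
$$\frac{\partial J(\theta)}{\partial\theta_j}=\frac1{2m}\sum_{i=1}^m 2\cdot(h_\theta(x^{(i)})-y^{(i)})\cdot\frac{\partial h_\theta(x^{(i)})}{\partial\theta_j}=\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$$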
Learning Rate α
it is used to control how gradient descent converges. A smaller value is believed to produce less fluctuation, at the cost of slower convergence.
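A minimal sketch of that claim, using the illustrative one-parameter cost $J(\theta)=\theta^2$ with gradient $2\theta$; the function name `descend` and the two α values are assumptions for demonstration only:

```python
def descend(theta, alpha, steps=5):
    """Run a few gradient descent steps on J(theta) = theta^2, whose gradient is 2*theta."""
    path = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # theta := theta - alpha * dJ/dtheta
        path.append(theta)
    return path

print(descend(1.0, alpha=0.1))   # small alpha: smooth, monotone approach to the minimum 0
print(descend(1.0, alpha=0.9))   # large alpha: overshoots and oscillates around 0
```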
Put It Together
put everything together and we have the update of the j-th term of $\theta$, inclusive of the bias term $\theta_1$:
given $\frac{\partial J(\theta)}{\partial\theta_j}=\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$,
and the learning rate α, we just deduce out:
$\theta_j:=\theta_j-\alpha\cdot\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$, applied simultaneously for every j.
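Below is a runnable sketch of this batch update rule, assuming NumPy; the name `batch_gradient_descent`, the value α = 0.1, and the toy data are illustrative assumptions rather than values from the derivation:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Repeat theta_j := theta_j - alpha * (1/m) * sum((h_theta(x_i) - y_i) * x_ij)
    for every j simultaneously, using the whole batch of m records each step."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        gradient = (X.T @ (X @ theta - y)) / m   # vectorized form of the averaged sum over i
        theta = theta - alpha * gradient
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # first column is the bias term x_1 = 1
y = np.array([1.0, 3.0, 5.0])
print(batch_gradient_descent(X, y))   # approaches [1, 2], i.e. y = 1 + 2*x
```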
Why Does Gradient Descent Work?
We will illustrate with a cost function of multiple parameters:
➀given a multi-parameter cost function $J(\theta_1,\theta_2,\cdots,\theta_n)$, the components of the gradient ∇ of $J(\theta)$ are $\frac{\partial J(\theta)}{\partial\theta_1},\frac{\partial J(\theta)}{\partial\theta_2},\cdots,\frac{\partial J(\theta)}{\partial\theta_n}$,
➁suppose \(\theta=\begin{bmatrix}\theta_1\\\theta_2\\\vdots\\\theta_n\end{bmatrix}\),
➂take \(\nabla J(\theta)=\begin{bmatrix}\frac{\partial J(\theta)}{\partial\theta_1}\\\frac{\partial J(\theta)}{\partial\theta_2}\\\vdots\\\frac{\partial J(\theta)}{\partial\theta_n}\end{bmatrix}\),
➃to get the minimum value of $J(\theta)$, we have to figure out $\theta^*$ by the function below: $\theta^*=\arg\min_{\theta}J(\theta)$, where $J(\theta)$ is a convex function,
➄gradient descent is the algorithm that can gradually lead to the final $\theta^*$ that minimizes $J(\theta)$.
suppose we are beginning from some base point $\theta^{(0)}$, where the superscript (0) is the iteration index, indicating that it is the 0-th iteration.
for $i$ = 0 to some large value:
➀compute the gradient $\nabla J(\theta^{(i)})$
➁update $\theta^{(i+1)}=\theta^{(i)}-\alpha\cdot\nabla J(\theta^{(i)})$
repeat the above 2 steps in each iteration; loop until $J(\theta^{(i)})$ could be believed to be at its minimum, or the iteration index i is large enough. Next, we explain why we choose to subtract the gradient in the update. Actually, there exist 2 cases of gradient, +∇ and −∇:
➀for +∇, the update equation subtracts the positive gradient at learning rate α > 0, then $\theta_j:=\theta_j-\alpha\cdot(+\nabla)<\theta_j$,
and $\theta_j$ would step back to the optimal minimum.
➁for −∇, the update equation subtracts the negative gradient at learning rate α > 0, then $\theta_j:=\theta_j-\alpha\cdot(-\nabla)>\theta_j$,
and $\theta_j$ would step forward to the optimal minimum.
both ➀ and ➁ work for $\theta_j$ in this example, we thus prove it!!
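As a concrete instance of ➀ and ➁, take the illustrative convex cost $J(\theta)=\theta^2$ with its minimum at $\theta=0$, gradient $\nabla=2\theta$, and $\alpha=0.1$:
$$\theta=3:\ \nabla=2\cdot 3=6>0,\quad \theta:=3-0.1\cdot 6=2.4\ \text{(steps back toward 0)}$$
$$\theta=-3:\ \nabla=2\cdot(-3)=-6<0,\quad \theta:=-3-0.1\cdot(-6)=-2.4\ \text{(steps forward toward 0)}$$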