
Gradient Descent


Gradient descent is most widely used in machine learning with respect to linear regression and logistic regression for supervised learning. But gradient descent itself actually knows nothing about when to stop!!

Begin from the Cost Function in Simple Linear Regression (S.L.R)

take the cost function $J(\theta)=\frac1{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$, where the purpose of S.L.R is to minimize $J(\theta)$.
given $h_\theta(x)=\theta^t\cdot x$, where \(\theta=\begin{bmatrix}\theta_1\\\theta_2\end{bmatrix}\) is the vector of coefficients
take \(X=\begin{bmatrix}x_1\\x_2\end{bmatrix}\) as the input data
and $x_1=1$; it is the intercept, or the bias term.
as to dividing by 2m, it is an artificial design:
➀$\frac12$ is to eliminate the power of 2 in the square, after differentiation is taken.
➁$\frac1m$ is to average all squared errors of m input rows of data.
➂the superscript (i) of $X$, $Y$ is the index of the input data record, meaning it is the i-th input data $X$, $Y$
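
To make the cost function concrete, below is a minimal Python sketch (my own illustration, not code from this post); the names cost_J, X, y, theta and the toy data are assumptions, with the first column of X fixed to 1 as the bias term:

```python
import numpy as np

def cost_J(theta, X, y):
    # J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2
    m = len(y)
    h = X @ theta                      # h_theta(x) = theta^t . x for every row
    return np.sum((h - y) ** 2) / (2 * m)

# toy data: the first column is x_1 = 1, the intercept/bias term
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.0, 0.0])
print(cost_J(theta, X, y))             # 9.333... for this toy data
```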

Step into Gradient ∇ of $J(\theta)$

suppose we are given m records of input data, namely $X$ and $Y$. ➀take the derivative of $J(\theta)$ with respect to $\theta_j$, then we obtain:

$\frac{\partial J(\theta)}{\partial\theta_j}=\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$

since we are running batch gradient descent, $\frac1m$ is required to average it.
➁if m = 1, then we just have the j-th term as:

$\frac{\partial J(\theta)}{\partial\theta_j}=\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$
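
As a quick sanity check of both forms of the derivative, here is a hedged Python sketch (the function names gradient_batch and gradient_single are mine, not from the post); it uses the fact that $X^t\cdot(X\cdot\theta-Y)$ computes all the per-$\theta_j$ sums at once:

```python
import numpy as np

def gradient_batch(theta, X, y):
    # dJ/dtheta_j = 1/m * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i), for all j at once
    m = len(y)
    return (X.T @ (X @ theta - y)) / m

def gradient_single(theta, x_i, y_i):
    # the m = 1 case: the j-th term is just (h_theta(x^(i)) - y^(i)) * x_j^(i)
    return (x_i @ theta - y_i) * x_i
```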

Learning Rate α

it is used to make the result of gradient descent converge. A smaller value is believed to have less fluctuation, at the cost of slower convergence.
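
The effect of α can be seen on a toy one-dimensional example (my own illustration, assuming f(x) = x² with gradient 2x, which is not part of this post):

```python
# minimize f(x) = x^2, whose gradient is f'(x) = 2x, with two learning rates
def descend(alpha, x=10.0, steps=5):
    path = [x]
    for _ in range(steps):
        x = x - alpha * 2 * x          # x := x - alpha * gradient
        path.append(x)
    return path

print(descend(alpha=0.1))              # smooth, monotone approach toward 0
print(descend(alpha=0.9))              # overshoots and oscillates around 0
```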

Put It Together

put everything together, we will have the j-th term of $\theta$, inclusive of the bias term $\theta_1$:

$\theta_j:=\theta_j-\alpha\cdot\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$, where $j=1$ to $n$, $\theta\in\mathbb R^n$

given $X\in M_{m\times n}$, $\theta\in M_{n\times1}$, $Y\in M_{m\times1}$, $x^{(i)}\in M_{n\times1}$,

$X=\begin{bmatrix}-(x^{(1)})^t-\\-(x^{(2)})^t-\\...\\-(x^{(m)})^t-\end{bmatrix}_{m\times n}$

and we can thus deduce:

$\frac{\partial J(\theta)}{\partial\theta_j}=\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}=\left(\begin{bmatrix}-(x^{(1)})^t-\\-(x^{(2)})^t-\\...\\-(x^{(m)})^t-\end{bmatrix}_{m\times n}\cdot\begin{bmatrix}\theta_1\\\theta_2\\...\\\theta_n\end{bmatrix}_{n\times1}-\begin{bmatrix}y^{(1)}\\y^{(2)}\\...\\y^{(m)}\end{bmatrix}_{m\times1}\right)^t\cdot\begin{bmatrix}x_j^{(1)}\\x_j^{(2)}\\...\\x_j^{(m)}\end{bmatrix}\cdot\frac1m$
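
Putting the update rule and the vectorized gradient into one loop, a minimal batch gradient descent sketch might look like the following (again my own hedged illustration; the fixed-iteration stopping rule and the toy data are assumptions):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    # theta_j := theta_j - alpha * 1/m * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i),
    # applied to every j simultaneously, for a fixed number of iterations
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = (X.T @ (X @ theta - y)) / m   # gradient of J(theta)
        theta = theta - alpha * grad
    return theta

# toy data generated by y = 2 * x_2, with the first column of X as the bias x_1 = 1
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(batch_gradient_descent(X, y, alpha=0.1, iters=2000))  # close to [0, 2]
```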

Why Does Gradient Descent Work?

We will illustrate with a cost function of multiple parameters:

➀given $f(x_1,x_2)$, a multi-parameter cost function, the gradient ∇ of $f$ consists of $\frac{\partial f}{\partial x_1}$, $\frac{\partial f}{\partial x_2}$
➁suppose $y=f(x_1,x_2)$
➂take $\widehat\varepsilon^2=\frac1{2m}\cdot\sum_{i=1}^m(\widehat f(x_1,x_2)^{(i)}-y^{(i)})^2$
➃to get the minimum value of $\widehat\varepsilon^2$, we have to figure it out by the function below:

$\widehat\varepsilon^2=\underset{x_1,x_2}{\arg\min}\,\frac1{2m}\cdot\sum_{i=1}^m(\widehat f(x_1,x_2)^{(i)}-y^{(i)})^2$

, where $\widehat\varepsilon^2$ is a convex function
➄gradient descent is the algorithm that can gradually lead to the final $x_1$, $x_2$ that minimize $\widehat\varepsilon^2$; its gradient components are spelled out right below
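
By the chain rule, each component of the gradient of $\widehat\varepsilon^2$ takes the same shape as the derivative of $J(\theta)$ above:

$\frac{\partial\widehat\varepsilon^2}{\partial x_1}=\frac1m\cdot\sum_{i=1}^m(\widehat f(x_1,x_2)^{(i)}-y^{(i)})\cdot\frac{\partial\widehat f(x_1,x_2)^{(i)}}{\partial x_1}$, and likewise for $x_2$.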

suppose we are beginning from some base point $x_1^{(0)}$, $x_2^{(0)}$, where the superscript (0) is the iteration index, indicating that it is the 0-th iteration.

for $i=1$ to some large value:
$x_1^{(i)}=x_1^{(i-1)}-\alpha\cdot\frac{\partial f(x_1^{(i-1)},x_2^{(i-1)})}{\partial x_1}$
$x_2^{(i)}=x_2^{(i-1)}-\alpha\cdot\frac{\partial f(x_1^{(i-1)},x_2^{(i-1)})}{\partial x_2}$
repeat the above 2 steps in each iteration, and loop until $\widehat\varepsilon^2$ could be believed to be at its minimum or the iteration index i is large enough.
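
A minimal two-parameter sketch of this loop follows, assuming (my assumption, not the post's) that $\widehat f(x_1,x_2)=x_1+x_2\cdot t$ for known inputs t, so that the two partial derivatives are easy to write down:

```python
import numpy as np

# assumed model: f_hat(x1, x2) = x1 + x2 * t, evaluated at the inputs t below
t = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])           # generated by x1 = 1, x2 = 2
m, alpha = len(y), 0.05

x1, x2 = 0.0, 0.0                            # base point x1^(0), x2^(0)
for i in range(1, 5000):                     # "some large value"
    err = (x1 + x2 * t) - y                  # f_hat(x1, x2) - y for every record
    g1 = err.sum() / m                       # d(eps^2)/d(x1)
    g2 = (err * t).sum() / m                 # d(eps^2)/d(x2)
    x1, x2 = x1 - alpha * g1, x2 - alpha * g2
print(x1, x2)                                # approaches 1.0 and 2.0
```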

next, we explain why we choose to subtract the gradient; actually, there exist 2 possible gradients, +∇ and −∇:
➀for +∇, the equation subtracts the positive gradient at learning rate α > 0, then $x_1^{(i)}=x_1^{(i-1)}-\alpha\cdot\frac{\partial f}{\partial x_1}(x_1^{(i-1)},x_2^{(i-1)})$
the $x_1^{(i)}$ would step back toward the optimal minimum.

➁for −∇, the equation subtracts the negative gradient at learning rate α > 0, which is equivalent to adding its magnitude, then $x_1^{(i)}=x_1^{(i-1)}+\alpha\cdot\left|\frac{\partial f}{\partial x_1}(x_1^{(i-1)},x_2^{(i-1)})\right|$
the $x_1^{(i)}$ would step forward toward the optimal minimum.

both ➀ and ➁ work for $x_2$ in this example as well, we thus prove it!!
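
A one-dimensional numeric check of ➀ and ➁ (my own toy example, assuming f(x) = (x − 3)² with its minimum at x = 3):

```python
# f(x) = (x - 3)^2 has its minimum at x = 3, and f'(x) = 2 * (x - 3)
alpha = 0.25

x = 5.0                          # right of the minimum: gradient = +4 (case ➀)
x = x - alpha * 2 * (x - 3)      # minus a positive gradient steps back to 4.0
print(x)

x = 1.0                          # left of the minimum: gradient = -4 (case ➁)
x = x - alpha * 2 * (x - 3)      # minus a negative gradient steps forward to 2.0
print(x)
```

Both updates move x toward the minimum at x = 3, matching ➀ and ➁ above.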