
Jacobian and Integral

Jacobian

Just as the Lagrange multiplier is a great add-in for handling constraints in differentiation to reach a local optimum, the Jacobian is often treated as the ratio between two coordinate systems of reference in the transformation carried out during integration. Before proceeding any further, taking a good look at this article would be quite helpful.

What Is Jacobian?

➀Begin with a simple example, given $\int_a^bf(x)\operatorname dx$:
\(\begin{array}{l}take\;x=h(y),\;differentiate\;x\;on\;y,\\\Rightarrow\frac{\operatorname dx}{\operatorname dy}=Dh(y),\\\Rightarrow\operatorname dx=Dh(y)\cdot\operatorname dy\\\end{array}\)

➁then, we have the deduction below:
\(\begin{array}{l}\int_a^bf(x)\operatorname dx=\int_c^df(h(y))\cdot Dh(y)\operatorname dy,\\where\;c=h(a),d=h(b),\operatorname dx=Dh(y)\cdot\operatorname dy\end{array}\)

➂we can say that $Dh(y)$ is the ratio from $\operatorname dx$ to $\operatorname dy$; it is the Jacobian.
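To make ➀~➂ concrete, here is a minimal numerical sketch (not from the original post, assuming Python with numpy) that checks the substitution $x=h(y)=y^2$, $Dh(y)=2y$: both trapezoid sums approximate the same value.

```python
# Check ∫f(x)dx = ∫f(h(y))·Dh(y)dy with x = h(y) = y^2, so Dh(y) = 2y.
import numpy as np

f = np.cos                      # any integrand works for the check
a, b = 0.0, 1.0                 # x ranges over [a, b]; y ranges over [0, 1] since h(y) = y^2

x = np.linspace(a, b, 100001)
lhs = np.sum((f(x[:-1]) + f(x[1:])) * 0.5 * np.diff(x))      # ∫ f(x) dx

y = np.linspace(0.0, 1.0, 100001)
g = f(y**2) * 2.0 * y                                        # f(h(y)) · Dh(y)
rhs = np.sum((g[:-1] + g[1:]) * 0.5 * np.diff(y))            # ∫ f(h(y)) · Dh(y) dy

print(lhs, rhs)                 # both ≈ sin(1) ≈ 0.84147
```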

Normalize The Jacobian

Next, this article goes a little deeper into the field by using a double integral to prove the formula:
\(\iint\limits_Rf(x,y)\operatorname dx\operatorname dy=\iint\limits_Sf(h(u,v),k(u,v))\cdot\left|J\right|\operatorname du\operatorname dv\)

Let me give you a hint: this proof just relates the two sides of the above equation through $\left|J\right|$, which is simply the ratio arising from the transformation of the system of coordinates of reference from $x$, $y$ to $u$, $v$.
proof:
Take $x=h(u,v)$ and $y=k(u,v)$; we will prove it by illustration:

➀suppose we are transforming from the coordinate system <$x$, $y$> to <$u$, $v$>, where the small surface area $R$ in the <$x$, $y$> system is the image of the small surface area $S$ in the <$u$, $v$> system. We just have $area(R)=\overset\rightharpoonup{MN}\otimes\overset\rightharpoonup{MQ}$, where $\otimes$ means the cross product.
\(\begin{array}{l}\overset\rightharpoonup{MN}=\left\langle h(u+\triangle u,v)-h(u,v),k(u+\triangle u,v)-k(u,v)\right\rangle;\\where\;\triangle u\approx0,then\;we\;have\\\lim_{\triangle u\rightarrow0}\frac{h(u+\triangle u,v)-h(u,v)}{\triangle u}=\frac{\partial{h(u,v)}}{\partial u}\\\lim_{\triangle u\rightarrow0}\frac{k(u+\triangle u,v)-k(u,v)}{\triangle u}=\frac{\partial{k(u,v)}}{\partial u}\end{array}\)

➁to get back to $\overset\rightharpoonup{MN}$, we need to multiply both the $h$ and $k$ terms by $\triangle u$. Then, we have:
\(\overset\rightharpoonup{MN}=\left\langle\frac{\partial{h(u,v)}}{\partial u}\cdot\triangle u,\frac{\partial{k(u,v)}}{\partial u}\cdot\triangle u\right\rangle\)

➂the above deduction also holds for $\overset\rightharpoonup{MQ}$; applying it, we get the similar result:
\(\overset\rightharpoonup{MQ}=\left\langle\frac{\partial{h(u,v)}}{\partial v}\cdot\triangle v,\frac{\partial{k(u,v)}}{\partial v}\cdot\triangle v\right\rangle\)

➃thus, take the absolute value of the cross product, $area(R)=\left|\overset\rightharpoonup{MN}\otimes\overset\rightharpoonup{MQ}\right|$.

➄continue to illustrate by geometric cross product:
\(\begin{array}{l}\begin{vmatrix}i&j&k\\\frac{\partial{h(u,v)}}{\partial u}\cdot\triangle u&\frac{\partial{k(u,v)}}{\partial u}\cdot\triangle u&0\\\frac{\partial{h(u,v)}}{\partial v}\cdot\triangle v&\frac{\partial{k(u,v)}}{\partial v}\cdot\triangle v&0\end{vmatrix}\\=\frac{\partial{h(u,v)}}{\partial u}\cdot\triangle u\cdot\frac{\partial{k(u,v)}}{\partial v}\cdot\triangle v-\frac{\partial{h(u,v)}}{\partial v}\cdot\triangle v\cdot\frac{\partial{k(u,v)}}{\partial u}\cdot\triangle u\\=\begin{bmatrix}\frac{\partial{h(u,v)}}{\partial u}\cdot\frac{\partial{k(u,v)}}{\partial v}-\frac{\partial{h(u,v)}}{\partial v}\cdot\frac{\partial{k(u,v)}}{\partial u}\end{bmatrix}\cdot\triangle u\cdot\triangle v\end{array}\)

➅making one more step, we can take the first part to be $\left|J\right|$, expressed as a $2\times2$ determinant:
\(\begin{array}{l}\begin{bmatrix}\frac{\partial{h(u,v)}}{\partial u}\cdot\frac{\partial{k(u,v)}}{\partial v}-\frac{\partial{h(u,v)}}{\partial v}\cdot\frac{\partial{k(u,v)}}{\partial u}\end{bmatrix}\\={\begin{vmatrix}\frac{\partial{h(u,v)}}{\partial u}&\frac{\partial{k(u,v)}}{\partial u}\\\frac{\partial{h(u,v)}}{\partial v}&\frac{\partial{k(u,v)}}{\partial v}\end{vmatrix}}_{2\times2\;determinant}\\=\begin{vmatrix}J\end{vmatrix}\end{array}\)

➆finally, we have the conclusion below:
\(\begin{array}{l}area(R)\\=\triangle x\cdot\triangle y\\\cong\operatorname dx\cdot\operatorname dy\\\cong\left|J\right|\cdot\triangle u\cdot\triangle v\\\cong\left|J\right|\cdot\operatorname du\cdot\operatorname dv\\=\left|J\right|\cdot area(S)\end{array}\)

➇at this ending stage, we can also formulate the equation below, where the Jacobian is just the ratio from $area(R)$ to $area(S)$:
\(area(R)=\left|J\right|\cdot area(S)\)
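As a sanity check of the double-integral formula, below is a minimal numerical sketch (not from the original post, assuming Python with numpy) using polar coordinates, where $x=h(r,t)=r\cos t$, $y=k(r,t)=r\sin t$ and $\left|J\right|=r$; both crude Riemann sums approach the same value as the grids refine.

```python
# Check ∬f(x,y)dxdy over the unit disk against ∬f(h,k)·|J| drdt with |J| = r.
import numpy as np

def f(x, y):
    return x**2 + y**2              # any integrand works for the check

n = 1001
xs = np.linspace(-1.0, 1.0, n)
ys = np.linspace(-1.0, 1.0, n)
X, Y = np.meshgrid(xs, ys)
inside = (X**2 + Y**2) <= 1.0       # region R: the unit disk in the <x, y> system
dxdy = (xs[1] - xs[0]) * (ys[1] - ys[0])
lhs = np.sum(f(X, Y) * inside) * dxdy

rs = np.linspace(0.0, 1.0, n)
ts = np.linspace(0.0, 2.0 * np.pi, n)
rr, tt = np.meshgrid(rs, ts)        # region S: a rectangle in the <u, v> = <r, t> system
drdt = (rs[1] - rs[0]) * (ts[1] - ts[0])
rhs = np.sum(f(rr * np.cos(tt), rr * np.sin(tt)) * rr) * drdt

print(lhs, rhs)                     # both ≈ pi/2 ≈ 1.5708
```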

Binary and Multiple Classification

Classification

Classification can be categorized into two major varieties: binary and multiple classification. The major purpose of binary classification is to identify an object as one of two mutually exclusive identities by means of logistic regression. Multiple classification comes with a wider range of identities, which might be 0, 1, 2, 3,…, n; nevertheless, it uses one-versus-all, which is built on the same algorithm as binary classification.

Binary Classification

Given input data of size n, binary classification classifies it as true or false, or as another pair of alternative identities such as 0 and 1, and so on.
Before we step into multiple classification, a full illustration of the logistic regression model is mandatory.

Why We Need The Logistic Regression Model?

Suppose we’d like to classify the given input data as 0 or 1, for example identifying a tumor by its size as malignant or benign.
We are given m records of tumor history and come out with the fitted linear regression line in the graph labeled ➀, where the (m-1)-th record is identified as non-malignant ($Y=0$).

After records m+1 and m+2 have been added to the given sample, we get the new fitted regression line in the graph labeled ➁; watch out that the (m-1)-th record is now identified as malignant ($Y=1$)!!!

Caution must be taken that, traditionally, the fitted linear regression line could be biased by the input sample, and $h_\theta(x)>1\;or\;<0$ could happen.

What we want for binary classification in this example is to classify
\(\left\{\begin{array}{l}Y=1,\;h_\theta(x)\geq0.5\\Y=0,\;h_\theta(x)<0.5\end{array}\right.\) The linear regression model is likely to have $h_\theta(x)>1\;or\;<\;0$; while the logistic regression model has $0\leq h_\theta(x)\leq1$!!!

The Logistic Regression Function

It is also named the sigmoid function, and is defined as: \(h_\theta(x)=\frac1{1+e^{-\theta^t\cdot x}},\;where\;\left\{\begin{array}{l}\theta\in R^n,\;M_{n\times1}\\x\in R^n,\;M_{n\times1}\end{array}\right.\)

We can then have $h_\theta(x)=P(Y=1\vert x;\theta),\;the\;probability\;of\;Y=1\;given\;x\;and\;\theta$
➀$P(Y=1\vert x;\theta)+P(Y=0\vert x;\theta)=1$
➁$P(Y=1\vert x;\theta)=1-P(Y=0\vert x;\theta)$
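A minimal sketch of the sigmoid hypothesis in Python (not from the original post; the numbers below are hypothetical):

```python
import numpy as np

def h(theta, x):
    # h_theta(x) = 1 / (1 + exp(-theta^t · x)) = P(Y = 1 | x; theta)
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

theta = np.array([-1.0, 0.8])   # hypothetical coefficients; theta_1 weights the bias term
x = np.array([1.0, 2.5])        # x_1 = 1 is the intercept/bias term
p1 = h(theta, x)
print(p1, 1.0 - p1)             # P(Y=1|x;theta) + P(Y=0|x;theta) = 1
```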

The Logistic Regression Cost Function

Can we reuse the cost function of the linear regression model, $J(\theta)=\frac1{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$, as the logistic version of the cost function? The answer is no!!! Because the cost function would then not be a smooth convex curve for gradient descent.

The major purpose of the cost function is to reduce the error of the model, so that gradient descent can then be applied to find the $\theta$ that really has the smallest error.
Two key points must be clarified:
➀when the label is 1 and $P(Y=1)\approx1$, the error of $P(Y=1)$ from 1 is $\approx0$
➁when the label is 0 and $P(Y=1)\approx0$, the error of $P(Y=1)$ from 0 is $\approx0$
We thus define the cost function as:
\(\left\{\begin{array}{l}-\log(h_\theta(x)),\;for\;Y=1\\-\log(1-h_\theta(x)),\;for\;Y=0\end{array}\right.\)

Why do we use log? Why do we have a minus sign?

Please recall that $\log1=0$ and that the log of a probability in $(0,1)$ is negative: the minus sign turns it into a positive cost, and the cost approaches 0 whenever $h_\theta(x)\approx1$ for $Y=1$ or $h_\theta(x)\approx0$ for $Y=0$.

Next, to formulate it,
\(\begin{array}{l}J(\theta)=\frac1m\sum_{i=1}^m cost(h_\theta(x^{(i)}),y^{(i)}),\\where\;cost(h_\theta(x^{(i)}),y^{(i)})=\left\{\begin{array}{l}-\log(h_\theta(x^{(i)})),\;for\;y^{(i)}=1\\-\log(1-h_\theta(x^{(i)})),\;for\;y^{(i)}=0\end{array}\right.\end{array}\)

The graph exhibits some findings:
\(\left\{\begin{array}{l}cost(h_\theta(x^{(i)}),y^{(i)})=0,\;if\;h_\theta(x^{(i)})=y^{(i)}\\cost(h_\theta(x^{(i)}),y^{(i)})\approx\infty,\;if\;h_\theta(x^{(i)})\approx0,\;y^{(i)}=1\\cost(h_\theta(x^{(i)}),y^{(i)})\approx\infty,\;if\;h_\theta(x^{(i)})\approx1,\;y^{(i)}=0\end{array}\right.\)

Now, we combine the two cases into one formula:
\(\begin{array}{l}J(\theta)=-\frac1m\sum_{i=1}^m\lbrack y^{(i)}\cdot\log(h_\theta(x^{(i)}))+(1-y^{(i)})\cdot\log(1-h_\theta(x^{(i)}))\rbrack,\\where\;\left\{\begin{array}{l}y^{(i)}=1,\;we\;have\;1-y^{(i)}=0\\y^{(i)}=0,\;we\;have\;1-y^{(i)}=1\end{array}\right.\end{array}\) Thus, we could have the 2 cases operating in a mutually exclusive flow.
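Here is a minimal sketch (not from the original post, assuming Python with numpy) of the combined cost $J(\theta)$ above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    # X: m x n with the first column all ones (the bias term x_1 = 1),
    # theta: n-vector, y: m-vector with entries in {0, 1}.
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```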

Regularization of The Logistic Regression Cost Function

Please keep in mind the need to regularize the cost function, or you might struggle in an embarrassing overfitting situation. We put an extra term at the end of the existing cost function:
\(\begin{array}{l}\underset{REG}{J(\theta)}=-\frac1m\sum_{i=1}^m\lbrack y^{(i)}\cdot\log(h_\theta(x^{(i)}))+(1-y^{(i)})\cdot\log(1-h_\theta(x^{(i)}))\rbrack+\frac\lambda{2m}\sum_{j=1}^n\theta_j^2;\\where\;\theta\in R^n,\;M_{n\times1}\end{array}\)

For ease of understanding, we illustrate by intuition:
➀$\frac{\partial J(\theta)}{\partial\theta_j}=\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)},\;for\;j=1,$ which is the bias term, with no need for regularization!!!
➁$\frac{\partial J(\theta)}{\partial\theta_j}=\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}+\frac\lambda m\theta_j,\;for\;j>1$
;where the term $h_\theta(x^{(i)})-y^{(i)}$ is just the intuitive concept; the proof should carry out the derivation on the regularized version of the cost function. Next, we take action.

Differentiating the first part of the cost function, we have the deduction below:
\(\begin{array}{l}J(\theta)=-\frac1m\sum_{i=1}^m\lbrack y^{(i)}\cdot\log(h_\theta(x^{(i)}))+(1-y^{(i)})\cdot\log(1-h_\theta(x^{(i)}))\rbrack;\\where\;\;h_\theta(x)=\frac1{1+e^{-\theta^t\cdot x}},\;\theta\in R^n,\;x\in R^n,\\\frac{\partial J(\theta)}{\partial\theta_j}=\frac1m\sum_{i=1}^m\lbrack-y^{(i)}\cdot(h_\theta(x^{(i)}))^{-1}\cdot\frac{\partial h_\theta(x^{(i)})}{\partial\theta_j}-(1-y^{(i)})\cdot(1-h_\theta(x^{(i)}))^{-1}\cdot\frac{\partial{(1-h_\theta(x^{(i)}))}}{\partial\theta_j}\rbrack\end{array}\)

We expand further:
➀the first order partial differential of $h_\theta(x^{(i)})$:
\(\begin{array}{l}\frac{\partial(h_\theta(x^{(i)}))}{\partial\theta_j}\\=\frac{\partial({\displaystyle\frac1{1+e^{-\theta^t\cdot x^{(i)}}}})}{\partial\theta_j}\\=\frac{\partial(1+e^{-\theta^t\cdot x^{(i)}})^{-1}}{\partial\theta_j}\\=(-1)\cdot\lbrack1+e^{-\theta^t\cdot x^{(i)}}\rbrack^{-2}\cdot(-x_j^{(i)}\cdot e^{-\theta^t\cdot x^{(i)}})\\=\lbrack1+e^{-\theta^t\cdot x^{(i)}}\rbrack^{-2}\cdot(x_j^{(i)}\cdot e^{-\theta^t\cdot x^{(i)}})\end{array}\)

➁extend $1-h_\theta(x^{(i)})$ as:
\(1-h_\theta(x^{(i)})=1-\frac1{1+e^{-\theta^t\cdot x^{(i)}}}=\frac{e^{-\theta^t\cdot x^{(i)}}}{1+e^{-\theta^t\cdot x^{(i)}}}\)

Substitute ➀ and ➁ into the first-order partial differential of $J(\theta)$:
\(\begin{array}{l}\frac{\partial J(\theta)}{\partial\theta_j}\\=\frac1m\sum_{i=1}^m\lbrack-y^{(i)}\cdot(1+e^{-\theta^t\cdot x^{(i)}})\cdot(1+e^{-\theta^t\cdot x^{(i)}})^{-2}\cdot(x_j^{(i)}\cdot e^{-\theta^t\cdot x^{(i)}})+\\(1-y^{(i)})\cdot(\frac{1+e^{-\theta^t\cdot x^{(i)}}}{e^{-\theta^t\cdot x^{(i)}}})\cdot(1+e^{-\theta^t\cdot x^{(i)}})^{-2}\cdot(x_j^{(i)}\cdot e^{-\theta^t\cdot x^{(i)}})\rbrack\\=\frac1m\sum_{i=1}^m\lbrack-y^{(i)}\cdot(1+e^{-\theta^t\cdot x^{(i)}})^{-1}\cdot(x_j^{(i)}\cdot e^{-\theta^t\cdot x^{(i)}})+\\(1-y^{(i)})\cdot e^{\theta^t\cdot x^{(i)}}\cdot(1+e^{-\theta^t\cdot x^{(i)}})^{-1}\cdot(x_j^{(i)}\cdot e^{-\theta^t\cdot x^{(i)}})\rbrack\\=\frac1m\sum_{i=1}^m\lbrack-y^{(i)}\cdot(1+e^{-\theta^t\cdot x^{(i)}})^{-1}\cdot(e^{-\theta^t\cdot x^{(i)}}\cdot x_j^{(i)})+\\(1-y^{(i)})\cdot(1+e^{-\theta^t\cdot x^{(i)}})^{-1}\cdot x_j^{(i)}\rbrack\\=\frac1m\sum_{i=1}^m\lbrack-y^{(i)}\cdot(e^{-\theta^t\cdot x^{(i)}}\cdot x_j^{(i)})+(1-y^{(i)})\cdot x_j^{(i)}\rbrack\cdot(1+e^{-\theta^t\cdot x^{(i)}})^{-1}\\=\frac1m\sum_{i=1}^m\lbrack-y^{(i)}\cdot(e^{-\theta^t\cdot x^{(i)}}\cdot x_j^{(i)})+(1-y^{(i)})\cdot x_j^{(i)}\rbrack\cdot h_\theta(x^{(i)})\end{array}\)

Finally, taking the derivative of the second (regularization) part together with it, we get the total solution:
\(\frac{\partial J(\theta)_{REG}}{\partial\theta}=\frac{\displaystyle1}{\displaystyle m}\sum_{i=1}^m\lbrack-x^{(i)}\cdot y^{(i)}\cdot e^{-\theta^t\cdot x^{(i)}}+x^{(i)}\cdot(1-y^{(i)})\rbrack\cdot h_\theta(x^{(i)})+\frac\lambda m\theta\)
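The expression above simplifies to the intuitive form given in ➁ earlier, $\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}+\frac\lambda m\theta_j$. Below is a minimal sketch of it (not from the original post, assuming Python with numpy), leaving the bias term un-regularized as noted in ➀ earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_reg(theta, X, y, lam):
    # X: m x n with the first column all ones; theta: n-vector; y in {0, 1}^m
    m = len(y)
    h = sigmoid(X @ theta)
    grad = (X.T @ (h - y)) / m       # (1/m) * sum (h - y) * x_j for every j
    reg = (lam / m) * theta
    reg[0] = 0.0                     # the bias term is not regularized
    return grad + reg
```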

Multiple Classification

We now migrate to multiple classification. There might exist more than one pattern in the given sampling; how do you plan to classify an object to its correct identity? We introduce the most often used method, one-versus-all.
Take $h_\theta^{(k)}(x^{(i)})=P(y^{(i)}=k\vert\theta^{(k)},x^{(i)});\;k=1,2,3,\;i=1\;to\;m,\;the\;count\;of\;the\;data\;set$. Next, we illustrate the one-versus-all method:

➀train with $h_\theta^{(1)}(x^{(i)})$ to get the desired $\theta^{(1)}$ vector for class 1.

➁train with $h_\theta^{(2)}(x^{(i)})$ to get the desired $\theta^{(2)}$ vector for class 2.

➂train with $h_\theta^{(3)}(x^{(i)})$ to get the desired $\theta^{(3)}$ vector for class 3.

After each $\theta^{(k)}$ has been figured out, for any new input $x^{(i)}$, where $i>m$, use $h_\theta^{(k)}(x^{(i)})$ to predict the output, and choose the maximum one as the object identity.

That is to say, take the $k$ that attains $\underset k{max}\;h_\theta^{(k)}(x^{(i)})$ as the identity, class $k$.
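A minimal sketch of the one-versus-all prediction step (not from the original post, assuming Python with numpy and that each $\theta^{(k)}$ has already been trained):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(thetas, x):
    # thetas: K x n matrix, one trained theta^(k) per row; x: n-vector with x_1 = 1
    scores = sigmoid(thetas @ x)        # h_theta^(k)(x) for k = 1..K
    return int(np.argmax(scores)) + 1   # classes numbered 1, 2, 3, ...
```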

Overfitting versus Regularization

Regularization

Regularization aims at eliminating overfitting, whereas underfitting should be improved by adding extra features to your hypothesis model.

Overfitting

Overfitting usually occurs with the symptom that the hypothesis model fits the training data very well, but, when the model processes real data, it wouldn’t predict as well as it seemed to in the training stage.

Regularization by means of Lagrangian

When we suspect our hypothesis model is overfitting, it should be regularized by means of the Lagrange multiplier. For ease of explanation and better understanding, in this article we’d like to illustrate with the linear regression model.
Take the cost function $J(\theta)=\frac1{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$, where the purpose of S.L.R. is to minimize $J(\theta)$; suppose that you already realize what the cost function is for.
Given $h_\theta(x)=\theta^t\cdot x$, where \(\theta=\begin{bmatrix}\theta_1\\\theta_2\end{bmatrix}\) is the vector of coefficients.
Take \(X=\begin{bmatrix}x_1\\x_2\end{bmatrix}\) as the input data, with $x_1=1$; it is the intercept, or the bias term. The superscript $(i)$ of $X$, $Y$ is the index of the input data record, meaning it is the i-th input data $X$, $Y$.
Now, by putting an extra term after the cost function, we can regularize our existing cost function:
\(\begin{array}{l}\underset\theta{min}\;J(\theta)=\frac1{2m}(\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n\theta_j^2);\;where\;h_\theta(x)=\theta^t\cdot x\\\theta_j=\theta_j-\alpha\frac{\partial J(\theta)}{\partial\theta_j}\\\;\;\;\;=\theta_j-\frac\alpha m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}-\frac{\alpha\lambda}m\theta_j;\;for\;j=1\;to\;n\end{array}\)
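A minimal sketch of one such update step (not from the original post, assuming Python with numpy), following the formula above literally, i.e. regularizing every $\theta_j$ for $j=1$ to $n$:

```python
import numpy as np

def regularized_gd_step(theta, X, Y, alpha, lam):
    # X: m x n with the first column all ones (the bias term); theta: n; Y: m
    m = len(Y)
    errors = X @ theta - Y                  # h_theta(x^(i)) - y^(i) for all i
    return theta - (alpha / m) * (X.T @ errors) - (alpha * lam / m) * theta
```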

Why Lagrangian Achieves Regularization?

Suppose you all have the basic intuition about the Lagrange multiplier. In this section, we will take the squared error part as the level curve, subject to the constraint function of a circle.
Suppose again that the level curve hits the circle at the point $x$, very close to the local extreme point of the circle, say $a$. We’d like to bring the delta ($\delta$) down as close to zero as possible, that is, $\nabla f\approx\nabla g$.
Since $\nabla g$ is the normal vector, perpendicular to the constraint circle at the local extreme point $a$, we can then believe that we have reduced the projection error of the delta ($\delta$), and $\nabla f$ would be as plausible a solution as $\nabla g$.

In the next topic, we will further illustrate regularization in logistic regression for binary classification.

Gradient Descent

Gradient Descent

Gradient descent is most widely used in machine learning, with respect to linear regression and logistic regression for supervised learning. But the gradient descent itself actually knows nothing about when to stop!!

Begin from the Cost Function in Simple Linear Regression (S.L.R.)

take the cost function $J(\theta)=\frac1{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$, where the purpose of S.L.R. is to minimize $J(\theta)$.
given $h_\theta(x)=\theta^t\cdot x$, where \(\theta=\begin{bmatrix}\theta_1\\\theta_2\end{bmatrix}\) is the vector of coefficients
take \(X=\begin{bmatrix}x_1\\x_2\end{bmatrix}\) as the input data
with $x_1=1$; it is the intercept, or the bias term.
as for the division by 2m, it is an artificial design:
➀$\frac12$ is to eliminate the power of 2 in the square after differentiation is taken.
➁$\frac1m$ is to average all squared errors of the m input rows of data.
➂the superscript $(i)$ of $X$, $Y$ is the index of the input data record, meaning it is the i-th input data $X$, $Y$

Step into Gradient ∇ of $J(\theta)$

suppose we are given m records of input data, $X$ and $Y$. ➀take the derivative of $J(\theta)$ with respect to $\theta_j$, then we obtain:

\(\frac{\partial J(\theta)}{\partial\theta_j}=\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}\)

since we are running batch gradient descent, $\frac1m$ is required to average over the records.
➁if $m=1$, then we just have the j-th term as:

\(\frac{\partial J(\theta)}{\partial\theta_j}=\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}\)

Learning Rate α

it is used to make the result of gradient descent converge. A smaller value is believed to produce less fluctuation.

Put It Together

put everything together, and we will have the j-th term of $\theta$; it is inclusive of the bias term $\theta_1$:

\(\theta_j=\theta_j-\alpha\cdot\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)},\;where\;j=1\;to\;n,\;\theta\in R^n\)

\(given\;X\in M_{m\times n},\;\theta\in M_{n\times1},\;Y\in M_{m\times1},\;x^{(i)}\in M_{n\times1},\)

\(X=\begin{bmatrix}-(x^{(1)})^t-\\-(x^{(2)})^t-\\...\\-(x^{(m)})^t-\end{bmatrix}_{m\times n}\)

and we just deduce:

\(\frac{\partial J(\theta)}{\partial\theta_j}=\frac1m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}=(\begin{bmatrix}-(x^{(1)})^t-\\-(x^{(2)})^t-\\...\\-(x^{(m)})^t-\end{bmatrix}_{m\times n}\cdot\begin{bmatrix}\theta_1\\\theta_2\\...\\\theta_n\end{bmatrix}_{n\times1}-\begin{bmatrix}y^{(1)}\\y^{(2)}\\...\\y^{(m)}\end{bmatrix}_{m\times1})^t\cdot\begin{bmatrix}x_j^{(1)}\\x_j^{(2)}\\...\\x_j^{(m)}\end{bmatrix}\cdot\frac1m\)
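Putting the update rule into code, here is a minimal sketch of batch gradient descent (not from the original post, assuming Python with numpy; the learning rate and iteration count are hypothetical):

```python
import numpy as np

def batch_gradient_descent(X, Y, alpha=0.1, iterations=1000):
    # X: m x n with the first column all ones (x_1 = 1, the bias term); Y: m-vector
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        errors = X @ theta - Y                  # h_theta(x^(i)) - y^(i) for all i
        theta -= (alpha / m) * (X.T @ errors)   # update every theta_j at once
    return theta
```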

Why the Gradient Descent Works?

We will illustrate with a cost function of multiple parameters:

➀given $f(x_1,x_2)$, a multi-parameter cost function, the gradient $\nabla$ of $f$ consists of $\frac{\partial f}{\partial x_1}$, $\frac{\partial f}{\partial x_2}$
➁suppose $y=f(x_1,x_2)$
➂take $\hat\varepsilon^2=\frac1{2m}\cdot\sum_{i=1}^m(\hat f(x_1,x_2)^{(i)}-y^{(i)})^2$
➃to get the minimum value of $\hat\varepsilon^2$, we have to figure it out by the expression below:

\(\hat\varepsilon^2=\underset{x_1,x_2}{min}\;\frac1{2m}\cdot\sum_{i=1}^m(\hat f(x_1,x_2)^{(i)}-y^{(i)})^2\)

, where $\hat\varepsilon^2$ is a convex function
➄gradient descent is the algorithm that can gradually lead to the final $x_1$, $x_2$ that minimize $\hat\varepsilon^2$

suppose we are beginning from some base point $x_1^{(0)}$, $x_2^{(0)}$, where the superscript $(0)$ is the iteration index, indicating that it is the 0-th iteration.

for $i=1$ to some large value:
$x_1^{(i)}=x_1^{(i-1)}-\alpha\cdot\frac{\partial f(x_1^{(i-1)},x_2^{(i-1)})}{\partial x_1}$
$x_2^{(i)}=x_2^{(i-1)}-\alpha\cdot\frac{\partial f(x_1^{(i-1)},x_2^{(i-1)})}{\partial x_2}$
repeat the above 2 steps in each iteration; loop until $\hat\varepsilon^2$ can be believed to be at its minimum or the iteration index $i$ is large enough.

next, we explain why we subtract the gradient; actually, there exist 2 cases of the gradient, $+\nabla$ and $-\nabla$:
➀for $+\nabla$, the equation subtracts the positive gradient at learning rate $\alpha>0$; then, $x_1^{(i)}=x_1^{(i-1)}-\alpha\cdot\frac{\partial f(x_1^{(i-1)},x_2^{(i-1)})}{\partial x_1}$
the $x_1^{(i)}$ would step back toward the optimal minimum.

➁for $-\nabla$, the equation subtracts the negative gradient at learning rate $\alpha>0$; then, since the gradient is negative, $x_1^{(i)}=x_1^{(i-1)}+\alpha\cdot\left|\frac{\partial f(x_1^{(i-1)},x_2^{(i-1)})}{\partial x_1}\right|$
the $x_1^{(i)}$ would step forward toward the optimal minimum.

both ➀ and ➁ also work for $x_2$ in this example; we thus prove it!!
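To see ➀ and ➁ in action, here is a minimal sketch (not from the original post, in plain Python) of the two-parameter loop above on the convex function $f(x_1,x_2)=(x_1-3)^2+(x_2+1)^2$:

```python
alpha = 0.1
x1, x2 = 10.0, -10.0              # x1^(0), x2^(0): an arbitrary starting point
for i in range(200):
    g1 = 2.0 * (x1 - 3.0)         # ∂f/∂x1 at the previous iterate (positive here)
    g2 = 2.0 * (x2 + 1.0)         # ∂f/∂x2 at the previous iterate (negative here)
    x1, x2 = x1 - alpha * g1, x2 - alpha * g2
print(x1, x2)                     # approaches the minimum at (3, -1)
```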

QR Decomposition

QR Decomposition

QR decomposition claims that A = B⋅X = B⋅D⋅E⋅X = Q⋅R. This article will show you the way to prove it: we begin from the Gram-Schmidt procedure, then briefly introduce the projection matrix, then migrate to the triangular matrix, and finally prove the QR decomposition.

Gram-Schmidt Procedure

Given a set of linearly independent vectors S = {v1, v2,…, vp} ⊂ Rm,
define vectors ui, 1 ≤ i ≤ p by
ui = vi − [((vi)t ⋅ u1) ∕ ((u1)t ⋅ u1)] ⋅ u1 − [((vi)t ⋅ u2) ∕ ((u2)t ⋅ u2)] ⋅ u2 − [((vi)t ⋅ u3) ∕ ((u3)t ⋅ u3)] ⋅ u3 − … − [((vi)t ⋅ ui−1) ∕ ((ui−1)t ⋅ ui−1)] ⋅ ui−1
the set T = {u1, u2,…, up} is a linearly independent orthogonal set and span(S) = span(T)
we just have the following:
u1 = v1
u2 = v2 − [((v2)t ⋅ u1) ∕ ((u1)t ⋅ u1)] ⋅ u1
u3 = v3 − [((v3)t ⋅ u1) ∕ ((u1)t ⋅ u1)] ⋅ u1 − [((v3)t ⋅ u2) ∕ ((u2)t ⋅ u2)] ⋅ u2
u4 = v4 − [((v4)t ⋅ u1) ∕ ((u1)t ⋅ u1)] ⋅ u1 − [((v4)t ⋅ u2) ∕ ((u2)t ⋅ u2)] ⋅ u2 − [((v4)t ⋅ u3) ∕ ((u3)t ⋅ u3)] ⋅ u3

up = vp − [((vp)t ⋅ u1) ∕ ((u1)t ⋅ u1)] ⋅ u1 − [((vp)t ⋅ u2) ∕ ((u2)t ⋅ u2)] ⋅ u2 − … − [((vp)t ⋅ up−1) ∕ ((up−1)t ⋅ up−1)] ⋅ up−1

Prove Gram-Schmidt Procedure by means of Projection Matrix

Begin with the projection matrix to prove the Gram-Schmidt procedure, as illustrated in the pic below:
Project y onto column space of x

take x ⋅ b = yproj, the projection of y onto C(x), where C(x) is the column space spanned by the vector x
C(x) ⊥ (y − x ⋅ b), then
=>C(x) ⋅ (y − x ⋅ b) = 0,
=>xt ⋅ (y − x ⋅ b) = 0,
=>xt ⋅ x ⋅ b = xt ⋅ y,
=>b = (xt ⋅ x)⁻ ⋅ xt ⋅ y; where (xt ⋅ x)⁻ is the generalized inverse of (xt ⋅ x),
=>yproj = x ⋅ (xt ⋅ x)⁻ ⋅ xt ⋅ y
∵x is itself a nonzero column vector in the spanning set/basis, (xt ⋅ x)⁻ = (xt ⋅ x)−1 just holds
∴yproj = [(xt ⋅ y) ∕ (xt ⋅ x)] ⋅ x = [(yt ⋅ x) ∕ (xt ⋅ x)] ⋅ x

To further explain the Gram-Schmidt procedure in terms of the projection matrix:

take y as v2, x as u1, where u1 = v1, then
u2 = v2 − [((u1)t ⋅ v2) ∕ ((u1)t ⋅ u1)] ⋅ u1, where the second term is just the projection of v2 onto u1
u3 = v3 − Projw2(v3), where w2 = Span(u1, u2)
     = v3 − Proju1(v3) − Proju2(v3)
     = v3 − [((u1)t ⋅ v3) ∕ ((u1)t ⋅ u1)] ⋅ u1 − [((u2)t ⋅ v3) ∕ ((u2)t ⋅ u2)] ⋅ u2
the flow is exhibited in the pic below:

u4 = v4 − Projw3(v4), where w3 = Span(u1, u2, u3)
     = v4 − [((u1)t ⋅ v4) ∕ ((u1)t ⋅ u1)] ⋅ u1 − [((u2)t ⋅ v4) ∕ ((u2)t ⋅ u2)] ⋅ u2 − [((u3)t ⋅ v4) ∕ ((u3)t ⋅ u3)] ⋅ u3

finally, we can reach
up = vp − [((u1)t ⋅ vp) ∕ ((u1)t ⋅ u1)] ⋅ u1 − [((u2)t ⋅ vp) ∕ ((u2)t ⋅ u2)] ⋅ u2 − … − [((up−1)t ⋅ vp) ∕ ((up−1)t ⋅ up−1)] ⋅ up−1

Further refine the notation in Gram-Schmidt and its formula representation:

take S = {v1, v2,…, vp} to be S = {a1, a2,…, ap}
take T = {u1, u2,…, up} to be T = {b1, b2,…, bp}
b1 = a1
b2 = a2 − [((b1)t ⋅ a2) ∕ ((b1)t ⋅ b1)] ⋅ b1
b3 = a3 − [((b1)t ⋅ a3) ∕ ((b1)t ⋅ b1)] ⋅ b1 − [((b2)t ⋅ a3) ∕ ((b2)t ⋅ b2)] ⋅ b2
b4 = a4 − [((b1)t ⋅ a4) ∕ ((b1)t ⋅ b1)] ⋅ b1 − [((b2)t ⋅ a4) ∕ ((b2)t ⋅ b2)] ⋅ b2 − [((b3)t ⋅ a4) ∕ ((b3)t ⋅ b3)] ⋅ b3

bp = ap − [((b1)t ⋅ ap) ∕ ((b1)t ⋅ b1)] ⋅ b1 − [((b2)t ⋅ ap) ∕ ((b2)t ⋅ b2)] ⋅ b2 − [((b3)t ⋅ ap) ∕ ((b3)t ⋅ b3)] ⋅ b3 − … − [((bp−1)t ⋅ ap) ∕ ((bp−1)t ⋅ bp−1)] ⋅ bp−1
At this moment, the proof has validated Gram-Schmidt by means of the projection matrix.
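A minimal sketch of this procedure in code (not from the original post, assuming Python with numpy): it returns B = [b1|b2|…|bp] given A = [a1|a2|…|ap] with linearly independent columns.

```python
import numpy as np

def gram_schmidt(A):
    # b_i = a_i - sum over j < i of [ (b_j^t · a_i) / (b_j^t · b_j) ] · b_j
    A = np.asarray(A, dtype=float)
    m, p = A.shape
    B = np.zeros((m, p))
    for i in range(p):
        b = A[:, i].copy()
        for j in range(i):
            b -= (B[:, j] @ A[:, i]) / (B[:, j] @ B[:, j]) * B[:, j]
        B[:, i] = b
    return B                      # columns are pairwise orthogonal
```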

Express Gram-Schmidt Procedure in Matrix Product

Advance one step to represent the Gram-Schmidt procedure by a matrix product:

if we take Xi,j = ((bi)t ⋅ aj) ∕ ((bi)t ⋅ bi), then, we could have:
b1 = a1
b2 = a2 − X1,2 ⋅ b1
b3 = a3 − X1,3 ⋅ b1 − X2,3 ⋅ b2
b4 = a4 − X1,4 ⋅ b1 − X2,4 ⋅ b2 − X3,4 ⋅ b3

bp = ap − X1,p ⋅ b1 − X2,p ⋅ b2 − X3,p ⋅ b3 − … − Xp−2,p ⋅ bp−2 − Xp−1,p ⋅ bp−1

Then, express ai in terms of the bi's:

a1 = b1
a2 = b2 + X1,2 ⋅ b1
a3 = b3 + X1,3 ⋅ b1 + X2,3 ⋅ b2
a4 = b4 + X1,4 ⋅ b1 + X2,4 ⋅ b2 + X3,4 ⋅ b3

ap = bp + X1,p ⋅ b1 + X2,p ⋅ b2 + X3,p ⋅ b3 + … + Xp−2,p ⋅ bp−2 + Xp−1,p ⋅ bp−1

Further refine:

take Xi,i = 1, that is
a1 = X1,1 ⋅ b1
a2 = X1,2 ⋅ b1 + X2,2 ⋅ b2
a3 = X1,3 ⋅ b1 + X2,3 ⋅ b2 + X3,3 ⋅ b3
a4 = X1,4 ⋅ b1 + X2,4 ⋅ b2 + X3,4 ⋅ b3 + X4,4 ⋅ b4

ap = X1,p ⋅ b1 + X2,p ⋅ b2 + X3,p ⋅ b3 + … + Xp−2,p ⋅ bp−2 + Xp−1,p ⋅ bp−1 + Xp,p ⋅ bp

take X as an upper unit triangular matrix where Xi,i = 1
X =
[ X1,1  X1,2  X1,3  X1,4  ....  X1,p ]
[  0    X2,2  X2,3  X2,4  ....  X2,p ]
[  0     0    X3,3  X3,4  ....  X3,p ]
[  0     0     0    X4,4  ....  X4,p ]
[ ................................. ]
[  0     0     0     0    ....  Xp,p ]

then,
Am×p = [a1|a2|…|ap], where ai ∈ Rm,
Bm×p = [b1|b2|…|bp], where bi ∈ Rm,

take D = diag(||b1||−1, ||b2||−1, ||b3||−1,…, ||bp||−1)
take Q = B ⋅ D = [(b1 ∕ ||b1||)|(b2 ∕ ||b2||)|(b3 ∕ ||b3||)|…|(bp ∕ ||bp||)]
take E = diag(||b1||, ||b2||, ||b3||,…, ||bp||)…so that D ⋅ E = I, to eliminate D
take R = E ⋅ X, which scales the i-th row of X by ||bi||…an upper triangular matrix
then, A = B ⋅ X = B ⋅ D ⋅ E ⋅ X = Q ⋅ R…the QR decomposition, and such Q ⋅ R is unique, ∵B ⋅ X is unique, too.
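Here is a minimal sketch (not from the original post, assuming Python with numpy) assembling Q and R exactly as constructed above: Q holds the normalized columns bi ∕ ||bi||, and R = E ⋅ X is upper triangular with Ri,i = ||bi|| since Xi,i = 1.

```python
import numpy as np

def qr_by_gram_schmidt(A):
    A = np.asarray(A, dtype=float)
    m, p = A.shape
    Q = np.zeros((m, p))
    R = np.zeros((p, p))
    for i in range(p):
        b = A[:, i].copy()
        for j in range(i):
            R[j, i] = Q[:, j] @ A[:, i]     # equals ||b_j|| · X_{j,i}
            b -= R[j, i] * Q[:, j]
        R[i, i] = np.linalg.norm(b)         # equals ||b_i||, since X_{i,i} = 1
        Q[:, i] = b / R[i, i]               # b_i / ||b_i||, a column of Q = B · D
    return Q, R

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Q, R = qr_by_gram_schmidt(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(2)))   # True True
```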