mjtsai1974's Dev Blog Welcome to mjt's AI world

Bayes From Theorem To Practice

Prologue To Bayes From Theorem To Practice

Bayes' theorem is quantitative critical thinking rather than the qualitative thinking of human nature. By using the already known probability of a feature to figure out the unknown probability of the feature of interest via maximum likelihood, the result becomes more plausible.

You Are Given The Question

Given a dog with 3 past weight measurements(lb), $13.9$, $17.5$, $14.1$, what is the exact weight of the dog at this moment? The scale reads $14.2$ lb in the current measurement.

This article is illustrated with the example in How Bayesian inference works, by Brandon Rohrer.

Can We Try The Unbiased Estimator?

[1]Since we are given a sample of $13.9$,$17.5$,$14.1$, how about the unbiased estimator?
➀by $\frac {X_{1}+X_{2}+X_{3}}{3}$=$\overline{X_{3}}$, this uses the sample average to approximate the mean. The drawback is that we have a sample of only a few data points.
➁next we look at the sample variance:
$\frac {\sum_{i=1}^{n}(X_{i}-\overline{X_{n}})^{2}}{n-1}$=$S_{n}^{2}$,
where $S_{n}^{2}$ is the sample variance.
➂next we look at the sample standard deviation:
$S_{n}$=$(\frac {\sum_{i=1}^{n}(X_{i}-\overline{X_{n}})^{2}}{n-1})^{\frac {1}{2}}$ is the sample standard deviation, $n$=$3$ in this example.
➃then, the standard error term:
$se$=$(\frac {\sum_{i=1}^{n}(X_{i}-\overline{X_{n}})^{2}}{n})^{\frac {1}{2}}$ is the standard error in this sample.

[2]All of the above are major terms in modern probability and statistics, and the standard error term has another expression in linear regression:
➀suppose $Y$ is the real target and $\widehat Y$ is the estimated value of the target; in linear regression, the term $RSS$=$\sum_{i=1}^{n} (Y_{i}-\widehat Y_{i})^{2}$ is the residual sum of squares.
➁we denote $(\frac {RSS}{dof})^{\frac {1}{2}}$ as the standard error in linear regression. In this example, $dof$ is unknown, since we have not yet built a linear regression model. By intuition, $dof$=$2$, because one degree of freedom is consumed by the average $\overline{X_{3}}$.
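As a sanity check on ➁, the intercept-only model (taking the sample average as every $\widehat Y_{i}$) makes this standard error computable right away; a minimal sketch, assuming $dof$=$n-1$=$2$ as argued above:

```python
import math

# The three measured weights (lb) from the sample.
weights = [13.9, 17.5, 14.1]
n = len(weights)

y_hat = sum(weights) / n                      # intercept-only fit: every prediction is the mean
rss = sum((y - y_hat) ** 2 for y in weights)  # residual sum of squares
dof = n - 1                                   # one parameter (the average) is estimated

se_regression = math.sqrt(rss / dof)
print(se_regression)                          # about 2.023
```

With $dof$=$2$ this coincides exactly with the sample standard deviation $S_{n}$, because both divide the same sum of squared deviations by $n-1$.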

[3]After simple calculation, we have:
➀$\overline{x_{3}}$=$15.167$, the little case of letter $x$ indicates the value.
➁$S_{n}^{2}$=$4.09333351$
➂$S_{n}$=$2.023$
➃$se$=$1.6519$
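The numbers in [3] can be reproduced with a few lines of Python (a sketch; the variable names are mine):

```python
import math

weights = [13.9, 17.5, 14.1]
n = len(weights)

x_bar = sum(weights) / n                     # sample mean
ss = sum((x - x_bar) ** 2 for x in weights)  # sum of squared deviations

s2 = ss / (n - 1)         # unbiased sample variance S_n^2
s = math.sqrt(s2)         # sample standard deviation S_n
se = math.sqrt(ss / n)    # the standard error term from [1]➃

print(round(x_bar, 3), round(s, 3), round(se, 4))  # 15.167 2.023 1.6519
```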

Before We Start

[1]We’d like to evaluate the possibly accurate weight. As given, we have 3 already known weights in the sample. Suppose the weights follow a normal distribution.
➀by $\overline{x_{3}}$=$15.167$, $S_{n}$=$2.023$, the shape of the normal distribution is exhibited below:
➁by $\overline{x_{3}}$=$15.167$, $se$=$1.6519$, the shape of the normal distribution is exhibited below:
The shape of the normal distribution is sharper if we take the population standard deviation(standard error), which is smaller, as the standard deviation. This indicates some bias exists between the 2 sources of standard deviation: the sample standard deviation and the population standard deviation(standard error).

[2]We have 3 already known weights in the sample; what are $P(13.9)$,$P(14.1)$ and $P(17.5)$?
➀should we treat $P(13.9)$=$P(14.1)$=$P(17.5)$=$\frac {1}{3}$, since each measurement comes out with exactly one value of weight, each occurring once in the sample?
➁or by using $\frac {1}{\sigma\cdot\sqrt {2\cdot \pi}}\cdot e^{-\frac {(x-\mu)^{2}}{2\cdot\sigma^{2}}}$ to calculate the respective probability?
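The two options give different answers. Under ➁, taking $\mu$=$\overline{x_{3}}$ and $\sigma$=$S_{n}$ (my choice here, purely for illustration), the Gaussian densities are unequal, unlike the uniform $\frac {1}{3}$ of ➀; note these are densities, not probabilities:

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian density, as in option ➁."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 15.167, 2.023   # sample mean and sample standard deviation
dens = {x: normal_pdf(x, mu, sigma) for x in (13.9, 14.1, 17.5)}
print(dens)                 # 14.1 gets the highest density, 17.5 the lowest
```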

There is no standard answer, yet!!!

Translate The Target Into The Correct Probabilistic Expression

It seems that the target of the accurate weight is very clear, but how to translate it into the correct probabilistic expression is the major topic.
We denote $w$ as the weight and $m$ as each measurement. How should we express the weight each time the dog is measured? This choice determines the prior and the target posterior. Below are 2 of my viewpoints:
[1]$P(w\vert m)$=$\frac {P(m\vert w)\cdot P(w)}{P(m)}$
➀this asks for the maximum probability of the real weight given the measurement; the target posterior is $P(w\vert m)$, and $P(w)$ is the already known weight(lb), the prior.
➁suppose the measurement process is under certain quality control, and we have no doubt about the result.

[2]$P(m\vert w)$=$\frac {P(w\vert m)\cdot P(m)}{P(w)}$
➀this asks for the maximum probability that the measurement result is accurate, given the weight(believed to be verified and correct); the target posterior is $P(m\vert w)$, and $P(m)$ is the already known measurement result(lb), the prior.
➁this expression is mainly useful when we'd like to evaluate the accuracy of the measurement process or result; it fits the case where we doubt the measurement result and would like to further enhance the measurement equipment or the process.

[3]Therefore, we choose $P(w\vert m)$=$\frac {P(m\vert w)\cdot P(w)}{P(m)}$ as our Bayes expression, and $P(m\vert w)$ is the likelihood function for the probability of the measurement outcomes, $13.9$,$17.5$,$14.1$, given the true weight $w$.

A Dead Corner Leading To Nowhere

[1] We are still stuck in a corner with no escape. Why?

Choosing $P(w\vert m)$=$\frac {P(m\vert w)\cdot P(w)}{P(m)}$ as our Bayes expression only settles the matter at the intuition level.
➀we don’t know how to make $P(w)$,$P(m)$,$P(m\vert w)$ quantitative.
➁should we treat $P(m\vert w)$ to be in uniform or Gaussian distribution?

[2] Try to escape.

To be compliant with the Bayes expression of our choice, let's go back to the inference of the total probability $P(m)$ of the measurement.
➀suppose the dog's weight varies over time; thus we are given 3 different measurement results of weight.
$P(m)$=$P(m\vert w_{1})\cdot P(w_{1})$+$P(m\vert w_{2})\cdot P(w_{2})$+$P(m\vert w_{3})\cdot P(w_{3})$
➁we believe there indeed exists a true value of the dog's weight, $w_{real}$. Based on the total probability of these 3 measurements, we'd like to estimate the probability of such $w_{real}$ by $P(w_{real}\vert m)$.
➂to further regularize our Bayes expression, let it becomes:
$P(w_{real}\vert m)$=$\frac {P(m\vert w_{real})\cdot P(w_{real})}{P(m)}$, where $P(m)$ and $P(w_{real})$ are just constants.
➃we can toss out the 2 terms $P(m)$ and $P(w_{real})$. The working model now becomes:
$P(w_{real}\vert m)\propto P(m\vert w_{real})$

This is the possible direction to escape from the dead corner, a corner built from the chosen Bayes expression and the insufficient sample data.

The Maximum Likelihood For $P(w_{real}\vert m)$

[1] The working model filled with given data

$P(w_{real}\vert m)\propto P(m\vert w_{real})$ leads us to the most likely possible real weight.
➀fill in what we are given:
$P(w_{real}\vert m=\{13.9,14.1,17.5\})$
=$P(m=\{13.9,14.1,17.5\}\vert w_{real})$
➁next, interpret $P(m\vert w_{real})$ by the Gaussian normal distribution.

This is fully compliant with the assumption that the accuracy of the measurement result of weight is of no concern, and that the weight varies over time, thus coming out with different results.

[2] The way to do the maximum likelihood estimation

➀it proceeds with a given possible real weight to evaluate $P(m=\{13.9,14.1,17.5\}\vert w_{real})$; the $w_{real}$ generating the maximum value will yield the largest probability $P(w_{real}\vert m=\{13.9,14.1,17.5\})$.
➁$\underset{w_{real}}{\arg\max}\,P(m=\{13.9,14.1,17.5\}\vert w_{real})$
=$\underset{w_{real}}{\arg\max}\,P(m=\{13.9\}\vert w_{real})$
$\;\;\;\;\cdot P(m=\{14.1\}\vert w_{real})$
$\;\;\;\;\cdot P(m=\{17.5\}\vert w_{real})$
This is just the maximum likelihood estimation for $w_{real}$.
➂to concentrate more probability mass toward $\overline{X_{3}}$, this article chooses $\overline{x_{3}}$=$15.167$, $se$=$1.6519$ to build the Gaussian normal distribution.
Take $w_{real}\in\{\overline{X_{3}}\pm se\}$, ranging from $13.5151$ to $16.8189$.
➃this range does not include the given measurement result $17.5$; to make it more complete, take:
$w_{real}$=$\{13,…,18\}$
let the data or the experiment result speaks!!

In my Python simulator, the maximum likelihood estimate of $w_{real}$ is $15.126$, whichever of the sample or population standard deviation we choose to build the normal distribution.
The estimated $w_{real}$=$15.126$ is very close to $\overline{x_{3}}$=$15.167$; is this the most appropriate value?
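The maximum likelihood search over the range $\{13,…,18\}$ can be sketched as follows. This is my own reconstruction, not the author's simulator, and the exact argmax depends on the grid resolution; with a fine step of $0.001$ it lands essentially on the sample mean:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

measurements = [13.9, 14.1, 17.5]
sigma = 1.6519   # the se; the argmax over w is the same with the sample sd 2.023

def likelihood(w):
    """P(m={13.9,14.1,17.5} | w_real): product of the per-measurement densities."""
    p = 1.0
    for m in measurements:
        p *= normal_pdf(m, w, sigma)
    return p

# Candidate real weights covering the widened range from 13 to 18.
grid = [13 + 0.001 * k for k in range(5001)]
w_hat = max(grid, key=likelihood)
print(w_hat)   # essentially the sample mean 15.167
```

For a Gaussian likelihood the product is maximized exactly at the sample mean, which is why the grid search lands there.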

Make The Bayes Inference With The Given Prior

[1] The Bayes inference should begin with the given prior

Please recall that the scale tells us the dog weighs $14.2$ lb in the current measurement, and this is the prior to make the Bayes inference with.
➀assume the given prior, the belief that $w_{real}$=$14.2$, is true.
➁since $14.2$ falls within the existing sample space, we can just take $\mu$=$14.2$ and the same sample or population standard deviation to build the mother normal distribution.

[2] Refine the working model

➀$P(w_{real}\vert m)\propto P(m\vert w_{real})\cdot P(w_{real})$
the Bayes inference should begin with the given prior; that's why we keep the term $P(w_{real})$ on the right side.
➁the term $P(m)$ is just a constant and can be safely tossed out.

[3] Design the new flow

We’ll still use the maximum likelihood estimation for the real weight, with some changes to the flow:
➀take $\mu$=$14.2$ and the same sample or population standard deviation to build the mother normal distribution to simulate the real weight.
The red solid line is the normal distribution with the sample standard deviation; the dashed blue line is the population standard deviation version.
➁iterate through each measured weight in the sample space.
➂take the current iterated measured weight as $\mu$ to build a new normal distribution, centrally distributed in accordance with it.
➃calculate $P(w_{cur})$, the probability density of the currently iterated weight under the normal distribution built in ➀.
➄calculate the standard deviation of this newly built normal distribution.
$P_{\mu=14.2}(17.5)$=$\frac {1}{\sigma_{\mu=17.5}\cdot\sqrt{2\cdot\pi}}\cdot e^{-\frac {1}{2}\cdot\frac {(x-17.5)^{2}}{\sigma_{\mu=17.5}^{2}}}$
$\Rightarrow\sigma_{\mu=17.5}$=$\frac {1}{P_{\mu=14.2}(17.5)\cdot\sqrt{2\cdot\pi}}$
;where $x=17.5$, which zeroes the exponent, and $P_{\mu=14.2}(17.5)$=$\frac {1}{\sigma_{\mu=14.2}\cdot\sqrt{2\cdot\pi}}\cdot e^{-\frac {1}{2}\cdot\frac {(17.5-14.2)^{2}}{\sigma_{\mu=14.2}^{2}}}$
thus, we know the bandwidth of the normal distribution with regard to $\mu$=$17.5$.
After the calculation with the sample standard deviation $2.023$, we have $\sigma_{\mu=13.9}$=$2.046$, $\sigma_{\mu=14.1}$=$2.026$, $\sigma_{\mu=17.5}$=$7.651$.
Calculating with the population standard deviation $1.652$, we have $\sigma_{\mu=13.9}$=$1.679$, $\sigma_{\mu=14.1}$=$1.655$, $\sigma_{\mu=17.5}$=$12.149$.
➅by using the maximum likelihood estimation(see [4]) to find out the maximum of $P(w_{real}\vert m)$:
$P(w_{real}\vert m)\propto P(m\vert w_{real})\cdot P(w_{real})$
➆back to ➁ until all sample weights are iterated.
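Steps ➀–➄ can be sketched as follows (my reconstruction; `sigma_mother` is the sample standard deviation here, and swapping in $1.652$ gives the population version):

```python
import math
import statistics

SQRT_2PI = math.sqrt(2 * math.pi)

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * SQRT_2PI)

measurements = [13.9, 14.1, 17.5]
mu_prior = 14.2                                # the prior weight from the scale
sigma_mother = statistics.stdev(measurements)  # sample standard deviation, about 2.023

# Step ➄: the density of each measured weight under the mother N(14.2, sigma_mother)
# fixes the spread of the new distribution centered at that weight.
sigmas = {}
for m in measurements:
    p = normal_pdf(m, mu_prior, sigma_mother)
    sigmas[m] = 1 / (p * SQRT_2PI)             # sigma_mu = 1 / (P * sqrt(2*pi))

print(sigmas)   # about 2.046, 2.026 and 7.651, matching the values above
```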

[4] Bayes maximum likelihood estimation

$P(w_{real}\vert m)\propto P(m\vert w_{real})\cdot P(w_{real})$ leads us to the most likely possible real weight.
➀when iterating $17.5$, $w_{real}$=$17.5$.
$P(w_{real}\vert m=\{13.9,14.1,17.5\})$
=$P(m=\{13.9,14.1,17.5\}\vert w_{real})\cdot P(w_{real})$
=$P_{N(\mu=17.5)}(m=\{13.9\}\vert 17.5)\cdot P_{N(\mu=14.2)}(17.5)$
$\;\;\;\;\cdot P_{N(\mu=17.5)}(m=\{14.1\}\vert 17.5)\cdot P_{N(\mu=14.2)}(17.5)$
$\;\;\;\;\cdot P_{N(\mu=17.5)}(m=\{17.5\}\vert 17.5)\cdot P_{N(\mu=14.2)}(17.5)$
➁let’s regularize it:
$\underset{w_{real}}{\arg\max}\,P(m=\{13.9,14.1,17.5\}\vert w_{real})\cdot P(w_{real})$
=$\underset{w_{real}}{\arg\max}\,P(m=\{13.9\}\vert w_{real})\cdot P(w_{real})$
$\;\;\;\;\cdot P(m=\{14.1\}\vert w_{real})\cdot P(w_{real})$
$\;\;\;\;\cdot P(m=\{17.5\}\vert w_{real})\cdot P(w_{real})$
This is just the maximum likelihood estimation for $w_{real}$ in this Bayes inference.
➂below exhibits the new distributions with regard to all sampled weights and the mother sample standard deviation.
Next it depicts the new distributions with regard to all sampled weights and the mother population standard deviation.
Trivially, the new distribution centered at $17.5$ has only a small probability of being the real weight.

In my Python simulator, the maximum likelihood estimate of $w_{real}$ is $14.1$, whichever of the sample or population standard deviation we choose to build the mother normal distribution.

It seems that the final posterior is determined by the given prior.
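The whole Bayes flow of [3]–[4] can be reproduced in a few lines (my sketch; the candidates are restricted to the sampled weights, as in the flow above):

```python
import math
import statistics

SQRT_2PI = math.sqrt(2 * math.pi)

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * SQRT_2PI)

measurements = [13.9, 14.1, 17.5]
mu_prior = 14.2                                # the prior from the scale
sigma_mother = statistics.stdev(measurements)  # sample sd; the population sd gives the same argmax

def posterior(w):
    """P(m | w_real) * P(w_real) per factor, as written in [4]."""
    prior = normal_pdf(w, mu_prior, sigma_mother)  # P(w_real) under the mother distribution
    sigma_w = 1 / (prior * SQRT_2PI)               # spread of the new distribution centered at w
    p = 1.0
    for m in measurements:
        p *= normal_pdf(m, w, sigma_w) * prior
    return p

w_hat = max(measurements, key=posterior)
print(w_hat)   # 14.1
```

The candidate $17.5$ is heavily penalized twice: its prior density under $N(14.2, \sigma)$ is small, and that small density widens its own distribution, flattening the likelihood of every measurement.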

Addendum

How Bayesian inference works, Brandon Rohrer
Translated article of ➀
Bayes theorem and conditional probability