
Exploit Evaluation By Hoeffding Bounds

Prologue To Exploit Evaluation By Hoeffding Bounds

In the MDP model, exploration is carried out by state transitions, while exploitation proceeds by repeatedly looping over a single state node. Evaluating the number of exploitations by means of the Hoeffding bounds yields a low-error result with high probability, a technique pervasively found in modern reinforcement learning.

How Confident We Are In The Selected Target Bandit Arm::mjtsai1974

[1] The question

We are still pulling bandit arms, and here is the question: how do we know that we have chosen the right bandit? How confident are we in the bandit arm thus chosen?

[2] Recap the Chebyshev's inequality

In my prior post Hoeffding Inequality versus Chebyshev’s Inequality, we learned that for any random variable $Y$, the probability that the difference between $Y$ and its expected value $E\lbrack Y\rbrack$ is greater than or equal to $\varepsilon$ is smaller than or equal to $\frac {Var\lbrack Y\rbrack}{\varepsilon^{2}}$:
$P(\vert Y-E\lbrack Y\rbrack\vert\geq\varepsilon)\leq\frac {Var\lbrack Y\rbrack}{\varepsilon^{2}}$

Stepping further, if we take $X_{i}$ for each distinct $i$-th test, and suppose all $X_{i}$, $i$=$1,2,…,n$ are distributed with expected value $\mu$ and variance $\nu^{2}$, then:
$P(\vert \widehat\mu-\mu\vert\geq\varepsilon)\leq\frac {\nu^{2}}{n\cdot\varepsilon^{2}}$
, where $\widehat\mu$ is the estimator of $\mu$ from all $X_{i}$.

Here we can say that the probability the estimator $\widehat\mu$ is biased by more than $\pm\varepsilon$ is less than or equal to $\frac {\nu^{2}}{n\cdot\varepsilon^{2}}$.
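
As a quick numerical illustration, below is a minimal Python sketch of this Chebyshev bound; the variance, sample size and tolerance values are assumptions chosen only for demonstration:

```python
# Chebyshev bound on the deviation of the sample mean:
#   P(|mu_hat - mu| >= eps) <= var / (n * eps^2)
def chebyshev_bound(var, n, eps):
    """Upper bound on P(|sample mean - true mean| >= eps)."""
    return var / (n * eps ** 2)

# Assumed example values: variance 0.25, n = 100 trials, eps = 0.1
print(chebyshev_bound(0.25, 100, 0.1))  # 0.25, i.e. the deviation probability is at most 25%
```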

The point is: how many test cases, that is, how many times should we run the test, to be confident in this result?

[3] By using the Hoeffding inequality

My prior post Hoeffding Bounds has proved that the following upper bound holds for random variables $X_{i}$, $i$=$1,2,…,n$, that are weakly dependent or independent and identically distributed:
$P(\vert\sum_{i=1}^{n}(Z_i)\vert\ge n\cdot\varepsilon)\le$ 2$\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})$
$\Rightarrow P(\vert\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\vert\ge n\cdot\varepsilon)\le$ 2$\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})$
$\Rightarrow P(\vert\frac {1}{n}\cdot\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\vert\ge\varepsilon)\le$ 2$\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})$
, where
➀$Z_i$=$X_i-E\lbrack X_i\rbrack$ for all $i$
➁$Z_i\in\lbrack a,b\rbrack$

For this Bernoulli bandit arm distribution, we take $a=0$ and $b=1$, and the inequality on the whole average bias becomes:
$P(\vert\frac {1}{n}\cdot\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\vert\ge\varepsilon)\le$ 2$\cdot exp(-2\cdot n\cdot\varepsilon^2)$
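
A minimal sketch below (the sample sizes and $\varepsilon$ are assumed for illustration) shows how quickly the bound 2$\cdot exp(-2\cdot n\cdot\varepsilon^2)$ shrinks as the number of pulls $n$ grows, in contrast to the $1/n$ decay of the Chebyshev bound above:

```python
import math

# Hoeffding bound for X_i bounded in [0, 1]:
#   P(|mean - expectation| >= eps) <= 2 * exp(-2 * n * eps^2)
def hoeffding_bound(n, eps):
    return 2.0 * math.exp(-2.0 * n * eps ** 2)

eps = 0.1  # assumed tolerance
for n in (50, 100, 500, 1000):
    print(f"n={n:5d}  Hoeffding bound={hoeffding_bound(n, eps):.4f}")
# The bound falls off exponentially in n: roughly 0.7358, 0.2707, 0.0001, 4e-09.
```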

Now we treat the $i$-th pull of this same bandit arm as $X_{i}\in\{0,1\}$, take $\widehat u$ to be the estimated average value and $u$ to be the true expected value, which gives the common expression of the Hoeffding inequality:
$P(\vert \widehat u-u\vert\ge\varepsilon)\le$ 2$\cdot exp(-2\cdot n\cdot\varepsilon^2)$

We can formally claim that our bandit arm experiment finally brings us the right arm, whose error bias relative to the true optimal arm exceeds $\varepsilon$ with probability no more than 2$\cdot exp(-2\cdot n\cdot\varepsilon^2)$.
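
To make this claim concrete, here is a Monte Carlo sketch (the arm's success probability, the number of pulls per experiment and $\varepsilon$ are assumed values) that compares the empirical frequency of a deviation of at least $\varepsilon$ against the Hoeffding bound:

```python
import math
import random

random.seed(0)
p, n, eps, trials = 0.6, 100, 0.1, 10000  # assumed arm probability and experiment sizes

# Count how often the estimated mean deviates from p by at least eps
violations = 0
for _ in range(trials):
    mean = sum(1 if random.random() < p else 0 for _ in range(n)) / n
    if abs(mean - p) >= eps:
        violations += 1

print("empirical deviation frequency:", violations / trials)
print("Hoeffding bound             :", 2.0 * math.exp(-2.0 * n * eps ** 2))
# The empirical frequency stays below the bound (about 0.27 here).
```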

If we choose $\delta$ to be this error probability, then
$\delta$=2$\cdot exp(-2\cdot n\cdot\varepsilon^2)$
$\Rightarrow \varepsilon$=$(\sqrt {\frac {1}{2}\cdot ln\frac {2}{\delta}})/\sqrt {n}$
$\Rightarrow n$=$(\frac {1}{2}\cdot ln\frac {2}{\delta})/\varepsilon^{2}$

With probability $1-\delta$, the estimated $\widehat u$ lies within the precision interval $\lbrack u-\varepsilon,u+\varepsilon\rbrack$.
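
These two rearranged formulas translate directly into a small helper sketch; the $\delta$ and $\varepsilon$ values below are assumptions chosen for illustration:

```python
import math

def required_pulls(delta, eps):
    # n = (1/2 * ln(2/delta)) / eps^2
    return 0.5 * math.log(2.0 / delta) / eps ** 2

def precision(delta, n):
    # eps = sqrt(1/2 * ln(2/delta)) / sqrt(n)
    return math.sqrt(0.5 * math.log(2.0 / delta)) / math.sqrt(n)

# Assumed example: confidence 1 - delta = 95% and tolerance eps = 0.1
print(math.ceil(required_pulls(0.05, 0.1)))  # about 185 pulls of the same arm
print(precision(0.05, 185))                  # roughly 0.0998
```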

Addendum

Exploring Exploration, Charles Isbell, Michael Littman, Reinforcement Learning By Georgia Tech (CS8803)