Exploit Evaluation By Hoeffding Bounds
09 Oct 2019
Prologue To Exploit Evaluation By Hoeffding Bounds
How Confident Are We In The Selected Target Bandit Arm::mjtsai1974
[1] The question

We are still pulling bandit arms, and here is the question: how do we know that we have chosen the right bandit? How confident are we in the arm thus chosen?
[2] Recap of Chebyshev's inequality

In my prior post Hoeffding Inequality versus Chebyshev’s Inequality, we learned that for any random variable $Y$, the probability that the difference between $Y$ and its expected value $E\lbrack Y\rbrack$ is greater than or equal to $\varepsilon$ is at most $\frac {Var\lbrack Y\rbrack}{\varepsilon^{2}}$:
$P(\vert Y-E\lbrack Y\rbrack\vert\geq\varepsilon)\leq\frac {Var\lbrack Y\rbrack}{\varepsilon^{2}}$

Going a step further, if we take $X_{i}$ to be the outcome of the $i$-th test, and suppose all $X_{i}$, $i$=$1,2,…,n$ are distributed with expected value $\mu$ and variance $\nu^{2}$, then:
$P(\vert \widehat\mu-\mu\vert\geq\varepsilon)\leq\frac {\nu^{2}}{n\cdot\varepsilon^{2}}$
, where $\widehat\mu$ is the estimator of $\mu$ built from all $X_{i}$. Here we can say the probability that the estimator $\widehat\mu$ is biased by more than $\pm\varepsilon$ is less than or equal to $\frac {\nu^{2}}{n\cdot\varepsilon^{2}}$.
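As a quick numerical sanity check, the sketch below simulates the sample mean of $n$ Bernoulli pulls and compares the empirical deviation rate with the Chebyshev bound $\frac {\nu^{2}}{n\cdot\varepsilon^{2}}$. The arm probability `p`, sample size `n`, and tolerance `eps` are assumed illustration values, not taken from the post.

```python
import random

# Assumed setup: n Bernoulli(p) pulls, so mu = p and nu^2 = p*(1-p).
# Chebyshev bounds P(|mean - mu| >= eps) by nu^2 / (n * eps^2).
random.seed(0)
p, n, eps, trials = 0.5, 100, 0.1, 10000

exceed = 0
for _ in range(trials):
    mean = sum(random.random() < p for _ in range(n)) / n
    if abs(mean - p) >= eps:
        exceed += 1

empirical = exceed / trials
chebyshev_bound = p * (1 - p) / (n * eps ** 2)  # nu^2 / (n * eps^2)
print(empirical, chebyshev_bound)  # empirical rate stays below the bound
```

The empirical rate is typically far below the bound, which is exactly why Chebyshev alone motivates looking for the much tighter Hoeffding bound below.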
The point is: how many tests should we make before we can be confident in this result?
[3] By using the Hoeffding inequality

My prior post Hoeffding Bounds proved that the following upper bound holds for $X_{i}$, $i$=$1,2,…,n$, whether the random variables are weakly dependent or independent and identically distributed:
$P(\vert\sum_{i=1}^{n}(Z_i)\vert\ge n\cdot\varepsilon)\le$ 2$\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})$
$\Rightarrow P(\vert\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\vert\ge n\cdot\varepsilon)\le$ 2$\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})$
$\Rightarrow P(\vert\frac {1}{n}\cdot\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\vert\ge\varepsilon)\le$ 2$\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})$
, where
➀$Z_i$=$X_i-E\lbrack X_i\rbrack$ for all $i$
➁$Z_i\in\lbrack a,b\rbrack$

In this Bernoulli bandit arm setting, we take $a=0$ and $b=1$, and the inequality for the average bias becomes:
$P(\vert\frac {1}{n}\cdot\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\vert\ge\varepsilon)\le$ 2$\cdot exp(-2\cdot n\cdot\varepsilon^2)$

Now we treat the $i$-th pull of this same bandit arm as $X_{i}\in\{0,1\}$, take $\widehat u$ to be the average estimated value and $u$ to be the average expected value, which gives the common Hoeffding inequality expression:
$P(\vert \widehat u-u\vert\ge\varepsilon)\le$ 2$\cdot exp(-2\cdot n\cdot\varepsilon^2)$

We could formally claim that our bandit arm experiment finally brings us the right arm, with the probability that its bias error relative to the true optimal arm exceeds $\varepsilon$ being no more than 2$\cdot exp(-2\cdot n\cdot\varepsilon^2)$.
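The same style of simulation can check this Hoeffding bound for a single Bernoulli arm. Again, the arm probability `p`, sample size `n`, and tolerance `eps` below are assumed illustration values.

```python
import math
import random

# Assumed setup: pull one Bernoulli(p) arm n times and compare the
# empirical deviation rate P(|u_hat - u| >= eps) with the Hoeffding
# bound 2 * exp(-2 * n * eps^2).
random.seed(0)
p, n, eps, trials = 0.6, 200, 0.1, 10000

exceed = sum(
    abs(sum(random.random() < p for _ in range(n)) / n - p) >= eps
    for _ in range(trials)
)
empirical = exceed / trials
hoeffding_bound = 2 * math.exp(-2 * n * eps ** 2)
print(empirical, hoeffding_bound)  # empirical rate stays below the bound
```

Note that for the same $\varepsilon$, the Hoeffding bound shrinks exponentially in $n$, whereas the Chebyshev bound shrinks only as $1/n$.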
If we choose $\delta$ to be this probability of error, then the estimated $\widehat u$ lies within the interval $\lbrack u-\varepsilon,u+\varepsilon\rbrack$ with probability $1-\delta$.
$\delta$=2$\cdot exp(-2\cdot n\cdot\varepsilon^2)$
$\Rightarrow \varepsilon$=$(\sqrt {\frac {1}{2}\cdot ln\frac {2}{\delta}})/\sqrt {n}$
$\Rightarrow n$=$(\frac {1}{2}\cdot ln\frac {2}{\delta})/\varepsilon^{2}$
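The two closed-form expressions above answer the question from the start of the post. A small sketch, with the helper names and example values ($\delta=0.05$, $\varepsilon=0.05$) assumed for illustration:

```python
import math

# Invert the Hoeffding bound delta = 2 * exp(-2 * n * eps^2):
# given delta and eps, n = (ln(2/delta) / 2) / eps^2 pulls suffice;
# given delta and n, the achievable precision is
# eps = sqrt(ln(2/delta) / (2 * n)).

def pulls_needed(delta, eps):
    """Smallest integer n so the Hoeffding error probability is <= delta."""
    return math.ceil((0.5 * math.log(2 / delta)) / eps ** 2)

def precision(delta, n):
    """Tolerance eps achieved after n pulls at confidence 1 - delta."""
    return math.sqrt(0.5 * math.log(2 / delta) / n)

# e.g. 95% confidence (delta = 0.05) that u_hat is within 0.05 of u:
print(pulls_needed(0.05, 0.05))  # -> 738 pulls
```

So roughly 738 pulls of the arm guarantee, with probability at least $0.95$, that the estimate $\widehat u$ lies within $\pm 0.05$ of the true $u$.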