
Hoeffding Bounds

Prologue To The Hoeffding Bounds

The Hoeffding lemma and the Hoeffding inequality appear pervasively in papers and lectures on machine learning, statistics, and probability theory. The inequality bounds the probability that a sum of independent, bounded random variables deviates from its expectation, and it suggests the sample size needed to keep the bias within a desired tolerance.

The Mjtsai1974 Light Weight Upper Bound

Before stepping into the major topic, let's take a tiny break with a rather simple inequality that constrains the same kind of random variable as the Hoeffding lemma. It is just a trivial finding made during my proof of these Hoeffding bounds; I nicknamed it the Mjtsai1974 Light Weight Upper Bound.

Given X\in\lbrack a,b\rbrack, a random variable with E\lbrack X\rbrack=0, then for any s>0, we have E\lbrack e^{s\cdot X}\rbrack\le e^{-s\cdot (a-b)}

proof::mjtsai1974

➀by given X\in\lbrack a,b\rbrack, take \lambda=\frac {x-a}{b-a},
we just have x=(1-\lambda)\cdot a+\lambda\cdot b.
➁since g(E\lbrack s\cdot X\rbrack)\le E\lbrack g(s\cdot X)\rbrack by Jensen's inequality for any convex function g(X)=e^{X}, then
e^{(1-\lambda)\cdot s\cdot a+\lambda\cdot s\cdot b}\le (1-\lambda)\cdot e^{s\cdot a}+\lambda\cdot e^{s\cdot b}
E\lbrack e^{(1-\lambda)\cdot s\cdot a+\lambda\cdot s\cdot b}\rbrack\le E\lbrack (1-\lambda)\cdot e^{s\cdot a}+\lambda\cdot e^{s\cdot b}\rbrack
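As a quick numerical illustration of this convexity step (the endpoints a, b, the value of s and the grid of \lambda below are my own arbitrary assumptions, not values from the proof), the left-hand side never exceeds the right-hand side:

import math

a, b, s = -1.0, 2.0, 0.8              # illustrative endpoints and s > 0
for k in range(11):
    lam = k / 10.0                    # lambda sweeps over [0, 1]
    lhs = math.exp((1 - lam) * s * a + lam * s * b)
    rhs = (1 - lam) * math.exp(s * a) + lam * math.exp(s * b)
    assert lhs <= rhs + 1e-12         # convexity of e^x
print("convexity check passed")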
➂that is to say, for any x\in X, we have
E\lbrack e^{s\cdot x}\rbrack
\le E\lbrack (1-\lambda)\cdot e^{s\cdot a}+\lambda\cdot e^{s\cdot b}\rbrack
=E\lbrack (\lambda+(1-\lambda)\cdot e^{s\cdot a-s\cdot b})\cdot e^{s\cdot b}\rbrack
=E\lbrack (\frac {x-a}{b-a}+(\frac {b-x}{b-a})\cdot e^{s\cdot a-s\cdot b})\cdot e^{s\cdot b}\rbrack
=(\frac {-a}{b-a}+(\frac {b}{b-a})\cdot e^{s\cdot (a-b)})\cdot e^{s\cdot b}…since E\lbrack x\rbrack=0
, so bounding E\lbrack e^{s\cdot x}\rbrack reduces to bounding this expression.
➃take p=\frac {-a}{b-a}, then 1-p=\frac {b}{b-a}. From these two equalities, we can deduce two expressions for b:
b=\frac {a\cdot (p-1)}{p} and b=(1-p)\cdot (b-a)
Choose b=(1-p)\cdot (b-a) to express b, then the above inequality becomes
E\lbrack e^{s\cdot x}\rbrack
\le (\frac {-a}{b-a}+(\frac {b}{b-a})\cdot e^{s\cdot (a-b)})\cdot e^{s\cdot b}
=((1-p)\cdot e^{s\cdot (a-b)}+p)\cdot e^{s\cdot (1-p)\cdot (b-a)}
=((1-p)\cdot e^{s\cdot (a-b)}+p)\cdot e^{s\cdot (p-1)\cdot (a-b)}
The reason we choose this expression for b is to simplify the inequality.
➄take u=a-b, \theta(u)=((1-p)\cdot e^{s\cdot u}+p)\cdot e^{s\cdot (p-1)\cdot u}, then
ln\theta(u)
=ln((1-p)\cdot e^{s\cdot u}+p)+s\cdot (p-1)\cdot u
\le ln((1-p)+p)+s\cdot (p-1)\cdot u…since e^{s\cdot u}\le 1 for u=a-b\le 0
=s\cdot (p-1)\cdot u
\le -s\cdot u…since -1\le p-1\le 0 and u\le 0
, then \theta(u)\le e^{-s\cdot u}
Therefore, E\lbrack e^{s\cdot X}\rbrack\le e^{-s\cdot (a-b)}=e^{s\cdot (b-a)} is thus proved. This is a simple, lightweight inequality, which I denote the Mjtsai1974 Light Weight Upper Bound.
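As a minimal numerical sanity check of this bound (a sketch under my own assumptions: a two-point distribution on {a, b} with zero mean, and a few arbitrary choices of a, b, s), E\lbrack e^{s\cdot X}\rbrack stays below \theta(u), which in turn stays below e^{s\cdot (b-a)}:

import math

def check(a, b, s):
    p_b = -a / (b - a)                      # mass on b so that E[X] = 0
    mgf = (1 - p_b) * math.exp(s * a) + p_b * math.exp(s * b)   # E[e^{sX}]
    p = -a / (b - a)                        # p as defined in step ➃
    u = a - b                               # u as defined in step ➄
    theta = ((1 - p) * math.exp(s * u) + p) * math.exp(s * (p - 1) * u)
    bound = math.exp(s * (b - a))
    assert mgf <= theta + 1e-12 <= bound + 1e-12
    return mgf, theta, bound

print(check(-1.0, 1.0, 1.0))                # (1.5430..., 1.5430..., 7.3890...)
print(check(-0.5, 2.0, 0.7))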

The Hoeffding Lemma

Given X\in\lbrack a,b\rbrack, it's a random variable with E\lbrack X\rbrack=0, then for any s>0, we have E\lbrack e^{s\cdot X}\rbrack\le e^{\frac {s^{2}\cdot (b-a)^{2}}{8}}

proof::mjtsai1974

We inherit from ➄ in the proof of the Mjtsai1974 Light Weight Upper Bound and begin from ln\theta(u)=ln((1-p)\cdot e^{s\cdot u}+p)+s\cdot (p-1)\cdot u, where u=a-b, \theta(u)=((1-p)\cdot e^{s\cdot u}+p)\cdot e^{s\cdot (p-1)\cdot u}.
➀take M(u)=ln\theta(u), by Taylor's theorem, for w\in\lbrack u,0\rbrack, where in this case u\le 0, we have
\lim_{w\rightarrow 0}M(w)
\approx ln\theta(u)+\frac {ln^{′}\theta(u)}{1!}\cdot (w-u)+\frac {ln^{″}\theta(u)}{2!}\cdot (w-u)^2+\frac {ln^{′″}\theta(u)}{3!}\cdot (w-u)^3+…
\le ln\theta(0)+\frac {ln^{′}\theta(0)}{1!}\cdot (0-u)+\frac {ln^{″}\theta(0)}{2!}\cdot (0-u)^2+\frac {ln^{′″}\theta(0)}{3!}\cdot (0-u)^3+…
➁from the above, we must work out the first and second derivatives; they are
ln^{′}\theta(u)=(p+(1-p)\cdot e^{s\cdot u})^{-1}\cdot ((1-p)\cdot s\cdot e^{s\cdot u})+s\cdot(p-1)
ln^{″}\theta(u)=-1\cdot (p+(1-p)\cdot e^{s\cdot u})^{-2}\cdot ((1-p)\cdot s\cdot e^{s\cdot u})^{2}
\;\;\;\;\;\;\;\;+(p+(1-p)\cdot e^{s\cdot u})^{-1}\cdot ((1-p)\cdot s^2\cdot e^{s\cdot u})
And, ln\theta(0)=0, ln^{′}\theta(0)=0,
ln^{″}\theta(0)=-1\cdot((1-p)\cdot s)^2+(1-p)\cdot s^2=p\cdot (1-p)\cdot s^2
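A small numerical cross-check of these values at u=0, using central finite differences; the particular p and s below are arbitrary assumptions for illustration:

import math

p, s = 0.3, 1.7

def ln_theta(u):
    return math.log((1 - p) * math.exp(s * u) + p) + s * (p - 1) * u

h = 1e-5
first = (ln_theta(h) - ln_theta(-h)) / (2 * h)                       # approximates ln'theta(0)
second = (ln_theta(h) - 2 * ln_theta(0.0) + ln_theta(-h)) / h ** 2   # approximates ln''theta(0)
print(ln_theta(0.0), first, second, p * (1 - p) * s ** 2)            # ~0, ~0, ~0.6069, 0.6069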
➂therefore, we’d like to maximize the term ln^{″}\theta(u),
take z=e^{s\cdot u}, s=\frac {ln(z)}{u}, then
ln^{″}\theta(u)…further simplified
=\frac {p\cdot (1-p)\cdot s^2\cdot e^{s\cdot u}}{(p+(1-p)\cdot e^{s\cdot u})^{2}}
=\frac {p\cdot (1-p)\cdot (\frac {ln(z)}{u})^2\cdot z}{(p+(1-p)\cdot z)^{2}}
=s^2\cdot\frac {p\cdot (1-p)\cdot z}{(p+(1-p)\cdot z)^{2}}…where s^2=(\frac {ln(z)}{u})^2
➃take N(z)=\frac {p\cdot (1-p)\cdot z}{(p+(1-p)\cdot z)^{2}}, so that ln^{″}\theta(u)=s^2\cdot N(z). Maximizing ln^{″}\theta(u) is then equivalent to asking for the maximum of N(z); the formal procedure is to take its first derivative with respect to z, set it to 0, and solve for the candidate z in terms of p. In this case, z is the input parameter of N(z).
N′(z)=\frac {p^{2}\cdot (1-p)-p\cdot (1-p)^{2}\cdot z}{(p+(1-p)\cdot z)^{3}}=0
Solving it, we get z=\frac {p}{1-p} as the critical point.
➄take z=\frac {p}{1-p} in N(z)
N(z)=\frac {p\cdot (1-p)\cdot\frac {p}{1-p}}{(p+(1-p)\cdot\frac {p}{1-p})^{2}}=\frac {p^{2}}{4\cdot p^{2}}=\frac {1}{4}
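A quick grid check that N(z) indeed peaks at z=\frac {p}{1-p} with value \frac {1}{4}; the choice p=0.3 and the grid below are illustrative assumptions of mine:

p = 0.3
N = lambda z: p * (1 - p) * z / (p + (1 - p) * z) ** 2
zs = [0.001 * k for k in range(1, 20001)]       # z in (0, 20]
z_best = max(zs, key=N)
print(z_best, N(z_best))                        # ~ p/(1-p) = 0.4286..., ~ 0.25
print(p / (1 - p), N(p / (1 - p)))              # exactly 0.25 at the critical point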
Recall that ln^{″}\theta(u)=s^2\cdot N(z); going back to the Taylor expansion in ➀, we have
ln\theta(u)
=M(u)
\le \frac {ln^{″}\theta(0)}{2!}\cdot(u)^{2}
=\frac {s^{2}\cdot N(z)}{2!}\cdot (u)^2
\le \frac {1}{2!}\cdot s^{2}\cdot\frac {1}{4}\cdot u^{2}
=\frac {1}{8}\cdot s^{2}\cdot u^{2}
Therefore, \theta(u)\le e^{\frac {1}{8}\cdot s^{2}\cdot u^{2}}
➅finally, we just proved that
E\lbrack e^{s\cdot X}\rbrack\le \theta(u)\le e^{\frac {1}{8}\cdot s^{2}\cdot u^{2}}=e^{\frac {s^{2}\cdot (a-b)^{2}}{8}}=e^{\frac {s^{2}\cdot (b-a)^{2}}{8}}
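As a minimal numerical sanity check of the lemma (again a sketch under my own assumption of a two-point zero-mean distribution on {a, b}, with arbitrary a, b and a few values of s):

import math

def mgf_two_point(a, b, s):
    p_b = -a / (b - a)                          # mass on b so that E[X] = 0
    return (1 - p_b) * math.exp(s * a) + p_b * math.exp(s * b)

a, b = -1.0, 3.0
for s in (0.1, 0.5, 1.0, 2.0):
    lhs = mgf_two_point(a, b, s)
    rhs = math.exp(s ** 2 * (b - a) ** 2 / 8)
    print(s, lhs, rhs, lhs <= rhs)              # expect True for every s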

The Hoeffding Inequality

Given that X_1,X_2,…,X_n are independent random variables with a_i\le X_i\le b_i, then
\;\;\;\;P(\frac {1}{n}\cdot\sum_{1}^{n}(X_i-E\lbrack X_i\rbrack)\ge\varepsilon)\le exp(\frac {-2\cdot n^2\cdot\varepsilon^2}{\sum_{1}^{n}(b_i-a_i)^2})

Proof:
We’d like to prove it by the Hoeffding lemma.
➀take Z_i=X_i-E\lbrack X_i\rbrack for all i, then Z_i\in\lbrack a_i-E\lbrack X_i\rbrack,b_i-E\lbrack X_i\rbrack\rbrack, an interval of width b_i-a_i, and E\lbrack Z_i\rbrack=0 just holds.
Since \frac {1}{n}\cdot\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)=\frac {1}{n}\cdot\sum_{i=1}^{n}Z_i, we have
P(\frac {1}{n}\cdot\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\ge\varepsilon)=P(\frac {1}{n}\cdot\sum_{i=1}^{n}(Z_i)\ge\varepsilon)
\Rightarrow P(\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\ge n\cdot\varepsilon)=P(\sum_{i=1}^{n}(Z_i)\ge n\cdot\varepsilon)
➁take t=n\cdot\varepsilon, then we have
P(\sum_{i=1}^{n}(Z_i)\ge n\cdot\varepsilon)
=P(\sum_{i=1}^{n}(Z_i)\ge t)
=P(exp(\sum_{i=1}^{n}(s\cdot Z_i))\ge exp(s\cdot t))
\le E\lbrack (\prod_{i=1}^{n}exp(s\cdot Z_i))\rbrack\cdot exp(-s\cdot t)…by Markov's inequality
=\prod_{i=1}^{n} E\lbrack exp(s\cdot Z_i)\rbrack\cdot exp(-s\cdot t)…by independence of the Z_i
\le exp(-s\cdot t)\cdot\prod_{i=1}^{n} exp(\frac {s^2\cdot (b_i-a_i)^2}{8})…by the Hoeffding lemma
=exp(\sum_{i=1}^{n}\frac {s^2\cdot (b_i-a_i)^2}{8}-s\cdot t)
➂it's equivalent to finding the minimum of this term over s; the formal procedure is to take the first derivative with respect to s, set it to 0, and solve for s in terms of t. Since s is only an auxiliary variable introduced during the proof, we should eliminate it.
After that we find s=\frac {4\cdot t}{\sum_{i=1}^{n}(b_i-a_i)^2} is the critical point.
Therefore, P(\sum_{i=1}^{n}(Z_i)\ge t)\le exp(\frac {-2\cdot t^2}{\sum_{i=1}^{n}(b_i-a_i)^2})
Finally, replacing t with n\cdot\varepsilon, we have it proved that
P(\sum_{i=1}^{n}(Z_i)\ge n\cdot\varepsilon)\le exp(\frac {-2\cdot n^2\cdot\varepsilon^2}{\sum_{i=1}^{n}(b_i-a_i)^2})
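The optimization over s in ➂ can be cross-checked symbolically; the sketch below uses sympy (my own choice of tool) and writes S for \sum_{i=1}^{n}(b_i-a_i)^2:

import sympy as sp

s, t, S = sp.symbols('s t S', positive=True)    # S stands for the sum of (b_i - a_i)^2
exponent = s ** 2 * S / 8 - s * t               # exponent obtained at the end of step ➁
s_star = sp.solve(sp.diff(exponent, s), s)[0]   # first-order condition in s
print(s_star)                                   # 4*t/S
print(sp.simplify(exponent.subs(s, s_star)))    # -2*t**2/S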

If all the X_i share the same bounds a\le X_i\le b, then the inequality becomes
P(\sum_{i=1}^{n}(Z_i)\ge n\cdot\varepsilon)\le exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})
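A quick Monte Carlo illustration of this one-sided bound; the Bernoulli(1/2) variables on \lbrack 0,1\rbrack, the sample size n, the tolerance \varepsilon and the number of trials are all illustrative assumptions of mine:

import math, random

random.seed(0)
n, eps, trials = 100, 0.1, 20000
hits = 0
for _ in range(trials):
    z_sum = sum(random.randint(0, 1) - 0.5 for _ in range(n))   # sum of Z_i = X_i - E[X_i]
    if z_sum >= n * eps:
        hits += 1
print(hits / trials, math.exp(-2 * n * eps ** 2))                # empirical tail vs. bound exp(-2) ~ 0.1353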

By symmetry, we have it that
P(\vert\sum_{i=1}^{n}(Z_i)\vert\ge n\cdot\varepsilon)\le 2\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})
\Rightarrow P(\vert\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\vert\ge n\cdot\varepsilon)\le 2\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})
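Finally, as hinted in the prologue, the two-sided bound can be inverted to suggest a sample size: requiring 2\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})\le\delta gives n\ge\frac {(b-a)^2\cdot ln(2/\delta)}{2\cdot\varepsilon^2}. The helper below and its example values (\varepsilon=0.05, \delta=0.05, X_i\in\lbrack 0,1\rbrack) are my own illustrative assumptions:

import math

def suggested_n(eps, delta, a, b):
    # smallest n with 2 * exp(-2 * n * eps^2 / (b - a)^2) <= delta
    return math.ceil((b - a) ** 2 * math.log(2 / delta) / (2 * eps ** 2))

print(suggested_n(eps=0.05, delta=0.05, a=0.0, b=1.0))   # 738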