Hoeffding Bounds
08 Mar 2018
Prologue To The Hoeffding Bounds
The Mjtsai1974 Light Weight Upper Bound
Before stepping into the major topic, let's make ourselves a tiny break in a corner with a rather simple inequality, constraining still the same kind of random variable as the Hoeffding lemma. It's just a trivial finding made during my proof of these Hoeffding bounds; I nicknamed it the Mjtsai1974 Light Weight Upper Bound.
Given X\in\lbrack a,b\rbrack, a random variable with E\lbrack X\rbrack=0, then for any s>0, we have E\lbrack e^{s\cdot X}\rbrack\le e^{s\cdot (b-a)}
proof::mjtsai1974
➀by given X\in\lbrack a,b\rbrack, take \lambda=\frac {x-a}{b-a},
we just have x=(1-\lambda)\cdot a+\lambda\cdot b.
➁since g(E\lbrack s\cdot X\rbrack)\le E\lbrack g(s\cdot X)\rbrack by Jensen's inequality for the convex function g(X)=e^{X}, the same convexity applied with the weights (1-\lambda,\lambda) gives
e^{(1-\lambda)\cdot s\cdot a+\lambda\cdot s\cdot b}\le (1-\lambda)\cdot e^{s\cdot a}+\lambda\cdot e^{s\cdot b}
\Rightarrow E\lbrack e^{(1-\lambda)\cdot s\cdot a+\lambda\cdot s\cdot b}\rbrack\le E\lbrack (1-\lambda)\cdot e^{s\cdot a}+\lambda\cdot e^{s\cdot b}\rbrack
➂that is to say, for any x\in\lbrack a,b\rbrack, we have
E\lbrack e^{s\cdot x}\rbrack
\le E\lbrack (1-\lambda)\cdot e^{s\cdot a}+\lambda\cdot e^{s\cdot b}\rbrack
=E\lbrack (\lambda+(1-\lambda)\cdot e^{s\cdot a-s\cdot b})\cdot e^{s\cdot b}\rbrack
=E\lbrack (\frac {x-a}{b-a}+\frac {b-x}{b-a}\cdot e^{s\cdot a-s\cdot b})\cdot e^{s\cdot b}\rbrack
=(\frac {-a}{b-a}+\frac {b}{b-a}\cdot e^{s\cdot (a-b)})\cdot e^{s\cdot b}…since E\lbrack X\rbrack=0
What remains is to upper bound this last expression.
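Step ➂ rests on the pointwise convexity bound e^{s\cdot x}\le (1-\lambda)\cdot e^{s\cdot a}+\lambda\cdot e^{s\cdot b} with \lambda=\frac {x-a}{b-a}. Below is a minimal numeric sketch of that bound; the values of a, b and s are arbitrary choices for illustration only.

```python
import math

# Illustrative check of the convexity step in step 3: for x in [a, b] and
# lambda = (x - a) / (b - a), e^{sx} lies below the chord
# (1 - lambda) e^{sa} + lambda e^{sb}.
a, b, s = -1.0, 2.0, 0.7   # arbitrary choices for this sketch
for i in range(101):
    x = a + (b - a) * i / 100
    lam = (x - a) / (b - a)
    chord = (1 - lam) * math.exp(s * a) + lam * math.exp(s * b)
    assert math.exp(s * x) <= chord + 1e-12, (x, lam)
print("e^{sx} <= chord on all grid points of [a, b]")
```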
➃take p=\frac {-a}{b-a}, then 1-p=\frac {b}{b-a}. From these two equalities, we can deduce two expressions for b:
b=\frac {a\cdot (p-1)}{p} and b=(1-p)\cdot (b-a)
Choose b=(1-p)\cdot (b-a) to express b; then the above inequality becomes
E\lbrack e^{s\cdot x}\rbrack
\le (\frac {-a}{b-a}+\frac {b}{b-a}\cdot e^{s\cdot (a-b)})\cdot e^{s\cdot b}
=((1-p)\cdot e^{s\cdot (a-b)}+p)\cdot e^{s\cdot (1-p)\cdot (b-a)}
=((1-p)\cdot e^{s\cdot (a-b)}+p)\cdot e^{s\cdot (p-1)\cdot (a-b)}
The reason we choose this expression for b is to simplify the inequality.
➄take u=a-b and \theta(u)=((1-p)\cdot e^{s\cdot u}+p)\cdot e^{s\cdot (p-1)\cdot u}. Note that E\lbrack X\rbrack=0 forces a\le 0\le b, so u\le 0 and p=\frac {-a}{b-a}\in\lbrack 0,1\rbrack. Then
ln\theta(u)
=ln((1-p)\cdot e^{s\cdot u}+p)+s\cdot (p-1)\cdot u
\le ln((1-p)+p)+s\cdot (p-1)\cdot u…since e^{s\cdot u}\le 1 for s>0, u\le 0
=s\cdot (p-1)\cdot u…since ln(1)=0
=s\cdot (1-p)\cdot (-u)
\le s\cdot (-u)…since 0\le 1-p\le 1 and -u\ge 0
=s\cdot (b-a)
, then \theta(u)\le e^{s\cdot (b-a)}
Therefore, E\lbrack e^{s\cdot X}\rbrack\le e^{s\cdot (b-a)} is thus proved. This is a simple, light weight inequality; I denote it the Mjtsai1974 Light Weight Upper Bound.
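As a quick sanity check of this bound, we can evaluate E\lbrack e^{s\cdot X}\rbrack exactly for a two-point zero-mean variable on \lbrack a,b\rbrack, which is precisely the extremal case constructed in ➂, and compare it with e^{s\cdot (b-a)}. The grid of a, b, s below is an arbitrary assumption of this sketch.

```python
import math

# Sanity check (illustrative): a two-point zero-mean variable on [a, b] takes
# value a with probability b/(b-a) and value b with probability -a/(b-a).
# Verify E[e^{sX}] <= e^{s(b-a)} over a small grid of a < 0 < b and s > 0.
for a in [-3.0, -1.0, -0.5]:
    for b in [0.5, 1.0, 2.0]:
        p_a = b / (b - a)    # weight on a, chosen so that E[X] = 0
        p_b = -a / (b - a)   # weight on b
        for s in [0.1, 0.5, 1.0, 2.0]:
            mgf = p_a * math.exp(s * a) + p_b * math.exp(s * b)
            bound = math.exp(s * (b - a))
            assert mgf <= bound + 1e-12, (a, b, s, mgf, bound)
print("light weight upper bound holds on the whole grid")
```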
The Hoeffding Lemma
Given X\in\lbrack a,b\rbrack, a random variable with E\lbrack X\rbrack=0, then for any s>0, we have E\lbrack e^{s\cdot X}\rbrack\le e^{\frac {s^{2}\cdot (b-a)^{2}}{8}}
proof::mjtsai1974
We would inherit from ➄ in the proof of the Mjtsai1974 Light Weight Upper Bound, beginning from ln\theta(u)=ln((1-p)\cdot e^{s\cdot u}+p)+s\cdot (p-1)\cdot u, where u=a-b and \theta(u)=((1-p)\cdot e^{s\cdot u}+p)\cdot e^{s\cdot (p-1)\cdot u}; recall from ➃ there that E\lbrack e^{s\cdot X}\rbrack\le\theta(u).
➀take M(u)=ln\theta(u). By Taylor's theorem with the Lagrange form of the remainder, expanding M around 0, there exists some w between u and 0 (in this case u\le 0) such that
M(u)=M(0)+\frac {M^{′}(0)}{1!}\cdot u+\frac {M^{″}(w)}{2!}\cdot u^{2}
where M^{′}=ln^{′}\theta and M^{″}=ln^{″}\theta.
➁from above, we need the first and second derivatives; they are
ln^{′}\theta(u)=(p+(1-p)\cdot e^{s\cdot u})^{-1}\cdot ((1-p)\cdot s\cdot e^{s\cdot u})+s\cdot(p-1)
ln^{″}\theta(u)=-1\cdot (p+(1-p)\cdot e^{s\cdot u})^{-2}\cdot ((1-p)\cdot s\cdot e^{s\cdot u})^{2}
\;\;\;\;\;\;\;\;+(p+(1-p)\cdot e^{s\cdot u})^{-1}\cdot ((1-p)\cdot s^2\cdot e^{s\cdot u})
And, ln\theta(0)=0, ln^{′}\theta(0)=0,
ln^{″}\theta(0)=-1\cdot((1-p)\cdot s)^2+(1-p)\cdot s^2=p\cdot (1-p)\cdot s^2
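The derivative bookkeeping in ➁ is easy to get wrong, so here is a minimal symbolic sketch (assuming the sympy package is available) that re-derives ln\theta(0)=0, ln^{′}\theta(0)=0 and ln^{″}\theta(0)=p\cdot (1-p)\cdot s^{2}.

```python
import sympy as sp

# Symbolic re-derivation of the values used in step 2 (requires sympy).
u, s, p = sp.symbols('u s p', real=True)
M = sp.log((1 - p) * sp.exp(s * u) + p) + s * (p - 1) * u   # M(u) = ln(theta(u))
M1 = sp.diff(M, u)        # first derivative
M2 = sp.diff(M, u, 2)     # second derivative
print(sp.simplify(M.subs(u, 0)))                         # expect 0
print(sp.simplify(M1.subs(u, 0)))                        # expect 0
print(sp.simplify(M2.subs(u, 0) - p * (1 - p) * s**2))   # expect 0
```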
➂therefore, we’d like to maximize the term ln^{″}\theta(u). Combining the two terms in ➁ over the common denominator (p+(1-p)\cdot e^{s\cdot u})^{2}, it regularizes to
ln^{″}\theta(u)
=\frac {p\cdot (1-p)\cdot s^2\cdot e^{s\cdot u}}{(p+(1-p)\cdot e^{s\cdot u})^{2}}
take z=e^{s\cdot u}, then
=s^2\cdot\frac {p\cdot (1-p)\cdot z}{(p+(1-p)\cdot z)^{2}}
➃take N(z)=\frac {p\cdot (1-p)\cdot z}{(p+(1-p)\cdot z)^{2}}, so that ln^{″}\theta(u)=s^2\cdot N(z); this is equivalent to asking for the maximum of N(z). The formal procedure would be to take its first derivative and set it to 0, then solve for the critical z expressed in terms of p; in this case, z is the input parameter of N(z).
N′(z)=\frac {p^{2}\cdot (1-p)-p\cdot (1-p)^{2}\cdot z}{(p+(1-p)\cdot z)^{3}}=0
After solving it, we get z=\frac {p}{1-p} as the critical point.
➄take z=\frac {p}{1-p} in N(z)
N(z)=\frac {p\cdot (1-p)\cdot\frac {p}{1-p}}{(p+(1-p)\cdot\frac {p}{1-p})^{2}}=\frac {p^{2}}{4\cdot p^{2}}=\frac {1}{4}
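A small numeric cross-check of ➃ and ➄, purely illustrative: scan N(z) over a grid and confirm that its maximum never exceeds \frac {1}{4} and is attained near z=\frac {p}{1-p}.

```python
# Illustrative scan: N(z) = p(1-p)z / (p + (1-p)z)^2 stays below 1/4,
# with its maximum near z = p / (1 - p).
def N(z, p):
    return p * (1 - p) * z / (p + (1 - p) * z) ** 2

for p in [0.1, 0.3, 0.5, 0.9]:
    zs = [0.01 * k for k in range(1, 2001)]   # grid over z in (0, 20]
    assert all(N(z, p) <= 0.25 + 1e-12 for z in zs)
    z_best = max(zs, key=lambda z: N(z, p))
    print(p, p / (1 - p), z_best, N(z_best, p))   # z_best ~ p/(1-p), N ~ 1/4
```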
Be recalled that ln^{″}\theta(u)=s^2\cdot N(z) with z=e^{s\cdot u}; back to the Taylor expansion in ➀, where M(0)=0 and M^{′}(0)=0, we have
ln\theta(u)
=M(u)
=\frac {ln^{″}\theta(w)}{2!}\cdot u^{2}
=\frac {s^{2}\cdot N(e^{s\cdot w})}{2!}\cdot u^2
\le \frac {1}{2!}\cdot s^{2}\cdot\frac {1}{4}\cdot u^{2}
=\frac {1}{8}\cdot s^{2}\cdot u^{2}
Therefore, \theta(u)\le e^{\frac {1}{8}\cdot s^{2}\cdot u^{2}}
➅finally, we just proved that
E\lbrack e^{s\cdot X}\rbrack\le \theta(u)\le e^{\frac {1}{8}\cdot s^{2}\cdot u^{2}}=e^{\frac {s^{2}\cdot (a-b)^{2}}{8}}=e^{\frac {s^{2}\cdot (b-a)^{2}}{8}}
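To see the lemma in action, a numeric sketch: for the two-point zero-mean variable on \lbrack a,b\rbrack used earlier, compare the exact E\lbrack e^{s\cdot X}\rbrack against e^{\frac {s^{2}\cdot (b-a)^{2}}{8}}. The pairs (a,b) and values of s below are arbitrary choices for illustration.

```python
import math

# Illustrative check of the Hoeffding lemma on the extremal two-point
# zero-mean variable: E[e^{sX}] <= exp(s^2 (b-a)^2 / 8).
for a, b in [(-1.0, 1.0), (-2.0, 0.5), (-0.3, 3.0)]:
    p_a, p_b = b / (b - a), -a / (b - a)   # weights giving E[X] = 0
    for s in [0.05, 0.2, 1.0, 3.0]:
        mgf = p_a * math.exp(s * a) + p_b * math.exp(s * b)
        bound = math.exp(s * s * (b - a) ** 2 / 8)
        assert mgf <= bound + 1e-9, (a, b, s, mgf, bound)
print("Hoeffding lemma bound holds on the whole grid")
```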
The Hoeffding Inequality
Given that X_1,X_2,\dots,X_n are independent random variables with a_i\le X_i\le b_i for each i, then
\;\;\;\;P(\frac {1}{n}\cdot\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\ge\varepsilon)\le exp(\frac {-2\cdot n^2\cdot\varepsilon^2}{\sum_{i=1}^{n}(b_i-a_i)^2})
Proof:
We’d like to prove it by the Hoeffding lemma.
➀take Z_i=X_i-E\lbrack X_i\rbrack for all i; then Z_i\in\lbrack a_i-E\lbrack X_i\rbrack,b_i-E\lbrack X_i\rbrack\rbrack, an interval of the same width b_i-a_i, and E\lbrack Z_i\rbrack=0 just holds. Since \frac {1}{n}\cdot\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)=\frac {1}{n}\cdot\sum_{i=1}^{n}Z_i, then
P(\frac {1}{n}\cdot\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\ge\varepsilon)=P(\frac {1}{n}\cdot\sum_{i=1}^{n}(Z_i)\ge\varepsilon)
\Rightarrow P(\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\ge n\cdot\varepsilon)=P(\sum_{i=1}^{n}(Z_i)\ge n\cdot\varepsilon)
➁take t=n\cdot\varepsilon and any s>0, then we have
P(\sum_{i=1}^{n}(Z_i)\ge n\cdot\varepsilon)
=P(\sum_{i=1}^{n}(Z_i)\ge t)
=P(exp(\sum_{i=1}^{n}(s\cdot Z_i))\ge exp(s\cdot t))…since e^{x} is strictly increasing and s>0
\le E\lbrack (\prod_{i=1}^{n}exp(s\cdot Z_i))\rbrack\cdot exp(-s\cdot t)…by Markov's inequality
=\prod_{i=1}^{n} E\lbrack exp(s\cdot Z_i)\rbrack\cdot exp(-s\cdot t)…by independence of the Z_i
\le exp(-s\cdot t)\cdot\prod_{i=1}^{n} exp(\frac {s^2\cdot (b_i-a_i)^2}{8})…by the Hoeffding lemma
=exp(\sum_{i=1}^{n}\frac {s^2\cdot (b_i-a_i)^2}{8}-s\cdot t)
➂it’s equivalent to finding the minimum over s>0 of the exponent \sum_{i=1}^{n}\frac {s^2\cdot (b_i-a_i)^2}{8}-s\cdot t. The formal procedure is to take its first derivative with respect to s, set it to 0, and solve for s in terms of t; since s was introduced during the proof, we should eliminate it.
After that we find s=\frac {4\cdot t}{\sum_{i=1}^{n}(b_i-a_i)^2} is the critical point. Substituting it back, the exponent becomes \frac {2\cdot t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}-\frac {4\cdot t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}=\frac {-2\cdot t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}.
Therefore, P(\sum_{i=1}^{n}(Z_i)\ge t)\le exp(\frac {-2\cdot t^2}{\sum_{i=1}^{n}(b_i-a_i)^2})
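A small arithmetic cross-check of ➂, with made-up widths b_i-a_i: plugging s=\frac {4\cdot t}{\sum_{i=1}^{n}(b_i-a_i)^2} back into the exponent indeed yields \frac {-2\cdot t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}, and a coarse scan confirms this s is the minimizer.

```python
# Cross-check with hypothetical interval widths b_i - a_i:
# f(s) = sum_i s^2 (b_i - a_i)^2 / 8 - s t is minimized at s = 4t / W,
# where W = sum_i (b_i - a_i)^2, and f there equals -2 t^2 / W.
widths = [1.0, 2.0, 0.5, 3.0]          # made-up b_i - a_i values
W = sum(w * w for w in widths)
t = 2.5

def f(s):
    return s * s * W / 8 - s * t

s_star = 4 * t / W
assert abs(f(s_star) - (-2 * t * t / W)) < 1e-12
assert all(f(s_star) <= f(s_star + d) for d in (-0.5, -0.1, 0.1, 0.5))
print(s_star, f(s_star))
```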
Finally, replacing t with n\cdot\varepsilon, we have it proved that
P(\sum_{i=1}^{n}(Z_i)\ge n\cdot\varepsilon)\le exp(\frac {-2\cdot n^2\cdot\varepsilon^2}{\sum_{i=1}^{n}(b_i-a_i)^2})
If we are given all Z_i\in\lbrack a,b\rbrack, then \sum_{i=1}^{n}(b_i-a_i)^2=n\cdot (b-a)^2 and the inequality becomes
P(\sum_{i=1}^{n}(Z_i)\ge n\cdot\varepsilon)\le exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})
By symmetry, we have it that
P(\vert\sum_{i=1}^{n}(Z_i)\vert\ge n\cdot\varepsilon)\le 2\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})
\Rightarrow P(\vert\sum_{i=1}^{n}(X_i-E\lbrack X_i\rbrack)\vert\ge n\cdot\varepsilon)\le 2\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2})
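Finally, a Monte Carlo sketch of the two-sided inequality under an assumed setup: n i.i.d. uniform variables on \lbrack 0,1\rbrack, so a=0, b=1 and Z_i=X_i-\frac {1}{2}. The empirical tail should fall below 2\cdot exp(\frac {-2\cdot n\cdot\varepsilon^2}{(b-a)^2}).

```python
import math
import random

# Monte Carlo sketch of the two-sided Hoeffding inequality with an assumed
# setup: X_i ~ Uniform[0, 1], so a = 0, b = 1 and Z_i = X_i - 1/2.
random.seed(0)
n, eps, trials = 100, 0.1, 20000
hits = sum(
    1 for _ in range(trials)
    if abs(sum(random.random() - 0.5 for _ in range(n))) >= n * eps
)
empirical = hits / trials
bound = 2 * math.exp(-2 * n * eps ** 2 / (1 - 0) ** 2)
print(f"empirical tail {empirical:.4f} <= Hoeffding bound {bound:.4f}")
```

The bound is quite loose in this setup, which is typical: Hoeffding's inequality only uses the range of each Z_i, not its variance.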