Another Tail of the Hypergeometric Distribution

In many cases of analysis of algorithms, we encounter the hypergeometric distribution, which is basically sampling without replacement. There are unfortunately a lot of different notations used, as is nicely summed up in Matthew Skala’s survey. I will not be making the situation better, as for this post I’ll use another notation still:

Chvátal MathWorld Wikipedia This post
Balls in urn N n+m N n
Balls that count M n K pn
Balls that don’t count N-M m N-K (1-p)n
Balls you draw n N n sn

We usually approximate the sampling as sampling with replacement, thereby getting the binomial distribution. Thus it comes as no surprise, that we can bound the tail of the hypergeometric just like the tail of the binomial. This has a sleek proof by Chvátal or simply using anti correlation and Hoefding.

\begin{aligned} \Pr[X\ge(p+t)s n] &\le \exp(-s n \text{D}(p+t||p))\\ &\le \exp(-2t^2sn) \end{aligned}

Here \text{D}(x||y)=x\log\frac{x}{y}+(1-x)\log\frac{1-x}{1-y} is the Kullback-Leibler divergence, and we’ve used that \text{D}(x||y)\ge2(x-y)^2.

However, while the binomial is only a good approximation to the hypergeometric, when the fraction of sampled elements $latex s$ is small. I’ve often wondered how much better we could actually do, when the number of sampled balls is close to the total number. The answer is as follows:

\begin{aligned} \Pr[X\le(p-t)s n] &\le \exp(-sn\text{D}(p-t||p)) &\le \exp(-2t^2sn)\\ \Pr[X\le(p-t)s n] &\le \exp(-(1-s)n\text{D}(p+\tfrac{ts}{1-s}||p)) &\le \exp(-\tfrac{s}{1-s}2t^2sn)\\ \Pr[X\ge(p+t)s n] &\le \exp(-sn\text{D}(p+t||p)) &\le \exp(-2t^2sn)\\ \Pr[X\ge(p+t)s n] &\le \exp(-(1-s)n\text{D}(p-\tfrac{ts}{1-s}||p)) &\le \exp(-\tfrac{s}{1-s}2t^2sn)\\ \end{aligned}

While the two ‘elegant’ versions of the bounds, in the third column, are somewhat weaker than the entropy bounds, they make the difference stand out clearly: For the hypergeometric distribution it is possible to add a factor \frac{s}{1-s} in the exponent. As s goes to 1, this quantity goes to \infty. This makes sense, since for s=1 the hypergeometric is deterministic. We also note that the new weak bounds are stronger as soon as s>1/2, however for the entropy bounds the values of $p$ and $t$ also plays in.

We can show the above bounds by making use of symmetry to swap sn for (1-s)n before applying Chvátal’s bound to this other tail. The symmetries are a bit wild in my notation, so I’ll use Chvátal’s original.

Let X be a random hypergeometric drawing n balls from an urn of N balls of which M balls count. Then we define the cumulative distribution functions H_{-}(M,N,n,k) = \Pr[X\le k] and H_{+}(M,N,n,k) = \Pr[X\ge k]. By swapping the roles of balls drawn and not drawn, as well as those counting for those not counting, we get the following:

\begin{aligned} H_{-}(M,N,n,k) &= H_{+}(M,N,N-n,M-k)\\ H_{-}(M,N,n,k) &= H_{+}(N-M,N,n,n-k)\\ H_{+}(M,N,n,k) &= H_{+}(N-M,N,N-n,N-M-n+k) \end{aligned}

The arguments are mostly similar, so we’ll just do bound number two:

\begin{aligned} \Pr[X\le(p-t)s n] &= H_{-}(pn,n,sn,(p-t)sn)\\ &= H_{+}(pn,n,n-sn,pn-(p-t)sn)\\ &= H_{+}(pn,n,(1-s)n,(p+\tfrac{ts}{1-s})(1-s)n)\\ &\le \exp(-(1-s)n\text{D}(p+\tfrac{ts}{1-s}||p)) \end{aligned}

Another way we might have gone around finding the bounds, would be studying Stirling approximation of the main term \log\left({p n \choose k}{(1-p) n \choose sn-k}/{n \choose sn}\right) = (\text{H}(\tfrac{k}{pn})+\text{H}(\tfrac{sn-k}{(1-p)n})-\text{H}(s))n + O(\log n). As it turns out, taking the series at s=0 or s=1 gives exactly the bounds from the beginning of this post. While not quite a lower/reverse bound, this does provide some idea of the sharpness of the bounds.

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s