The Measurable Entropy

Maximum entropy, uniformity and normality.

Mort Yao


A brief introduction to basic information theory (entropy/information as a measure for theoretical unpredictability of data) and descriptive statistics (quantitative properties about real-world data including central tendency, dispersion and shape). The maximization of entropy under different constraints yields some common probability distributions: uniform distribution (given no prior knowledge); normal distribution (given that mean and variance are known).

Entropy as a measure1

For random variables \(X\) and \(Y\), we define sets \(\tilde X\) and \(\tilde Y\). Then the information entropy \(\operatorname{H}\) may be viewed as a signed measure \(\mu\) over sets: \[\begin{align*} \operatorname{H}(X) &= \mu(\tilde X) \\ \operatorname{H}(Y) &= \mu(\tilde Y) \\ \operatorname{H}(X,Y) &= \mu(\tilde X \cup \tilde Y) \qquad \text{(Joint entropy is the measure of a set union)} \\ \operatorname{H}(X\,|\,Y) &= \mu(\tilde X \setminus \tilde Y) \qquad \text{(Conditional entropy is the measure of a set difference)} \\ \operatorname{I}(X;Y) &= \mu(\tilde X \cap \tilde Y) \qquad \text{(Mutual information is the measure of a set intersection)} \end{align*}\] The inclusion–exclusion principle: \[\begin{align*} \operatorname{H}(X,Y) &= \operatorname{H}(X) + \operatorname{H}(Y) - \operatorname{I}(X;Y) \\ \mu(\tilde X \cup \tilde Y) &= \mu(\tilde X) + \mu(\tilde Y) - \mu(\tilde X \cap \tilde Y) \end{align*}\] Bayes’ theorem: \[\begin{align*} \operatorname{H}(X\,|\,Y) &= \operatorname{H}(Y\,|\,X) + \operatorname{H}(X) - \operatorname{H}(Y) \\ \mu(\tilde X \setminus \tilde Y) &= \mu(\tilde Y \setminus \tilde X) + \mu(\tilde X) - \mu(\tilde Y) \end{align*}\]

Entropy and data coding

Absolute entropy (Shannon entropy) quantifies how much information is contained in some data. For data compression, the entropy gives the minimum size that is needed to reconstruct original data (losslessly). Assume that we want to store a random binary string of length \(\ell\) (by “random”, we do not have yet any prior knowledge on what data to be stored). Under the principle of maximum entropy, the entropy of its distribution \(p(x)\) should be maximized: \[\max \operatorname{H}(X) = \max \left\{ -\sum_{x\in\mathcal{X}} p(x) \log p(x) \right\}\] given the only constraint \[\sum_{x\in\mathcal{X}} p(x) = 1\] Let \(\lambda\) be the Lagrange multiplier, set \[\mathcal{L} = - \sum_{x\in\mathcal{X}} p(x) \log p(x) - \lambda\left( \sum_{x\in\mathcal{X}} p(x) - 1 \right)\] We get \[\begin{align*} \frac{\partial\mathcal{L}}{\partial x} = -p(x)(\log p(x) + 1 + \lambda) &= 0 \\ \log p(x) &= - \lambda - 1 \\ p(x) &= c \qquad \text{(constant)} \end{align*}\] That is, the discrete uniform distribution maximizes the entropy for a random string. Since \(|\mathcal{X}| = 2^\ell\), we have \(p(x) = 2^{-\ell}\) and \(\operatorname{H}(X) = -\sum_{x\in\mathcal{X}} 2^{-\ell} \log_2 2^{-\ell} = \ell\) (bits). We conclude that the information that can be represented in a \(\ell\)-bit string is at most \(\ell\) bits. Some practical results include

  • In general, pseudorandom data (assume no prior knowledge) cannot be losslessly compressed, e.g., the uniform key used in one-time pad must have \(\log_2 |\mathcal{M}|\) bits (lower bound) so as not to compromise the perfect secrecy. (Further topic: Shannon’s source coding theorem)
  • Fully correct encoding/decoding of data, e.g., \(\mathsf{Enc}(m)\) and \(\mathsf{Dec}(c)\) algorithms in a private-key encryption scheme, must ensure that the probability distributions of \(m \in \mathcal{M}\) and \(c \in \mathcal{C}\) have the same entropy.
  • An algorithm with finite input cannot generate randomness infinitely. Consider a circuit that takes the encoded algorithm with some input (\(\ell\) bits in total) and outputs some randomness, the entropy of the output data is at most \(\ell\) bits. (Further topic: Kolmogorov complexity)

Relative entropy (KL divergence) quantifies how much information diverges between two sets of data. For data differencing, the KL divergence gives the minimum patch size that is needed to reconstruct target data (with distribution \(p(x)\)) given source data (with distribution \(q(x)\)).

In particular, if \(p(x) = q(x)\), which means that the two distributions are identical, we have \(\operatorname{D}_\mathrm{KL}(p\|q) = 0\). This follows our intuition that no information is gained or lost during data encoding/decoding. If \(p(x_0) = 0\) at \(x=x_0\), we take \(p(x) \log \frac{p(x)}{q(x)} = 0\), to justify the fact that the target data is trivial to reconstruct at this point, no matter how much information \(q(x)\) contains. However, if \(q(x_0) = 0\) at \(x=x_0\), we should take \(p(x) \log \frac{p(x)}{q(x)} = \infty\), so that the target data is impossible to reconstruct if we have only trivial \(q(x)\) at some point (unless \(p(x_0) = 0\)).

Lemma 4.1. (Gibbs’ inequality)2 The KL divergence is always non-negative: \(\operatorname{D}_\mathrm{KL}(p\|q) \geq 0\).

Informally, Lemma 4.1 simply states that in order to reconstruct target data from source data, either more information (\(\operatorname{D}_\mathrm{KL}(p\|q) > 0\)) or no further information (\(\operatorname{D}_\mathrm{KL}(p\|q) = 0\)) is needed.

Maximum entropy and normality

Theorem 4.2. Normal distribution \(\mathcal{N}(\mu,\sigma^2)\) maximizes the differential entropy for given mean \(\mu\) and variance \(\sigma^2\).

Proof.3 Let \(g(x)\) be a pdf of the normal distribution with mean \(\mu\) and variance \(\sigma^2\). Let \(f(x)\) be an arbitrary pdf with the same mean and variance.

Consider the KL divergence between \(f(x)\) and \(g(x)\). By Lemma 4.1 (Gibbs’ inequality): \[\operatorname{D}_\mathrm{KL}(f\|g) = \int_{-\infty}^\infty f(x) \log \frac{f(x)}{g(x)} dx = \operatorname{H}(f,g) - \operatorname{H}(f) \geq 0\]

Notice that \[\begin{align*} \operatorname{H}(f,g) &= - \int_{-\infty}^\infty f(x) \log g(x) dx \\ &= - \int_{-\infty}^\infty f(x) \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) dx \\ &= \frac{1}{2} \left( \log(2\pi\sigma^2) + 1 \right) \\ &= \operatorname{H}(g) \end{align*}\] Therefore, \[\operatorname{H}(g) \geq \operatorname{H}(f)\] That is, the distribution of \(g(x)\) (Gaussian) always has the maximum entropy.

It is also possible to derive the normal distribution directly from the principle of maximum entropy, under the constraint such that \(\int_{-\infty}^\infty (x-\mu)^2f(x)dx = \sigma^2\).

The well-known central limit theorem (CLT) which states that the sum of independent random variables \(\{X_1,\dots,X_n\}\) tends toward a normal distribution may be alternatively expressed as the monotonicity of the entropy of the normalized sum: \[\operatorname{H}\left( \frac{\sum_{i=1}^n X_i}{\sqrt{n}} \right)\] which is an increasing function of \(n\). [1]

References and further reading


T. M. Cover and J. A. Thomas. Elements of Information Theory, 2nd ed.


[1] S. Artstein, K. Ball, F. Barthe, and A. Naor, “Solution of shannon’s problem on the monotonicity of entropy,” Journal of the American Mathematical Society, vol. 17, no. 4, pp. 975–982, 2004.