
Entropy of a Mixture

Given a pair $(p_0, p_1)$ of probability density functions and an interpolation factor $\lambda \in [0, 1],$ consider the mixture $p_\lambda = (1 - \lambda) p_0 + \lambda p_1.$ How does the entropy $H(p_\lambda) = -\E_{X \sim p_\lambda} \ln p_\lambda(X)$ of this mixture vary as a function of $\lambda$? The widget below shows what this situation looks like for a pair of discrete distributions on the plane. (Drag to adjust $\lambda.$)
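
To make the curve concrete, here is a minimal numerical sketch (mine, not part of the post's widget): it sweeps $\lambda$ for an arbitrary pair of three-outcome distributions. The arrays `p0` and `p1` are made up for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])

for lam in np.linspace(0, 1, 6):
    p_lam = (1 - lam) * p0 + lam * p1
    print(f"lambda = {lam:.1f}   H(p_lambda) = {entropy(p_lam):.4f}")
```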

It turns out that entropy is concave as a function of probabilities. This explains the "upward bulge" of our curve. It might also be intuitive to you that, when $p_0$ and $p_1$ are less similar, this curve should bulge upwards more. What exactly does its shape tell us? In this short post, we'll see how questions about $H(p_\lambda)$ lead to information about the relationship between $p_0$ and $p_1.$ For example, we'll see how Jensen-Shannon divergence, KL divergence, and $\chi^2$ divergence can all be read off from $H(p_\lambda).$ Actually, this function seems to hold a lot of information about the relationship between $p_0$ and $p_1$—perhaps everything you would ever want to know.

Mutual Information

Entropy being concave means that the inequality $H(p_\lambda) \ge (1 - \lambda) H(p_0) + \lambda H(p_1)$ holds for all pairs $(p_0, p_1)$ and scalars $\lambda \in [0, 1].$ Actually, this kind of inequality has a very important interpretation: in general, the gap in a Jensen inequality on entropy measures mutual information.

To see how this works, let's introduce a Bernoulli variable $A.$ Let $A = 1$ with probability $\lambda$ and $A = 0$ otherwise, and let $X$ be a draw from $p_1$ when $A = 1$ and a draw from $p_0$ when $A = 0.$ Overall, $X \sim p_\lambda.$ In terms of the variables $X$ and $A,$ the weighted average of entropies $(1 - \lambda) H(p_0) + \lambda H(p_1)$ is called the conditional entropy $H(X|A).$ We can interpret conditional entropy as the average surprisal we pay on our prediction of $X$ if we're given the value of $A$ in advance. Our inequality above states that $H(X|A)$ is never larger than $H(X),$ which I hope is intuitive information-theoretically! Mutual information $I(X;A)$ measures the gap $H(X) - H(X|A)$; in other words, $H(p_\lambda) = (1 - \lambda) H(p_0) + \lambda H(p_1) + I(X;A).$ For two variables $X$ and $A$ in general, we would write $H(\E_A p(x|A)) = \E_A H(p(x|A)) + I(X;A)$ where $p(x|A)$ is understood as a random distribution depending on $A.$ The bound $H(\E_A p(x|A)) \ge \E_A H(p(x|A))$ implied by non-negativity of mutual information is called a Jensen inequality, and it follows from concavity of $H.$
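
Here is a quick numerical check of this decomposition (a sketch of mine, using the same kind of made-up discrete example): we build the joint distribution of $(A, X),$ compute $I(X;A)$ directly from the joint table, and compare $H(p_\lambda)$ against $(1 - \lambda) H(p_0) + \lambda H(p_1) + I(X;A).$

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])
lam = 0.3

# Joint distribution of (A, X): row 0 is A = 0, row 1 is A = 1.
joint = np.vstack([(1 - lam) * p0, lam * p1])
p_lam = joint.sum(axis=0)   # marginal of X, equal to the mixture
p_a = joint.sum(axis=1)     # marginal of A, equal to (1 - lam, lam)

# Mutual information computed from the joint table.
mi = sum(
    joint[a, x] * np.log(joint[a, x] / (p_a[a] * p_lam[x]))
    for a in range(2) for x in range(len(p_lam)) if joint[a, x] > 0
)

lhs = entropy(p_lam)
rhs = (1 - lam) * entropy(p0) + lam * entropy(p1) + mi
print(lhs, rhs)  # these agree
```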

Now, the idea that $I(X;A)$ measures how much knowing $A$ decreases our expected surprisal about $X$ leads to a way to express $I(X;A)$ in terms of KL divergences between the distributions $p_0,$ $p_1$ and $p_\lambda.$ Explicitly,
\begin{align*} I(X;A) & = \E_{A,X} \bigl[ \ln p(X|A) - \ln p(X) \bigr] \\ & = \E_A \KL(\text{posterior on } X \Vert \text{prior on } X) \\ & = (1 - \lambda) \KL(p_0 \Vert p_\lambda) + \lambda \KL(p_1 \Vert p_\lambda). \end{align*}
We can also put our two variables in the other order and ask how much $X$ tells us about $A.$ Where $\KL(x \Vert y) = x \ln(x/y) + (1 - x) \ln((1 - x)/(1-y))$ denotes the KL divergence between two Bernoulli distributions with rates $x$ and $y,$ we find that
\begin{align*} I(X;A) & = \E_X \KL(\text{posterior on } A \Vert \text{prior on } A) \\ & = \E_X \KL\left( \frac{\lambda p_1}{p_\lambda} \middle \Vert \lambda \right) \\ & = \E_X \biggl[ -\frac{(1 - \lambda) p_0}{p_\lambda} \left( \ln (1 - \lambda) - \ln \frac{(1 - \lambda) p_0}{p_\lambda} \right) \\ & \phantom{= \E_X \biggl[} - \frac{\lambda p_1}{p_\lambda} \left( \ln \lambda - \ln \frac{\lambda p_1}{p_\lambda} \right) \biggr] \\ & = \E_X \biggl[ \frac{(1 - \lambda) p_0}{p_\lambda} \ln \frac{p_0}{p_\lambda} + \frac{\lambda p_1}{p_\lambda} \ln \frac{p_1}{p_\lambda} \biggr], \end{align*}
which is the same as our earlier expression with the order of integration swapped. (Just write the expectation as an integral and cancel out the factors of $p_\lambda$!) There are a lot of ways to think about this kind of expression.
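
Since these manipulations are easy to get wrong, here is a small sanity check (again my own sketch with arbitrary inputs): it evaluates the first form, the average KL divergence from each component to the mixture, and the second form, the expected Bernoulli KL divergence between the posterior and prior on $A,$ and confirms they agree.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def bernoulli_kl(x, y):
    """KL divergence between Bernoulli distributions with rates x and y."""
    total = 0.0
    if x > 0:
        total += x * np.log(x / y)
    if x < 1:
        total += (1 - x) * np.log((1 - x) / (1 - y))
    return total

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])
lam = 0.3
p_lam = (1 - lam) * p0 + lam * p1

# First form: average KL divergence from each component to the mixture.
form_1 = (1 - lam) * kl(p0, p_lam) + lam * kl(p1, p_lam)

# Second form: expected KL divergence between the posterior and prior on A.
posterior = lam * p1 / p_lam
form_2 = np.sum(p_lam * np.array([bernoulli_kl(r, lam) for r in posterior]))

print(form_1, form_2)  # these agree
```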

After a little more rearranging, we can also show that $I(X;A)$ is an $f$-divergence between the distributions $p_0$ and $p_1,$ meaning it can be written in the form $\E_{p_0} f(p_0/p_1)$ for some convex function $f$ with a few technical properties. When $\lambda = 1/2,$ $I(X;A)$ is called the Jensen-Shannon divergence. Jensen-Shannon divergence sounds like a pretty natural measurement of the discrepancy between two distributions; if I secretly flip a coin and send you either a draw from $p_0$ or $p_1$ depending on the outcome, $\operatorname{JSD}(p_0, p_1)$ measures how much I've implicitly told you about my coin flip. Unlike KL divergence, it is bounded and symmetric.
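
As a concrete check (an illustrative sketch, not from the post), the Jensen-Shannon divergence is exactly the Jensen gap of entropy at $\lambda = 1/2$:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])

# JSD(p0, p1) as the Jensen gap of entropy at lambda = 1/2.
m = 0.5 * (p0 + p1)
jsd = entropy(m) - 0.5 * (entropy(p0) + entropy(p1))
print(jsd)  # symmetric in p0, p1 and never larger than ln 2
```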

"Proclivity"

So far, we've seen that the bulge in the curve $H(p_\lambda)$ measures mutual information $I(X;A)$ between a draw from $p_\lambda$ and a fictitious latent variable $A,$ and we've remembered how to write down $I(X;A)$ explicitly. However, what's really interesting is how this quantity changes as a function of $\lambda.$ For starters, is it always true that "mixing" $p_0$ with a small factor of another distribution $p_1$ increases entropy? A little thought should convince you this is not necessarily true unless $p_0$ is a Dirac distribution: any infinitesimal perturbation of $p_0$ can be achieved by "mixing in" some distribution $p_1,$ and the only local minima of entropy are the Dirac distributions.

To find out when mixing in $p_1$ increases entropy, let's differentiate. Using that $\frac{d}{d \lambda}_{\lambda = 0} H(p + \lambda q) = -\int q \ln p \, dx$ for a perturbation $q$ with $\int q \, dx = 0,$ we find
\begin{align*} \frac{d}{d \lambda}_{\lambda = 0} H(p_\lambda) & = \int (p_0 - p_1) \ln p_0 \, dx \\ & = H(p_1, p_0) - H(p_0) \end{align*}
where $H(p_1, p_0)$ denotes the cross-entropy $-\E_{p_1} \ln p_0.$ (Throughout this post, we'll assume that $p_0 > 0$ and that limits mercifully commute with integrals. This second assumption is always true, for example, when $p_0$ and $p_1$ are distributions over a finite set.)
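
A finite-difference check of this derivative formula, as a quick sketch with the same made-up discrete distributions (here `cross_entropy(p1, p0)` stands for $H(p_1, p_0)$):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])

eps = 1e-6
finite_diff = (entropy((1 - eps) * p0 + eps * p1) - entropy(p0)) / eps
derivative = cross_entropy(p1, p0) - entropy(p0)  # H(p1, p0) - H(p0)
print(finite_diff, derivative)  # these agree up to O(eps)
```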

I'm not sure if there's any name for this quantity $H(p_1, p_0) - H(p_0).$ I've started calling it "proclivity" and denoting it by $\operatorname{Pc}(p_1, p_0).$ We can relate proclivity to entropy and KL divergence by $\operatorname{Pc}(p_1, p_0) = \KL(p_1 \Vert p_0) + H(p_1) - H(p_0)$ but, unlike KL divergence, proclivity can either be positive or negative. For instance, $\operatorname{Pc}(\delta_y, p) = \E_p \ln p(x) - \ln p(y)$ when $\delta_y$ is a Dirac distribution, and $\ln p(y)$ can certainly be either larger or smaller than its average value. You can play around with the proclivity between two distributions in the widget below. (Drag to adjust the center of the red distribution.)

Proclivity also arises when differentiating cross-entropy with respect to temperature. Explicitly, given a family of distributions $p_\beta \propto \exp(-\beta H)$ (here $H$ is an energy function, not an entropy) depending on an inverse temperature $\beta,$ it turns out that $\frac{d}{d \beta}_{\beta = 1} H(q, p_\beta) = \operatorname{Pc}(q, p),$ where $p$ denotes $p_{\beta = 1}.$ This is the justification for my name: the proclivity $\operatorname{Pc}(q, p)$ is negative exactly when $p$ is "predisposed" to being a good predictive distribution for $q,$ in the sense that decreasing temperature causes a first-order decrease in cross-entropy. If we arrange the four quantities of the form $-\E_{p_i} \ln p_j$ in a little square, proclivities go at right angles to KL divergences:
\begin{CD} H(q) @>\operatorname{Pc}(p, q)>> H(p, q) \\ @V\KL(q \Vert p)VV @AA\KL(p \Vert q)A\\ H(q, p) @<<\operatorname{Pc}(q, p)< H(p) \end{CD}
(In this diagram, an arrow $A \xrightarrow{C} B$ stands for an equation $A + C = B.$)
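
To see the temperature identity numerically, here is another small sketch of mine: I take the energy to be $-\ln p,$ so that $p_\beta \propto p^\beta$ and $p_{\beta = 1} = p,$ and compare a finite difference in $\beta$ against the proclivity.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])  # base distribution; p_beta is proportional to p**beta
q = np.array([0.1, 0.3, 0.6])

def p_beta(beta):
    w = p ** beta
    return w / w.sum()

eps = 1e-6
finite_diff = (cross_entropy(q, p_beta(1 + eps))
               - cross_entropy(q, p_beta(1 - eps))) / (2 * eps)
proclivity = cross_entropy(q, p) - entropy(p)  # Pc(q, p) = H(q, p) - H(p)
print(finite_diff, proclivity)  # these agree
```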

Mutual Information for Small $\lambda$

The derivative of mutual information is a more familiar quantity. In general you can check that $\frac{d}{d \lambda} I(X;A) = \KL(p_1 \Vert p_\lambda) - \KL(p_0 \Vert p_\lambda),$ and at $\lambda = 0$ we simply have $I(X;A) = \lambda \KL(p_1 \Vert p_0) + O(\lambda^2).$ Moreover, by concavity of $I(X;A)$ in $\lambda,$ the first-order approximation is an upper bound: $I(X;A) \le \lambda \KL(p_1 \Vert p_0).$ To wrap up this post, let's think about what this approximation means and take a look at the higher-order terms of the Taylor series for $I(X;A)$ near $\lambda = 0.$
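
Both the derivative formula and the tangent-line bound are easy to confirm numerically; here is a sketch along the same lines as the earlier ones.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])

def mixture(lam):
    return (1 - lam) * p0 + lam * p1

def mi(lam):
    return entropy(mixture(lam)) - (1 - lam) * entropy(p0) - lam * entropy(p1)

# Derivative of I(X;A) with respect to lambda.
lam, eps = 0.3, 1e-6
finite_diff = (mi(lam + eps) - mi(lam - eps)) / (2 * eps)
print(finite_diff, kl(p1, mixture(lam)) - kl(p0, mixture(lam)))  # these agree

# The tangent line at lambda = 0 is an upper bound on I(X;A).
for lam in [0.01, 0.1, 0.5, 0.9]:
    assert mi(lam) <= lam * kl(p1, p0)
```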

There's a remarkable interpretation of our upper bound in terms of predictive surprisal, which goes as follows. Suppose we want to predict the variable $X.$ Given access to $A,$ we can predict $X$ with only $(1 - \lambda) H(p_0) + \lambda H(p_1)$ nats of surprisal per sample by using $p_0$ when $A = 0$ and $p_1$ otherwise. On the other hand, always using $p_0$ as our predictive distribution costs $(1 - \lambda) H(p_0) + \lambda (H(p_1) + \KL(p_1 \Vert p_0))$ nats of surprisal per sample, which is only $\lambda \KL(p_1 \Vert p_0)$ more than what we paid when we knew $A.$ Since mutual information bounds the extent by which knowledge of $A$ can decrease our average surprisal compared to the best predictor that is ignorant of $A,$ this proves that $I(X;A) \le \lambda \KL(p_1 \Vert p_0)$!
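
The accounting in this argument is simple enough to spell out in a few lines (again an illustrative sketch, not anything from the post):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])
lam = 0.3
p_lam = (1 - lam) * p0 + lam * p1

cost_knowing_A = (1 - lam) * entropy(p0) + lam * entropy(p1)  # predict with p0 or p1 as told by A
cost_always_p0 = cross_entropy(p_lam, p0)                     # ignore A, always predict with p0

print(cost_always_p0 - cost_knowing_A, lam * kl(p1, p0))  # the gap is exactly lambda * KL(p1 || p0)
print(entropy(p_lam) - cost_knowing_A <= cost_always_p0 - cost_knowing_A)  # I(X;A) is at most that gap
```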

Another point of view on our approximation comes from that classic Bayesian example of testing for a rare condition. If there is a small chance that $A = 1$ and we observe the result of a diagnostic test $X,$ how much do we expect to learn?

If we're already pretty sure that $A = 0$ and this turns out to be the case, we don't expect to get much information from the test. This is true in terms of information gain; in response to a value $X,$ Bayes' rule instructs us to update the log-probability of the hypothesis $A = 0$ by $\ln (p_0(X)/p_\lambda(X)),$ but in expectation over $X \sim p_0$ this causes only an update of $\E_{p_0} \left[ \ln \frac{p_0}{p_\lambda} \right] = \KL(p_0 \Vert p_\lambda) = O(\lambda^2).$ The main value of such a test is in detecting the unexpected event that $A = 1.$ Again thinking in terms of information gain, we compute that the expected update to the log-probability of $A = 1$ when $X$ is drawn from $p_1$ is $\E_{p_1} \left[ \ln \frac{p_1}{p_\lambda} \right] = \KL(p_1 \Vert p_0) + O(\lambda).$ Since $A = 1$ with probability $\lambda,$ we conclude overall that the test result $X$ gives us $\lambda \KL(p_1 \Vert p_0) + O(\lambda^2)$ nats of information about $A$ on average, in agreement with our calculation above.
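
Here is the rare-condition bookkeeping as a numerical sketch (the distributions and the prevalence $\lambda$ are invented for illustration):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p0 = np.array([0.7, 0.2, 0.1])  # distribution of the test result when A = 0
p1 = np.array([0.1, 0.3, 0.6])  # distribution of the test result when A = 1
lam = 0.01                      # the condition is rare
p_lam = (1 - lam) * p0 + lam * p1

gain_if_healthy = kl(p0, p_lam)  # expected information gain when A = 0: O(lambda^2)
gain_if_sick = kl(p1, p_lam)     # expected information gain when A = 1: close to KL(p1 || p0)

print(gain_if_healthy)
print(gain_if_sick, kl(p1, p0))
# Averaging over A recovers I(X;A), which is close to lambda * KL(p1 || p0).
print((1 - lam) * gain_if_healthy + lam * gain_if_sick, lam * kl(p1, p0))
```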

You can see the behavior of $I(X;A)$ for small values of $\lambda$ in the widget below. I've graphed $\lambda$ on a logarithmic scale. The shaded region shows the range $0 \le I(X;A) \le H(A)$ of possible values mutual information could take, the bold line shows the true value of $I(X;A),$ and the faint line shows the upper bound of $\lambda \KL(p_1 \Vert p_0).$

As you move the center of the red distribution to the right, our upper bound becomes worse because the approximation $\KL(p_1 \Vert p_\lambda) = \KL(p_1 \Vert p_0) + O(\lambda)$ deteriorates—the function $\KL(p_1 \Vert q)$ is coming closer to having a singularity at $q = p_0.$

Finally, let's look at higher-order terms of the series for $I(X;A).$ A simple calculation shows that
\begin{align*} I(X;A) = \lambda \KL(p_1 \Vert p_0) - \sum_{k = 2}^\infty \frac{(-1)^k}{k(k - 1)} C_k \lambda^k \end{align*}
where $C_k$ are the moments $C_k = \E_{p_0} \left(\frac{p_1 - p_0}{p_0}\right)^k.$ One way to derive this is by thinking about the power series for entropy $H(p)$ as a function of the density function $p.$ It suffices to know the power series for $-(q + p) \ln (q + p)$ in $p,$ which is
\begin{align*} -(q + p) \ln (q + p) = - q \ln q - p \ln q - p - \sum_{k = 2}^\infty \frac{(-1)^k}{k(k - 1)} \frac{p^k}{q^{k - 1}}. \end{align*}
(When we integrate over $x,$ the term $-p$ drops out because our perturbation has zero total mass.) One interesting feature of this series is its second-order term, which involves the quantity $\E_{p_0} \left( \frac{p_1 - p_0}{p_0} \right)^2 = \int \frac{(p_1 - p_0)^2}{p_0} \, dx.$ This is sometimes called the Neyman $\chi^2$ divergence, another well-known $f$-divergence used in hypothesis testing.
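
The series is easy to test against the exact value of $I(X;A)$ for small $\lambda$; here is a truncated version (my own sketch; the truncation order and the example distributions are arbitrary, and the series only converges for small enough $\lambda$):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])

def mi(lam):
    p_lam = (1 - lam) * p0 + lam * p1
    return entropy(p_lam) - (1 - lam) * entropy(p0) - lam * entropy(p1)

def mi_series(lam, order=10):
    ratio = (p1 - p0) / p0
    total = lam * kl(p1, p0)
    for k in range(2, order + 1):
        c_k = np.sum(p0 * ratio ** k)  # C_k = E_{p0}[((p1 - p0)/p0)^k]
        total -= (-1) ** k / (k * (k - 1)) * c_k * lam ** k
    return total

for lam in [0.02, 0.05, 0.1]:
    print(lam, mi(lam), mi_series(lam))  # the truncated series matches closely
```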

We've seen so far that, besides the entropies $H(p_0)$ and $H(p_1),$ the function $H(p_\lambda)$ tells us about $\KL(p_0 \Vert p_1),$ $\operatorname{JSD}(p_0, p_1)$ and the $\chi^2$ divergence $\chi^2(p_0, p_1).$ These quantities, along with the mutual information $I(X;A)$ at every $\lambda,$ can all be written in the form $\E_{p_0} f\left( \frac{p_0}{p_1} \right)$ for some functions $f.$ In fact, our series expansion says more: $H(p_\lambda)$ determines all the moments of $p_1(X)/p_0(X)$ under $X \sim p_0$! Under some regularity conditions—for example, if $p_0$ and $p_1$ have finite support—we conclude that $H(p_\lambda)$ fully describes the distribution of $p_0/p_1.$ Note that the distribution of $p_0/p_1$ under $p_0$ determines its distribution under $p_1$ and indeed under $p_\lambda$ for any $\lambda,$ so we can speak generally of the "distribution of likelihood ratios between two distributions."

Another way to state our conclusion is that, if we take the manifold of positive distributions over a finite set and equip it with only the function $H$ and the ability to parameterize straight lines between distributions, then the distribution of likelihood ratios between a pair of distributions is an isomorphism invariant. Neat!
