
Entropy of a Mixture

Given a pair $(p_0, p_1)$ of probability density functions and an interpolation factor $\lambda \in [0, 1],$ consider the mixture $p_\lambda = (1 - \lambda) p_0 + \lambda p_1.$ How does the entropy $H(p_\lambda) = -\E_{X \sim p_\lambda} \ln p_\lambda(X)$ of this mixture vary as a function of $\lambda$? The widget below shows what this situation looks like for a pair of discrete distributions on the plane. (Drag to adjust $\lambda.$)
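
To make the curve concrete, here is a minimal numerical sketch (mine, not part of the post's widget): it sweeps $\lambda$ for an arbitrary pair of three-outcome distributions. The arrays `p0` and `p1` are made up for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])

for lam in np.linspace(0, 1, 6):
    p_lam = (1 - lam) * p0 + lam * p1
    print(f"lambda = {lam:.1f}   H(p_lambda) = {entropy(p_lam):.4f}")
```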

It turns out that entropy is concave as a function of probabilities. This explains the "upward bulge" of our curve. It might also be intuitive to you that, when $p_0$ and $p_1$ are less similar, this curve should bulge upwards more. What exactly does its shape tell us? In this short post, we'll see how questions about $H(p_\lambda)$ lead to information about the relationship between $p_0$ and $p_1.$ For example, we'll see how Jensen-Shannon divergence, KL divergence, and $\chi^2$ divergence can all be read off from $H(p_\lambda).$ Actually, this function seems to hold a lot of information about the relationship between $p_0$ and $p_1$—perhaps everything you would ever want to know.

Mutual Information

Entropy being concave means that the inequality $H(p_\lambda) \ge (1 - \lambda) H(p_0) + \lambda H(p_1)$ holds for all pairs $(p_0, p_1)$ and scalars $\lambda \in [0, 1].$ Actually, this kind of inequality has a very important interpretation: in general, the gap in a Jensen inequality on entropy measures mutual information.

To see how this works, let's introduce a Bernoulli variable $A.$ Let $A = 1$ with probability $\lambda$ and $A = 0$ otherwise, and let $X$ be a draw from $p_1$ when $A = 1$ and a draw from $p_0$ when $A = 0.$ Overall, $X \sim p_\lambda.$ In terms of the variables $X$ and $A,$ the weighted average of entropies $(1 - \lambda) H(p_0) + \lambda H(p_1)$ is called the conditional entropy $H(X|A).$ We can interpret conditional entropy as the average surprisal we pay on our prediction of $X$ if we're given the value of $A$ in advance. Our inequality above states that $H(X|A)$ is never larger than $H(X),$ which I hope is intuitive information-theoretically! Mutual information $I(X;A)$ measures the gap $H(X) - H(X|A)$; in other words, $H(p_\lambda) = (1 - \lambda) H(p_0) + \lambda H(p_1) + I(X;A).$ For two variables $X$ and $A$ in general, we would write $H(\E_A p(x|A)) = \E_A H(p(x|A)) + I(X;A)$ where $p(x|A)$ is understood as a random distribution depending on $A.$ The bound $H(\E_A p(x|A)) \ge \E_A H(p(x|A))$ implied by non-negativity of mutual information is called a Jensen inequality, and it follows from concavity of $H.$
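
Here is a quick numerical check of this decomposition (a sketch of mine, using the same kind of made-up discrete example): we build the joint distribution of $(A, X),$ compute $I(X;A)$ directly from the joint table, and compare $H(p_\lambda)$ against $(1 - \lambda) H(p_0) + \lambda H(p_1) + I(X;A).$

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])
lam = 0.3

# Joint distribution of (A, X): row 0 is A = 0, row 1 is A = 1.
joint = np.vstack([(1 - lam) * p0, lam * p1])
p_lam = joint.sum(axis=0)   # marginal of X, equal to the mixture
p_a = joint.sum(axis=1)     # marginal of A, equal to (1 - lam, lam)

# Mutual information computed from the joint table.
mi = sum(
    joint[a, x] * np.log(joint[a, x] / (p_a[a] * p_lam[x]))
    for a in range(2) for x in range(len(p_lam)) if joint[a, x] > 0
)

lhs = entropy(p_lam)
rhs = (1 - lam) * entropy(p0) + lam * entropy(p1) + mi
print(lhs, rhs)  # these agree
```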

Now, the idea that $I(X;A)$ measures how much knowing $A$ decreases our expected surprisal about $X$ leads to a way to express $I(X;A)$ in terms of KL divergences between the distributions $p_0,$ $p_1$ and $p_\lambda.$ Explicitly,
\begin{align*} I(X;A) & = \E_{A,X} \bigl[ \ln p(X|A) - \ln p(X) \bigr] \\ & = \E_A \KL(\text{posterior on } X \Vert \text{prior on } X) \\ & = (1 - \lambda) \KL(p_0 \Vert p_\lambda) + \lambda \KL(p_1 \Vert p_\lambda). \end{align*}
We can also put our two variables in the other order and ask how much $X$ tells us about $A.$ Where $\KL(x \Vert y) = x \ln(x/y) + (1 - x) \ln((1 - x)/(1-y))$ denotes the KL divergence between two Bernoulli distributions with rates $x$ and $y,$ we find that
\begin{align*} I(X;A) & = \E_X \KL(\text{posterior on } A \Vert \text{prior on } A) \\ & = \E_X \KL\left( \frac{\lambda p_1}{p_\lambda} \middle \Vert \lambda \right) \\ & = \E_X \biggl[ -\frac{(1 - \lambda) p_0}{p_\lambda} \left( \ln (1 - \lambda) - \ln \frac{(1 - \lambda) p_0}{p_\lambda} \right) \\ & \phantom{= \E_X \biggl[} - \frac{\lambda p_1}{p_\lambda} \left( \ln \lambda - \ln \frac{\lambda p_1}{p_\lambda} \right) \biggr] \\ & = \E_X \biggl[ \frac{(1 - \lambda) p_0}{p_\lambda} \ln \frac{p_0}{p_\lambda} + \frac{\lambda p_1}{p_\lambda} \ln \frac{p_1}{p_\lambda} \biggr], \end{align*}
which is the same as our earlier expression with the order of integration swapped. (Just write the expectation as an integral and cancel out the factors of $p_\lambda$!) There are a lot of ways to think about this kind of expression.
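
Since these manipulations are easy to get wrong, here is a small sanity check (again my own sketch with arbitrary inputs): it evaluates the first form, the average KL divergence from each component to the mixture, and the second form, the expected Bernoulli KL divergence between the posterior and prior on $A,$ and confirms they agree.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def bernoulli_kl(x, y):
    """KL divergence between Bernoulli distributions with rates x and y."""
    total = 0.0
    if x > 0:
        total += x * np.log(x / y)
    if x < 1:
        total += (1 - x) * np.log((1 - x) / (1 - y))
    return total

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])
lam = 0.3
p_lam = (1 - lam) * p0 + lam * p1

# First form: average KL divergence from each component to the mixture.
form_1 = (1 - lam) * kl(p0, p_lam) + lam * kl(p1, p_lam)

# Second form: expected KL divergence between the posterior and prior on A.
posterior = lam * p1 / p_lam
form_2 = np.sum(p_lam * np.array([bernoulli_kl(r, lam) for r in posterior]))

print(form_1, form_2)  # these agree
```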

After a little more rearranging, we can also show that $I(X;A)$ is an $f$-divergence between the distributions $p_0$ and $p_1,$ meaning it can be written in the form $\E_{p_0} f(p_0/p_1)$ for some convex function $f$ with a few technical properties. When $\lambda = 1/2,$ $I(X;A)$ is called the Jensen-Shannon divergence. Jensen-Shannon divergence sounds like a pretty natural measurement of the discrepancy between two distributions; if I secretly flip a coin and send you either a draw from $p_0$ or $p_1$ depending on the outcome, $\operatorname{JSD}(p_0, p_1)$ measures how much I've implicitly told you about my coin flip. Unlike KL divergence, it is bounded and symmetric.
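
As a concrete check (an illustrative sketch, not from the post), the Jensen-Shannon divergence is exactly the Jensen gap of entropy at $\lambda = 1/2$:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])

# JSD(p0, p1) as the Jensen gap of entropy at lambda = 1/2.
m = 0.5 * (p0 + p1)
jsd = entropy(m) - 0.5 * (entropy(p0) + entropy(p1))
print(jsd)  # symmetric in p0, p1 and never larger than ln 2
```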

"Proclivity"

So far, we've seen that the bulge in the curve $H(p_\lambda)$ measures mutual information $I(X;A)$ between a draw from $p_\lambda$ and a fictitious latent variable $A,$ and we've remembered how to write down $I(X;A)$ explicitly. However, what's really interesting is how this quantity changes as a function of $\lambda.$ For starters, is it always true that "mixing" $p_0$ with a small factor of another distribution $p_1$ increases entropy? A little thought should convince you this is not necessarily true unless $p_0$ is a Dirac distribution: any infinitesimal perturbation of $p_0$ can be achieved by "mixing in" some distribution $p_1,$ and the only local minima of entropy are the Dirac distributions.

To find out when mixing in $p_1$ increases entropy, let's differentiate. Using that $\frac{d}{d \lambda}_{\lambda = 0} H(p + \lambda q) = -\int q \ln p \, dx$ for a perturbation $q$ with $\int q \, dx = 0,$ we find
\begin{align*} \frac{d}{d \lambda}_{\lambda = 0} H(p_\lambda) & = \int (p_0 - p_1) \ln p_0 \, dx \\ & = H(p_1, p_0) - H(p_0) \end{align*}
where $H(p_1, p_0)$ denotes the cross-entropy $-\E_{p_1} \ln p_0.$ (Throughout this post, we'll assume that $p_0 > 0$ and that limits mercifully commute with integrals. This second assumption is always true, for example, when $p_0$ and $p_1$ are distributions over a finite set.)
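
A finite-difference check of this derivative formula, as a quick sketch with the same made-up discrete distributions (here `cross_entropy(p1, p0)` stands for $H(p_1, p_0)$):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])

eps = 1e-6
finite_diff = (entropy((1 - eps) * p0 + eps * p1) - entropy(p0)) / eps
derivative = cross_entropy(p1, p0) - entropy(p0)  # H(p1, p0) - H(p0)
print(finite_diff, derivative)  # these agree up to O(eps)
```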

I'm not sure if there's any name for this quantity $H(p_1, p_0) - H(p_0).$ I've started calling it "proclivity" and denoting it by $\operatorname{Pc}(p_1, p_0).$ We can relate proclivity to entropy and KL divergence by $\operatorname{Pc}(p_1, p_0) = \KL(p_1 \Vert p_0) + H(p_1) - H(p_0)$ but, unlike KL divergence, proclivity can either be positive or negative. For instance, $\operatorname{Pc}(\delta_y, p) = \E_p \ln p(x) - \ln p(y)$ when $\delta_y$ is a Dirac distribution, and $\ln p(y)$ can certainly be either larger or smaller than its average value. You can play around with the proclivity between two distributions in the widget below. (Drag to adjust the center of the red distribution.)

Proclivity also arises when differentiating cross-entropy with respect to temperature. Explicitly, given a family of distributions $p_\beta \propto \exp(-\beta H)$ (here $H$ is an energy function, not an entropy) depending on an inverse temperature $\beta,$ it turns out that $\frac{d}{d \beta}_{\beta = 1} H(q, p_\beta) = \operatorname{Pc}(q, p),$ where $p$ denotes $p_{\beta = 1}.$ This is the justification for my name: the proclivity $\operatorname{Pc}(q, p)$ is negative exactly when $p$ is "predisposed" to being a good predictive distribution for $q,$ in the sense that decreasing temperature causes a first-order decrease in cross-entropy. If we arrange the four quantities of the form $-\E_{p_i} \ln p_j$ in a little square, proclivities go at right angles to KL divergences:
\begin{CD} H(q) @>\operatorname{Pc}(p, q)>> H(p, q) \\ @V\KL(q \Vert p)VV @AA\KL(p \Vert q)A\\ H(q, p) @<<\operatorname{Pc}(q, p)< H(p) \end{CD}
(In this diagram, an arrow $A \xrightarrow{C} B$ stands for an equation $A + C = B.$)
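
To see the temperature identity numerically, here is another small sketch of mine: I take the energy to be $-\ln p,$ so that $p_\beta \propto p^\beta$ and $p_{\beta = 1} = p,$ and compare a finite difference in $\beta$ against the proclivity.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])  # base distribution; p_beta is proportional to p**beta
q = np.array([0.1, 0.3, 0.6])

def p_beta(beta):
    w = p ** beta
    return w / w.sum()

eps = 1e-6
finite_diff = (cross_entropy(q, p_beta(1 + eps))
               - cross_entropy(q, p_beta(1 - eps))) / (2 * eps)
proclivity = cross_entropy(q, p) - entropy(p)  # Pc(q, p) = H(q, p) - H(p)
print(finite_diff, proclivity)  # these agree
```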

Mutual Information for Small $\lambda$

The derivative of mutual information is a more familiar quantity. In general you can check that $\frac{d}{d \lambda} I(X;A) = \KL(p_1 \Vert p_\lambda) - \KL(p_0 \Vert p_\lambda),$ and at $\lambda = 0$ we simply have $I(X;A) = \lambda \KL(p_1 \Vert p_0) + O(\lambda^2).$ Moreover, by concavity of $I(X;A)$ in $\lambda,$ the first-order approximation is an upper bound: $I(X;A) \le \lambda \KL(p_1 \Vert p_0).$ To wrap up this post, let's think about what this approximation means and take a look at the higher-order terms of the Taylor series for $I(X;A)$ near $\lambda = 0.$
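
Both the derivative formula and the tangent-line bound are easy to confirm numerically; here is a sketch along the same lines as the earlier ones.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])

def mixture(lam):
    return (1 - lam) * p0 + lam * p1

def mi(lam):
    return entropy(mixture(lam)) - (1 - lam) * entropy(p0) - lam * entropy(p1)

# Derivative of I(X;A) with respect to lambda.
lam, eps = 0.3, 1e-6
finite_diff = (mi(lam + eps) - mi(lam - eps)) / (2 * eps)
print(finite_diff, kl(p1, mixture(lam)) - kl(p0, mixture(lam)))  # these agree

# The tangent line at lambda = 0 is an upper bound on I(X;A).
for lam in [0.01, 0.1, 0.5, 0.9]:
    assert mi(lam) <= lam * kl(p1, p0)
```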

There's a remarkable interpretation of our upper bound in terms of predictive surprisal, which goes as follows. Suppose we want to predict the variable $X.$ Given access to $A,$ we can predict $X$ with only $(1 - \lambda) H(p_0) + \lambda H(p_1)$ nats of surprisal per sample by using $p_0$ when $A = 0$ and $p_1$ otherwise. On the other hand, always using $p_0$ as our predictive distribution costs $(1 - \lambda) H(p_0) + \lambda (H(p_1) + \KL(p_1 \Vert p_0))$ nats of surprisal per sample, which is only $\lambda \KL(p_1 \Vert p_0)$ more than what we paid when we knew $A.$ Since mutual information bounds the extent by which knowledge of $A$ can decrease our average surprisal compared to the best predictor that is ignorant of $A,$ this proves that $I(X;A) \le \lambda \KL(p_1 \Vert p_0)$!
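
The accounting in this argument is simple enough to spell out in a few lines (again an illustrative sketch, not anything from the post):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])
lam = 0.3
p_lam = (1 - lam) * p0 + lam * p1

cost_knowing_A = (1 - lam) * entropy(p0) + lam * entropy(p1)  # predict with p0 or p1 as told by A
cost_always_p0 = cross_entropy(p_lam, p0)                     # ignore A, always predict with p0

print(cost_always_p0 - cost_knowing_A, lam * kl(p1, p0))  # the gap is exactly lambda * KL(p1 || p0)
print(entropy(p_lam) - cost_knowing_A <= cost_always_p0 - cost_knowing_A)  # I(X;A) is at most that gap
```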

Another point of view on our approximation comes from that classic Bayesian example of testing for a rare condition. If there is a small chance that $A = 1$ and we observe the result of a diagnostic test $X,$ how much do we expect to learn?

If we're already pretty sure that $A = 0$ and this turns out to be the case, we don't expect to get much information from the test. This is true in terms of information gain; in response to a value $X,$ Bayes' rule instructs us to update the log-probability of the hypothesis $A = 0$ by $\ln (p_0(X)/p_\lambda(X)),$ but in expectation over $X \sim p_0$ this causes only an update of $\E_{p_0} \left[ \ln \frac{p_0}{p_\lambda} \right] = \KL(p_0 \Vert p_\lambda) = O(\lambda^2).$ The main value of such a test is in detecting the unexpected event that $A = 1.$ Again thinking in terms of information gain, we compute that the expected update to the log-probability of $A = 1$ when $X$ is drawn from $p_1$ is $\E_{p_1} \left[ \ln \frac{p_1}{p_\lambda} \right] = \KL(p_1 \Vert p_0) + O(\lambda).$ Since $A = 1$ with probability $\lambda,$ we conclude overall that the test result $X$ gives us $\lambda \KL(p_1 \Vert p_0) + O(\lambda^2)$ nats of information about $A$ on average, in agreement with our calculation above.
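
Here is the rare-condition bookkeeping as a numerical sketch (the distributions and the prevalence $\lambda$ are invented for illustration):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p0 = np.array([0.7, 0.2, 0.1])  # distribution of the test result when A = 0
p1 = np.array([0.1, 0.3, 0.6])  # distribution of the test result when A = 1
lam = 0.01                      # the condition is rare
p_lam = (1 - lam) * p0 + lam * p1

gain_if_healthy = kl(p0, p_lam)  # expected information gain when A = 0: O(lambda^2)
gain_if_sick = kl(p1, p_lam)     # expected information gain when A = 1: close to KL(p1 || p0)

print(gain_if_healthy)
print(gain_if_sick, kl(p1, p0))
# Averaging over A recovers I(X;A), which is close to lambda * KL(p1 || p0).
print((1 - lam) * gain_if_healthy + lam * gain_if_sick, lam * kl(p1, p0))
```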

You can see the behavior of $I(X;A)$ for small values of $\lambda$ in the widget below. I've graphed $\lambda$ on a logarithmic scale. The shaded region shows the range $0 \le I(X;A) \le H(A)$ of possible values mutual information could take, the bold line shows the true value of $I(X;A),$ and the faint line shows the upper bound of $\lambda \KL(p_1 \Vert p_0).$

As you move the center of the red distribution to the right, our upper bound becomes worse because the approximation $\KL(p_1 \Vert p_\lambda) = \KL(p_1 \Vert p_0) + O(\lambda)$ deteriorates—the function $\KL(p_1 \Vert q)$ is coming closer to having a singularity at $q = p_0.$

Finally, let's look at higher-order terms of the series for $I(X;A).$ A simple calculation shows that
\begin{align*} I(X;A) = \lambda \KL(p_1 \Vert p_0) - \sum_{k = 2}^\infty \frac{(-1)^k}{k(k - 1)} C_k \lambda^k \end{align*}
where $C_k$ are the moments $C_k = \E_{p_0} \left(\frac{p_1 - p_0}{p_0}\right)^k.$ One way to derive this is by thinking about the power series for entropy $H(p)$ as a function of the density function $p.$ It suffices to know the power series for $-(q + p) \ln (q + p)$ in $p,$ which is
\begin{align*} -(q + p) \ln (q + p) = - q \ln q - p \ln q - p - \sum_{k = 2}^\infty \frac{(-1)^k}{k(k - 1)} \frac{p^k}{q^{k - 1}}. \end{align*}
(When we integrate over $x,$ the term $-p$ drops out because our perturbation has zero total mass.) One interesting feature of this series is its second-order term, which involves the quantity $\E_{p_0} \left( \frac{p_1 - p_0}{p_0} \right)^2 = \int \frac{(p_1 - p_0)^2}{p_0} \, dx.$ This is sometimes called the Neyman $\chi^2$ divergence, another well-known $f$-divergence used in hypothesis testing.
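
The series is easy to test against the exact value of $I(X;A)$ for small $\lambda$; here is a truncated version (my own sketch; the truncation order and the example distributions are arbitrary, and the series only converges for small enough $\lambda$):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])

def mi(lam):
    p_lam = (1 - lam) * p0 + lam * p1
    return entropy(p_lam) - (1 - lam) * entropy(p0) - lam * entropy(p1)

def mi_series(lam, order=10):
    ratio = (p1 - p0) / p0
    total = lam * kl(p1, p0)
    for k in range(2, order + 1):
        c_k = np.sum(p0 * ratio ** k)  # C_k = E_{p0}[((p1 - p0)/p0)^k]
        total -= (-1) ** k / (k * (k - 1)) * c_k * lam ** k
    return total

for lam in [0.02, 0.05, 0.1]:
    print(lam, mi(lam), mi_series(lam))  # the truncated series matches closely
```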

We've seen so far that, besides the entropies $H(p_0)$ and $H(p_1),$ the function $H(p_\lambda)$ tells us about $\KL(p_0 \Vert p_1),$ $\operatorname{JSD}(p_0, p_1)$ and the $\chi^2$ divergence $\chi^2(p_0, p_1).$ These quantities, along with the mutual information $I(X;A)$ at every $\lambda,$ can all be written in the form $\E_{p_0} f\left( \frac{p_0}{p_1} \right)$ for some functions $f.$ In fact, our series expansion says more: $H(p_\lambda)$ determines all the moments of $p_1(X)/p_0(X)$ under $X \sim p_0$! Under some regularity conditions—for example, if $p_0$ and $p_1$ have finite support—we conclude that $H(p_\lambda)$ fully describes the distribution of $p_0/p_1.$ Note that the distribution of $p_0/p_1$ under $p_0$ determines its distribution under $p_1$ and indeed under $p_\lambda$ for any $\lambda,$ so we can speak generally of the "distribution of likelihood ratios between two distributions."

Another way to state our conclusion is that, if we take the manifold of positive distributions over a finite set and equip it with only the function $H$ and the ability to parameterize straight lines between distributions, then the distribution of likelihood ratios between a pair of distributions is an isomorphism invariant. Neat!
