Given a pair of probability density functions $p$ and $q$ and an interpolation factor $t \in [0, 1]$, consider the mixture $m_t = (1 - t)\,p + t\,q$. How does the entropy $H(m_t)$ of this mixture vary as a function of $t$? The widget below shows what this situation looks like for a pair of discrete distributions on the plane. (Drag to adjust $t$.)
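If you'd like to trace out this curve numerically, here is a minimal sketch (just NumPy, with two arbitrary toy distributions standing in for the widget's $p$ and $q$; none of these names come from the widget itself):

```python
import numpy as np

def entropy(d):
    # Shannon entropy in nats; assumes d is a normalized array of positive probabilities.
    return -np.sum(d * np.log(d))

# Two arbitrary discrete distributions standing in for p and q.
p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.2, 0.6])

for t in np.linspace(0.0, 1.0, 6):
    m_t = (1 - t) * p + t * q  # the mixture m_t = (1 - t) p + t q
    print(f"t = {t:.1f}   H(m_t) = {entropy(m_t):.4f}")
```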
It turns out that entropy is concave as a function of probabilities. This explains the "upward bulge" of our curve. It might also be intuitive to you that, when $p$ and $q$ are less similar, this curve should bulge upwards more. What exactly does its shape tell us? In this short post, we'll see how questions about $H(m_t)$ lead to information about the relationship between $p$ and $q$. For example, we'll see how Jensen-Shannon divergence, KL divergence, and $\chi^2$ divergence can all be read off from $H(m_t)$. Actually, this function seems to hold a lot of information about the relationship between $p$ and $q$; perhaps everything you would ever want to know.
Entropy being concave means that the inequality $$H\big((1-t)\,p + t\,q\big) \;\ge\; (1-t)\,H(p) + t\,H(q)$$ holds for all pairs $p, q$ and scalars $t \in [0, 1]$. Actually, this kind of inequality has a very important interpretation: in general, the gap in a Jensen inequality on entropy measures mutual information.
To see how this works, let's introduce a Bernoulli variable $Z$. Let $Z = 1$ with probability $t$ and $Z = 0$ otherwise, and let $X$ be a draw from $q$ when $Z = 1$ and a draw from $p$ when $Z = 0$. Overall, $X \sim m_t$. In terms of the variables $X$ and $Z$, the weighted average of entropies $(1-t)\,H(p) + t\,H(q)$ is called the conditional entropy $H(X \mid Z)$. We can interpret conditional entropy as the average surprisal we pay on our prediction of $X$ if we're given the value of $Z$ in advance. Our inequality above states that $H(X \mid Z)$ is never larger than $H(X) = H(m_t)$, which I hope is intuitive information-theoretically! Mutual information measures the gap; in other words, $I(X; Z) = H(X) - H(X \mid Z)$. For two variables $X$ and $Z$ in general, we would write $$I(X; Z) = H(X) - \mathbb{E}_Z\big[H(X \mid Z)\big],$$ where $X \mid Z$ is understood as a random distribution depending on $Z$. The bound $H(X) \ge \mathbb{E}_Z\big[H(X \mid Z)\big]$ implied by non-negativity of mutual information is called a Jensen inequality, and it follows from concavity of $H$.
Now, the idea that $I(X; Z)$ measures how much knowing $Z$ decreases our expected surprisal about $X$ leads to a way to express $I(X; Z)$ in terms of KL divergences between the distributions $p$, $q$, and $m_t$. Explicitly, $$I(X; Z) = (1-t)\,D_{\mathrm{KL}}(p \,\|\, m_t) + t\,D_{\mathrm{KL}}(q \,\|\, m_t).$$ We can also put our two variables in the other order and ask how much $X$ tells us about $Z$. Where $d_{\mathrm{KL}}(a, b)$ denotes the KL divergence between two Bernoulli distributions with rates $a$ and $b$, we find that $$I(X; Z) = \mathbb{E}_{x \sim m_t}\!\left[ d_{\mathrm{KL}}\!\left( \frac{t\,q(x)}{m_t(x)},\; t \right) \right],$$ which is the same as our earlier expression with the order of integration swapped. (Just write the expectation as an integral and cancel out the factors of $m_t(x)$!) There are a lot of ways to think about this kind of expression.
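As a numerical sanity check (my own sketch, not from the post; the toy distributions and helper names are arbitrary), the following verifies that the Jensen gap, the mixture-KL expression, and the Bernoulli-KL expression all agree:

```python
import numpy as np

def entropy(d):
    return -np.sum(d * np.log(d))

def kl(a, b):
    # D_KL(a || b) in nats for discrete distributions with common support.
    return np.sum(a * np.log(a / b))

def bernoulli_kl(a, b):
    # d_KL between Bernoulli(a) and Bernoulli(b), applied elementwise.
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.2, 0.6])
t = 0.3
m = (1 - t) * p + t * q

jensen_gap = entropy(m) - ((1 - t) * entropy(p) + t * entropy(q))
kl_form    = (1 - t) * kl(p, m) + t * kl(q, m)
posterior  = t * q / m                      # P(Z = 1 | X = x)
bern_form  = np.sum(m * bernoulli_kl(posterior, t))

print(jensen_gap, kl_form, bern_form)       # all three agree
```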
After a little more rearranging, we can also show that $I(X; Z)$ is an $f$-divergence between the distributions $p$ and $q$, meaning it can be written in the form $$\int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx$$ for some convex function $f$ with a few technical properties. When $t = \tfrac{1}{2}$, $I(X; Z)$ is called the Jensen-Shannon divergence. Jensen-Shannon divergence sounds like a pretty natural measurement of the discrepancy between two distributions; if I secretly flip a coin and send you either a draw from $p$ or a draw from $q$ depending on the outcome, $\mathrm{JSD}(p, q)$ measures how much I've implicitly told you about my coin flip. Unlike KL divergence, it is bounded and symmetric.
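For what it's worth, here is a quick check (again with my own toy distributions) that the $t = \tfrac{1}{2}$ case is symmetric and never exceeds $\log 2$:

```python
import numpy as np

def entropy(d):
    return -np.sum(d * np.log(d))

def jsd(p, q):
    # Jensen-Shannon divergence: the Jensen gap of entropy at t = 1/2.
    m = 0.5 * (p + q)
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))

p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.2, 0.6])
print(jsd(p, q), jsd(q, p), np.log(2))   # symmetric, and bounded by log 2
```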
So far, we've seen that the bulge in the curve $t \mapsto H(m_t)$ measures mutual information between a draw from $m_t$ and a fictitious latent variable $Z$, and we've remembered how to write down $I(X; Z)$ explicitly. However, what's really interesting is how this quantity changes as a function of $t$. For starters, is it always true that "mixing" $p$ with a small factor of another distribution $q$ increases entropy? A little thought should convince you this is not necessarily true unless $p$ is a Dirac distribution: any infinitesimal perturbation of $p$ can be achieved by "mixing in" some distribution $q$, and the only local minima of entropy are the Dirac distributions.
To find out when mixing in $q$ increases entropy, let's differentiate. Using that $\frac{d}{dt} m_t = q - p$, we find $$\frac{d}{dt}\bigg|_{t=0} H(m_t) \;=\; -\int \big(q(x) - p(x)\big) \log p(x)\, dx \;=\; H(q, p) - H(p),$$ where $H(q, p) = -\int q(x) \log p(x)\, dx$ denotes the cross-entropy: mixing in $q$ increases entropy, to first order, exactly when the cross-entropy $H(q, p)$ exceeds the entropy $H(p)$. (Throughout this post, we'll assume that $p$ and $q$ have the same support and that limits mercifully commute with integrals. This second assumption is always true, for example, when $p$ and $q$ are distributions over a finite set.)
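Here is a finite-difference check of this derivative (a sketch, with a small step $\varepsilon$ standing in for the limit and arbitrary toy distributions):

```python
import numpy as np

def entropy(d):
    return -np.sum(d * np.log(d))

def cross_entropy(a, b):
    # H(a, b) = -sum_x a(x) log b(x)
    return -np.sum(a * np.log(b))

p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.2, 0.6])

eps = 1e-6
finite_diff = (entropy((1 - eps) * p + eps * q) - entropy(p)) / eps
exact       = cross_entropy(q, p) - entropy(p)
print(finite_diff, exact)   # agree up to O(eps)
```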
I'm not sure if there's any name for this quantity. I've started calling it "proclivity" and denoting it by $\operatorname{prc}$; I'll fix the sign convention $$\operatorname{prc}(q \,\|\, p) \;:=\; H(p) - H(q, p),$$ so that mixing a little of $q$ into $p$ increases entropy exactly when $\operatorname{prc}(q \,\|\, p)$ is negative. We can relate proclivity to entropy and KL divergence by $$\operatorname{prc}(q \,\|\, p) \;=\; H(p) - H(q) - D_{\mathrm{KL}}(q \,\|\, p),$$ but, unlike KL divergence, proclivity can either be positive or negative. For instance, when $q$ is a Dirac distribution $\delta_{x_0}$, we have $H(q, p) = -\log p(x_0)$, and $-\log p(x_0)$ can certainly be either larger or smaller than its average value $H(p) = \mathbb{E}_{x \sim p}[-\log p(x)]$. You can play around with the proclivity between two distributions in the widget below. (Drag to adjust the center of the red distribution.)
Proclivity also arises when differentiating cross-entropy with respect to temperature. Explicitly, given a family of distributions $p_\beta(x) \propto p(x)^\beta$ depending on an inverse temperature $\beta$, it turns out that $$\frac{d}{d\beta}\bigg|_{\beta = 1} H(q, p_\beta) \;=\; -\operatorname{prc}(q \,\|\, p).$$ This is the justification for my name: the proclivity is positive exactly when $p$ is "predisposed" to being a good predictive distribution for $q$, in the sense that decreasing temperature (increasing $\beta$) causes a first-order decrease in cross-entropy. If we arrange the four quantities of the form $H(a, b)$ for $a, b \in \{p, q\}$ (where $H(p, p) = H(p)$ is just the entropy) in a little square, proclivities go at right angles to KL divergences: $$\begin{array}{ccc} H(p) & \xrightarrow{\;D_{\mathrm{KL}}(p \,\|\, q)\;} & H(p, q) \\ \uparrow {\scriptstyle \operatorname{prc}(q \,\|\, p)} & & \downarrow {\scriptstyle \operatorname{prc}(p \,\|\, q)} \\ H(q, p) & \xleftarrow{\;D_{\mathrm{KL}}(q \,\|\, p)\;} & H(q) \end{array}$$ (In this diagram, an arrow from $A$ to $B$ labeled with $c$ stands for the equation $B = A + c$; the proclivity $\operatorname{prc}(p \,\|\, q) = H(q) - H(p, q)$ has the roles of $p$ and $q$ swapped.)
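Here is a numerical check of the temperature identity as I've stated it above (a sketch; `temper` is my own helper implementing $p_\beta \propto p^\beta$):

```python
import numpy as np

def entropy(d):
    return -np.sum(d * np.log(d))

def cross_entropy(a, b):
    return -np.sum(a * np.log(b))

def temper(p, beta):
    # p_beta(x) proportional to p(x)^beta, renormalized.
    w = p ** beta
    return w / np.sum(w)

p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.2, 0.6])

eps = 1e-6
finite_diff = (cross_entropy(q, temper(p, 1 + eps)) - cross_entropy(q, p)) / eps
proclivity  = entropy(p) - cross_entropy(q, p)   # prc(q || p)
print(finite_diff, -proclivity)   # d/dbeta H(q, p_beta) at beta = 1 equals -prc(q || p)
```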
The derivative of mutual information is a more familiar quantity. Writing $I_t := I(X; Z) = H(m_t) - \big[(1-t)\,H(p) + t\,H(q)\big]$, you can check that in general $$\frac{d}{dt} I_t = D_{\mathrm{KL}}(q \,\|\, m_t) - D_{\mathrm{KL}}(p \,\|\, m_t),$$ and at $t = 0$ we simply have $$\frac{d}{dt}\bigg|_{t=0} I_t = D_{\mathrm{KL}}(q \,\|\, p).$$ Moreover, since $I_t$ is a concave function of $t$, the first-order approximation is an upper bound: $$I_t \;\le\; t\, D_{\mathrm{KL}}(q \,\|\, p).$$ To wrap up this post, let's think about what this approximation means and take a look at the higher-order terms of the Taylor series for $I_t$ near $t = 0$.
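A quick finite-difference check of these two derivative formulas (same toy distributions; this is my own sketch, not the widget's code):

```python
import numpy as np

def entropy(d):
    return -np.sum(d * np.log(d))

def kl(a, b):
    return np.sum(a * np.log(a / b))

def mutual_information(t, p, q):
    # I_t as the Jensen gap of entropy at mixture weight t.
    m = (1 - t) * p + t * q
    return entropy(m) - ((1 - t) * entropy(p) + t * entropy(q))

p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.2, 0.6])
t, eps = 0.3, 1e-6
m = (1 - t) * p + t * q

print((mutual_information(t + eps, p, q) - mutual_information(t, p, q)) / eps,
      kl(q, m) - kl(p, m))                                     # general t
print((mutual_information(eps, p, q) - mutual_information(0.0, p, q)) / eps,
      kl(q, p))                                                # at t = 0
```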
There's a remarkable interpretation of our upper bound in terms of predictive surprisal, which goes as follows. Suppose we want to predict the variable $X \sim m_t$. Given access to $Z$, we can predict $X$ with only $H(X \mid Z) = (1-t)\,H(p) + t\,H(q)$ nats of surprisal per sample by using $q$ when $Z = 1$ and $p$ otherwise. On the other hand, always using $p$ as our predictive distribution costs $(1-t)\,H(p) + t\,H(q, p)$ nats of surprisal per sample, which is only $t\,D_{\mathrm{KL}}(q \,\|\, p)$ more than what we paid when we knew $Z$. Since mutual information bounds the extent by which knowledge of $Z$ can decrease our average surprisal compared to the best predictor that is ignorant of $Z$, this proves that $I_t \le t\, D_{\mathrm{KL}}(q \,\|\, p)$!
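The following sketch tabulates the surprisal costs in this argument for a toy example and confirms the bound (the numbers and names are mine):

```python
import numpy as np

def entropy(d):
    return -np.sum(d * np.log(d))

def cross_entropy(a, b):
    return -np.sum(a * np.log(b))

def kl(a, b):
    return np.sum(a * np.log(a / b))

p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.2, 0.6])
t = 0.1
m = (1 - t) * p + t * q

cost_knowing_Z = (1 - t) * entropy(p) + t * entropy(q)           # H(X | Z)
cost_always_p  = (1 - t) * entropy(p) + t * cross_entropy(q, p)  # always predict with p
mutual_info    = entropy(m) - cost_knowing_Z                     # I(X; Z)

print(cost_always_p - cost_knowing_Z, t * kl(q, p))  # equal: the gap is t * D_KL(q || p)
print(mutual_info <= t * kl(q, p))                   # True: the upper bound holds
```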
Another point of view on our approximation comes from that classic Bayesian example of testing for a rare condition. If there is a small chance $t$ that $Z = 1$ and we observe the result $X$ of a diagnostic test, how much do we expect to learn about $Z$?
If we're already pretty sure that $Z = 0$ and this turns out to be the case, we don't expect to get much information from the test. This is true in terms of information gain; in response to a value $X = x$, Bayes' rule instructs us to increase the log likelihood of our belief that $Z = 0$ by $\log\frac{p(x)}{m_t(x)}$, but in expectation over $x \sim p$ this causes only an update of $D_{\mathrm{KL}}(p \,\|\, m_t)$, which is $O(t^2)$. The main value of such a test is in detecting the unexpected event that $Z = 1$. Again thinking in terms of information gain, we compute that the expected update to the log likelihood of $Z = 1$ when $x$ is drawn from $q$ is $D_{\mathrm{KL}}(q \,\|\, m_t) \approx D_{\mathrm{KL}}(q \,\|\, p)$. Since $Z = 1$ with probability $t$, we conclude overall that the test result gives us about $t\,D_{\mathrm{KL}}(q \,\|\, p)$ nats of information about $Z$ on average, in agreement with our calculation above.
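Here is a small numeric version of this story (the test characteristics and the prior are made up for illustration):

```python
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

# p: distribution of the test result when Z = 0; q: when Z = 1 (the rare condition).
p = np.array([0.95, 0.05])    # unaffected patients rarely test positive
q = np.array([0.20, 0.80])    # affected patients usually test positive
t = 0.01                      # prior probability of the rare condition
m = (1 - t) * p + t * q       # marginal distribution of the test result

update_if_unaffected = kl(p, m)   # expected log-likelihood update toward Z = 0: tiny
update_if_affected   = kl(q, m)   # expected update toward Z = 1: roughly D_KL(q || p)
expected_info = (1 - t) * update_if_unaffected + t * update_if_affected

print(update_if_unaffected, update_if_affected, kl(q, p))
print(expected_info, t * kl(q, p))   # both close to t * D_KL(q || p)
```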
You can see the behavior of $I_t$ for small values of $t$ in the widget below. I've graphed $t$ on a logarithmic scale. The shaded region shows the range of possible values mutual information could take, the bold line shows the true value of $I_t$, and the faint line shows the upper bound of $t\,D_{\mathrm{KL}}(q \,\|\, p)$.
As you move the center of the red distribution to the right, our upper bound becomes worse because the first-order approximation deteriorates: the function $t \mapsto I_t$ is coming closer to having a singularity at $t = 0$.
Finally, let's look at higher-order terms of the series for $I_t$. A simple calculation shows that $$I_t \;=\; t\, D_{\mathrm{KL}}(q \,\|\, p) \;-\; \sum_{k=2}^{\infty} \frac{(-1)^k\, t^k}{k(k-1)}\, M_k,$$ where $M_k$ are the moments $$M_k \;=\; \mathbb{E}_{x \sim p}\!\left[\left(\frac{q(x)}{p(x)} - 1\right)^{\!k}\,\right].$$ One way to derive this is by thinking about the power series for entropy as a function of the density function. It suffices to know the power series for $(1+u)\log(1+u)$ in $u$, which is $$(1+u)\log(1+u) \;=\; u + \sum_{k=2}^{\infty} \frac{(-1)^k u^k}{k(k-1)}.$$ One interesting feature of this series is its second-order term, which involves the quantity $$M_2 \;=\; \int \frac{\big(q(x) - p(x)\big)^2}{p(x)}\, dx.$$ This is sometimes called the Neyman $\chi^2$ divergence, another well-known $f$-divergence used in hypothesis testing.
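Here is a check of the expansion as I've written it, comparing a truncation of the series to the exact Jensen gap (my own sketch; convergence needs $t \max_x \lvert q(x)/p(x) - 1 \rvert < 1$):

```python
import numpy as np

def entropy(d):
    return -np.sum(d * np.log(d))

def kl(a, b):
    return np.sum(a * np.log(a / b))

p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.2, 0.6])
t = 0.05
m = (1 - t) * p + t * q

I_true = entropy(m) - ((1 - t) * entropy(p) + t * entropy(q))

# Moments M_k of (q/p - 1) under p; M_2 is the Neyman chi-squared divergence.
r = q / p - 1
M = lambda k: np.sum(p * r ** k)

series = t * kl(q, p) - sum((-1) ** k * t ** k / (k * (k - 1)) * M(k) for k in range(2, 8))
print(I_true, series)   # agree closely once a few terms are included
```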
We've seen so far that, besides the entropies $H(p)$ and $H(q)$, the function $t \mapsto H(m_t)$ tells us about the KL divergences $D_{\mathrm{KL}}(q \,\|\, p)$ and $D_{\mathrm{KL}}(p \,\|\, q)$ and the divergence $M_2$. These quantities, along with the mutual information $I_t$ itself, can all be written in the form $$\mathbb{E}_{x \sim p}\!\left[g\!\left(\frac{q(x)}{p(x)}\right)\right]$$ for some functions $g$. In fact, our series expansion says more: $H(m_t)$ determines all the moments of $\frac{q(x)}{p(x)}$ under $x \sim p$! Under some regularity conditions (for example, if $p$ and $q$ have finite support), we conclude that $H(m_t)$ fully describes the distribution of the likelihood ratio $\frac{q(x)}{p(x)}$ under $x \sim p$. Note that the distribution of $\frac{q(x)}{p(x)}$ under $p$ determines its distribution under $q$, and indeed under $m_t$ for any $t$, so we can speak generally of the "distribution of likelihood ratios between two distributions."
Another way to state our conclusion is that, if we take the manifold of positive distributions over a finite set and equip it with only the entropy function $H$ and the ability to parameterize straight lines between distributions, then the distribution of likelihood ratios between a pair of distributions is an isomorphism invariant. Neat!