For $p, q \in [1, \infty]$ such that $\frac{1}{p} + \frac{1}{q} = 1$, Hölder's inequality states that
$$\int |fg|\,d\mu \;\le\; \left(\int |f|^p\,d\mu\right)^{1/p} \left(\int |g|^q\,d\mu\right)^{1/q}$$
so long as $f$ and $g$ are integrable functions (over any measure space) and all three integrals above converge. Furthermore, this inequality is tight in the sense that, for any $f$, there exists some $g$ making it into a non-trivial equality.
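As a quick sanity check, here is a small Python sketch that verifies the inequality numerically; the discrete measure space (uniform weights on $n$ random points) and the random test functions are just a convenient illustration, not anything canonical:

```python
# Numerically check Hölder's inequality on a discrete measure space:
# uniform weights 1/n on n points, with conjugate exponents 1/p + 1/q = 1.
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 3.0
q = p / (p - 1.0)  # conjugate exponent of p

f = rng.normal(size=n)
g = rng.normal(size=n)

lhs = np.mean(np.abs(f * g))  # ∫ |fg| dμ
rhs = np.mean(np.abs(f) ** p) ** (1 / p) * np.mean(np.abs(g) ** q) ** (1 / q)
assert lhs <= rhs + 1e-12
print(f"{lhs:.4f} <= {rhs:.4f}")
```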
When I was an undergrad, I was a little terrified of inequalities like this one. Their proofs draw on a standard "bag of tricks", but it's not always obvious why to care about the inequalities themselves. Motivation is everything for me, so if I didn't understand why to care I would usually blank on the proof. What is Hölder's inequality really telling us?
One natural interpretation concerns the geometry of the $L^p$-norms $\|f\|_p = \left(\int |f|^p\,d\mu\right)^{1/p}$ for $1 \le p \le \infty$. From this perspective, Hölder's inequality is giving us a family of linear lower bounds on the $L^p$-norm of $f$: whenever $\|g\|_q \le 1$, we have $\|f\|_p \ge \int |fg|\,d\mu$. This is meaningful from the point of view of convex geometry; indeed, convex homogeneous functions are precisely the ones that can be written as suprema of linear functions, and the statement that Hölder's inequality is tight amounts to exactly such a representation for the $L^p$-norm:
$$\|f\|_p = \sup_{\|g\|_q \le 1} \int |fg|\,d\mu.$$

However, last week I was doing some Bayesian statistics and noticed that Hölder's inequality has another, entirely different interpretation in terms of convexity. If we set $\lambda = 1/p$ and $1 - \lambda = 1/q$, our pair of conjugate exponents becomes a pair of coefficients for a convex combination, and Hölder's inequality amounts to the statement that the partition function-style integral
$$Z(f) = \int e^{f}\,d\mu$$
is log-convex with respect to $f$. Indeed, a function $Z$ is log-convex exactly when
$$Z(\lambda f + (1 - \lambda) g) \le Z(f)^{\lambda}\, Z(g)^{1 - \lambda}$$
for all convex combinations $\lambda f + (1 - \lambda) g$ in its domain, and applying Hölder's inequality with $p = 1/\lambda$ and $q = 1/(1 - \lambda)$ to the integral of $e^{\lambda f + (1 - \lambda) g} = \left(e^{f}\right)^{\lambda} \left(e^{g}\right)^{1 - \lambda}$ gives
$$\int e^{\lambda f + (1 - \lambda) g}\,d\mu \;\le\; \left(\int e^{f}\,d\mu\right)^{\lambda} \left(\int e^{g}\,d\mu\right)^{1 - \lambda}.$$

(As an aside, it is not true in general that $\phi^{-1}\!\left(\int \phi(f)\,d\mu\right)$ is convex as a function of the function $f$ whenever $\phi$ is convex and strictly increasing. See my stackexchange question. So, the convexity of the $L^p$-norms and the log-partition function is not true for any simple "generic" reason.)
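To see the log-convexity claim concretely, here is a small numerical sketch (again on a discrete uniform measure, purely for illustration, with $F$ denoting the log-partition function) checking the convexity inequality along a segment between two random functions:

```python
# Check that F(h) = log ∫ e^h dμ is convex in h on a discrete measure space
# (uniform weights 1/n), i.e. F(λf + (1-λ)g) <= λF(f) + (1-λ)F(g).
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
n = 500
F = lambda h: logsumexp(h) - np.log(n)  # log( (1/n) Σ_i e^{h_i} )

f, g = rng.normal(size=n), rng.normal(size=n)
for lam in np.linspace(0.0, 1.0, 11):
    assert F(lam * f + (1 - lam) * g) <= lam * F(f) + (1 - lam) * F(g) + 1e-12
```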
The simple observation that something is convex gives us access to some formidable insights. For example, suppose that we forgot the Hölder inequality but want to understand how $\|f\|_p$ behaves near a function $g$. When a convex function is differentiable, its first-order Taylor approximation gives us a lower bound. Since the $L^p$-norm is convex, let's go ahead and take a variational derivative, assuming $f, g \ge 0$ and $1 < p < \infty$ for simplicity. We find that
$$\frac{\delta}{\delta g}\,\|g\|_p = \frac{g^{p-1}}{\|g\|_p^{\,p-1}}.$$
Since the $L^p$-norm is homogeneous, the constant term of our first-order approximation ends up vanishing, leaving us with the linear lower bound
$$\|f\|_p \;\ge\; \int f \cdot \frac{g^{p-1}}{\|g\|_p^{\,p-1}}\,d\mu,$$
with equality at $f = g$. After replacing $g^{p-1}/\|g\|_p^{\,p-1}$ with a function $h$ (which has $\|h\|_q = 1$) and fiddling around with conjugate exponents, we will get the Hölder inequality.
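Here is the same linear lower bound checked numerically, using nonnegative random functions on a discrete uniform measure; the setup is only a sketch of the general statement:

```python
# Check the linear lower bound  ||f||_p >= ∫ f · g^(p-1) / ||g||_p^(p-1) dμ
# coming from the first-order expansion of the p-norm at g (with f, g >= 0),
# and that it holds with equality at f = g.
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 2.5

f = rng.random(n)
g = rng.random(n)

norm_p = lambda h: np.mean(h ** p) ** (1 / p)    # ||h||_p under uniform weights
grad = g ** (p - 1) / norm_p(g) ** (p - 1)       # variational derivative of ||·||_p at g

assert np.mean(f * grad) <= norm_p(f) + 1e-12    # the linear lower bound
assert np.isclose(np.mean(g * grad), norm_p(g))  # equality at f = g
```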
Now, "free energy" expressions of the sort come up a lot in Bayesian statistics. For example, when gives the joint log-likelihood of a fixed observation and a latent parameter is the log-likelihood that a distribution over latent parameters assigns to The likelihood value itself is known as the evidence for the distribution
In some sense, evaluating integrals like this one summarizes the fundamental difficulty of performing ideal Bayesian inference. Indeed, when $e^{f}$ is tightly concentrated at some value of $z$, describing how $\log \int e^{f(z)}\,dz$ varies under perturbations to $f$ ends up being as hard as finding this maximizer. If $z$ indexes over Turing machines or large neural networks, we have a problem.
Suppose we have a function $g$ that we can deal with. (For example, it could encode a simpler distribution over $z$, like a mean-field approximation does in statistical physics.) What can $g$ tell us about $f$? Since we know that $F(f) = \log \int e^{f(z)}\,dz$ is a convex function of $f$, a variational derivative at $g$ will produce a lower bound on $F(f)$ in terms of some information about $F$ near $g$. We compute that
$$\left.\frac{d}{dt}\right|_{t=0} F(g + t\,\varphi) = \frac{\int \varphi(z)\, e^{g(z)}\,dz}{\int e^{g(z)}\,dz} = \mathbb{E}_g[\varphi],$$
where $\mathbb{E}_g$ takes an expectation with respect to the Boltzmann distribution with density proportional to $e^{g}$. Where $\delta$ denotes a variational derivative, we could write
$$\delta F = \mathbb{E}_g[\delta g].$$
This justifies the claim we made a moment ago that, when the integrand $e^{g}$ is tightly concentrated, evaluating integrals like $\mathbb{E}_g[\varphi]$ amounts to knowing its maximizer. In general, knowing how the evidence varies under infinitesimal changes to the joint log-likelihood is the same as being able to compute expectations over the posterior distribution. Indeed, the exact value of the posterior is given by Bayes' theorem as
$$p(z \mid x) = \frac{p(x, z)}{p(x)} = \frac{e^{f(z)}}{\int e^{f(w)}\,dw}.$$
From this perspective, our variational expression for the posterior expectation corresponds to the fact that $\delta \log p(x) = \mathbb{E}_{p(z \mid x)}\!\left[\delta \log p(x, z)\right]$ for small changes $\delta \log p(x, z)$ to the joint log-likelihood.
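The derivative formula is easy to check by finite differences on a discrete latent space, where the Boltzmann/posterior weights are just a softmax; the random inputs below are only a sketch:

```python
# Check that the directional derivative of F(g) = log Σ_z e^{g(z)} at g in the
# direction φ equals the expectation of φ under the Boltzmann weights softmax(g).
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(3)
n = 200
g = rng.normal(size=n)
phi = rng.normal(size=n)                  # a perturbation δg

F = lambda h: logsumexp(h)
eps = 1e-6
finite_diff = (F(g + eps * phi) - F(g - eps * phi)) / (2 * eps)
boltzmann_expectation = softmax(g) @ phi  # E_g[φ]

assert np.isclose(finite_diff, boltzmann_expectation, atol=1e-6)
```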
Finally, notice that the variational derivative $\mathbb{E}_g$ depends on $g$ only up to an additive constant. That makes sense, since $F(g + c) = F(g) + c$ for any constant $c$. If we translate $g$ to be a normalized log-likelihood, meaning $\int e^{g(z)}\,dz = 1$, then $F(g) = 0$ and the first-order lower bound $F(f) \ge F(g) + \mathbb{E}_g[f - g]$ becomes
$$\log \int e^{f(z)}\,dz \;\ge\; \mathbb{E}_g[f - g].$$
This inequality is known in Bayesian statistics as the evidence lower bound, and its slack is the well-known Kullback–Leibler divergence $D_{\mathrm{KL}}\!\left(e^{g}\,\middle\|\,p(z \mid x)\right)$.
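On a discrete latent space the bookkeeping is transparent, so here is a small sketch confirming that the slack of the bound is exactly the KL divergence from $e^{g}$ to the posterior (the random inputs are, again, just for illustration):

```python
# Check that log-evidence = ELBO + KL(q || posterior) for f(z) = log p(x, z)
# on a discrete latent space, with q = e^g an arbitrary normalized distribution.
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(4)
n = 100
f = rng.normal(size=n)           # f(z) = log p(x, z), unnormalized
q = softmax(rng.normal(size=n))  # a normalized variational distribution
g = np.log(q)                    # its log-density, so Σ_z e^{g(z)} = 1

log_evidence = logsumexp(f)
elbo = q @ (f - g)               # E_g[f - g]
posterior = softmax(f)           # p(z | x) = e^f / Σ_z e^f
kl = q @ np.log(q / posterior)   # KL(q || posterior)

assert np.isclose(log_evidence, elbo + kl)
```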
It is straightforward to derive the evidence lower bound directly; since $e^{g}$ integrates to $1$, we can apply Jensen's inequality to the concavity of the logarithm to get
$$\log \int e^{f(z)}\,dz = \log \int e^{g(z)}\, e^{f(z) - g(z)}\,dz \;\ge\; \int e^{g(z)}\left(f(z) - g(z)\right)dz = \mathbb{E}_g[f - g].$$
However, it's interesting to see this inequality as resulting from the variational derivative of $\log \int e^{f(z)}\,dz$ with respect to $f$. (This is what is meant when we say that the ELBO follows from a "variational principle.") It's also very interesting to keep in mind that inequalities of the form $\log \int e^{f(z)}\,dz \ge \mathbb{E}_g[f - g]$ characterize the function $f \mapsto \log \int e^{f(z)}\,dz$: since it is convex, taking the supremum of these linear lower bounds over normalized $g$ recovers it exactly. Finally, observe that our whole discussion was driven by the simple idea that "something is convex" and that differentiating convex functions is a good idea. Nothing motivates inequalities quite like convexity!
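To close the loop on that characterization remark, in the same discrete setting one can check that the bound is attained exactly when $e^{g}$ is the posterior, so the supremum over normalized $g$ really does recover the log-evidence (an illustrative sketch):

```python
# The ELBO E_g[f - g] is maximized, with value log Σ_z e^{f(z)}, at e^g = softmax(f).
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(5)
f = rng.normal(size=50)
elbo = lambda q: q @ (f - np.log(q))  # E_g[f - g] with q = e^g

posterior = softmax(f)
other = softmax(rng.normal(size=50))

assert np.isclose(elbo(posterior), logsumexp(f))  # equality at the posterior
assert elbo(other) <= logsumexp(f) + 1e-12        # slack for any other q
```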