cgad.ski, 2023-11-14

Hölder's Inequality and the Evidence Lower Bound

For $1 < p \le q < \infty$ such that $1/p + 1/q = 1$, Hölder's inequality states that
$$\int f g \le \left( \int \lvert f \rvert^p \right)^{1/p} \left( \int \lvert g \rvert^q \right)^{1/q}$$
so long as $f$ and $g$ are measurable functions (over any measure space) and all three integrals above converge. Furthermore, this inequality is tight in the sense that, for any $f$, there exists some $g$ making it into a non-trivial equality.
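As a quick sanity check (my own addition, not part of the original statement), here is the inequality and its equality case on a finite measure space, with positive weights standing in for the measure; numpy and the particular random choices are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite measure space: n points with positive weights w_i standing in for the measure.
n = 1000
w = rng.exponential(size=n)
f = rng.normal(size=n)
g = rng.normal(size=n)

p = 3.0
q = p / (p - 1)  # conjugate exponent, so that 1/p + 1/q = 1

# Hölder: ∫ f g <= (∫ |f|^p)^(1/p) (∫ |g|^q)^(1/q)
lhs = np.sum(w * f * g)
rhs = np.sum(w * np.abs(f) ** p) ** (1 / p) * np.sum(w * np.abs(g) ** q) ** (1 / q)
assert lhs <= rhs + 1e-12

# Tightness: g = |f|^(p-1) sign(f) turns the bound into an equality.
g_star = np.abs(f) ** (p - 1) * np.sign(f)
lhs_star = np.sum(w * f * g_star)
rhs_star = np.sum(w * np.abs(f) ** p) ** (1 / p) * np.sum(w * np.abs(g_star) ** q) ** (1 / q)
assert np.isclose(lhs_star, rhs_star)
```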

When I was an undergrad, I was a little terrified of inequalities like this one. Their proofs draw on a standard "bag of tricks", but it's not always obvious why we should care about the inequalities themselves. Motivation is everything for me, so if I didn't understand why a result mattered, I would usually blank on the proof. What is Hölder's inequality really telling us?

One natural interpretation concerns the geometry of the $p$-norms
$$\lVert f \rVert_p = \left( \int \lvert f \rvert^p \right)^{1/p}$$
for $1 < p < \infty$. From this perspective, Hölder's inequality is giving us a family of linear lower bounds
$$\lVert f \rVert_p \ge \frac{\langle f, g \rangle}{\lVert g \rVert_q}$$
on the $p$-norm of $f$. This is meaningful from the point of view of convex geometry; indeed, convex homogeneous functions are precisely the ones that can be written as suprema of linear functions, and the statement that Hölder's inequality is tight amounts to exactly such a representation for the $p$-norm:
$$\lVert f \rVert_p = \sup_{g \neq 0} \frac{\langle f, g \rangle}{\lVert g \rVert_q}.$$

However, last week I was doing some Bayesian statistics and noticed that Hölder's inequality has another, entirely different interpretation in terms of convexity. If we set $a = 1/p$ and $b = 1/q$, our pair of conjugate exponents becomes a pair of coefficients for a convex combination, and Hölder's inequality amounts to the statement that the partition function-style integral $\int e^{H}$ is log-convex with respect to $H$. Indeed, a function $v$ is log-convex exactly when
$$v(ax + by) \le v(x)^a v(y)^b$$
for all convex combinations $ax + by$ in its domain, and applying Hölder's inequality with $p = 1/a$ and $q = 1/b$ to the integral of $e^{aH + bG} = e^{aH} e^{bG}$ gives
$$\int e^{aH} e^{bG} \le \left( \int e^{H} \right)^a \left( \int e^{G} \right)^b.$$

(As an aside, it is not true in general that $f^{-1}\left( \int f(H(x)) \, dx \right)$ is convex as a function of the function $H$ whenever $f \colon \mathbb{R} \to \mathbb{R}$ is convex and strictly increasing. See my stackexchange question. So, the convexity of the $p$-norms and the log-partition function does not hold for any simple "generic" reason.)
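To make the log-convexity reading concrete, here is a minimal numerical check of $v(aH + bG) \le a\, v(H) + b\, v(G)$ on a discrete measure space; numpy and the random choices below are my assumptions, not anything from the argument above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete stand-in for the measure space: v(H) = log( sum_i w_i exp(H_i) ).
n = 500
w = rng.exponential(size=n)
H = rng.normal(size=n)
G = rng.normal(size=n)

def v(h):
    return np.log(np.sum(w * np.exp(h)))

a, b = 0.3, 0.7  # a convex combination, i.e. p = 1/a and q = 1/b are conjugate
# Log-convexity of the partition function, equivalently Hölder's inequality:
assert v(a * H + b * G) <= a * v(H) + b * v(G) + 1e-12
```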

The simple observation that something is convex gives us access to some formidable insights. For example, suppose that we forgot the Hölder inequality but want to understand how $\lVert f \rVert_p$ behaves near a function $f$. When a convex function is differentiable, its first-order Taylor approximation gives us a lower bound. Since the $p$-norm is convex, let's go ahead and take a variational derivative, assuming $f > 0$ and $p > 1$ for simplicity. We find that
$$\left. \frac{d}{dt} \right|_{t = 0} \lVert f + t h \rVert_p = \frac{1}{p} \left( \int f^p \right)^{\frac{1}{p} - 1} \langle p f^{p - 1}, h \rangle.$$
Since the $p$-norm is homogeneous, the constant term of our first-order approximation ends up vanishing, leaving us with the linear lower bound
$$\lVert g \rVert_p \ge \left( \int f^p \right)^{\frac{1}{p} - 1} \langle f^{p - 1}, g \rangle$$
with equality at $f = g$. After replacing $f^{p - 1}$ with $f$ and fiddling around with conjugate exponents, we will get the Hölder inequality.
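Here is the same lower bound checked numerically (a sketch assuming numpy, a positive $f$, and a discrete measure): the linear functional coming from the derivative stays below the $p$-norm and touches it at $g = f$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
w = rng.exponential(size=n)        # measure weights
f = rng.uniform(0.1, 2.0, size=n)  # f > 0, as assumed in the derivation
p = 3.0

def norm_p(h):
    return np.sum(w * np.abs(h) ** p) ** (1 / p)

# The linear lower bound obtained from the first-order expansion at f:
#   ||g||_p >= (∫ f^p)^(1/p - 1) <f^(p-1), g>
def lower_bound(g):
    return np.sum(w * f ** p) ** (1 / p - 1) * np.sum(w * f ** (p - 1) * g)

g = rng.uniform(0.1, 2.0, size=n)
assert norm_p(g) >= lower_bound(g) - 1e-12    # holds for an arbitrary g
assert np.isclose(norm_p(f), lower_bound(f))  # equality at g = f
```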

Now, "free energy" expressions of the sort
$$v(H) = \ln \int e^{H}$$
come up a lot in Bayesian statistics. For example, when $H(m) = \ln p(x|m) + \ln p(m)$ gives the joint log-likelihood of a fixed observation $x$ and a latent parameter $m$, $v(H)$ is the log-likelihood that a distribution $p(m)$ over latent parameters assigns to $x$. The likelihood value itself is known as the evidence for the distribution $p(m)$.
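As an illustration, consider a toy conjugate Gaussian model (the model and numbers are my own choices for this sketch, and scipy is assumed): the evidence $v(H)$ computed by quadrature matches the closed form.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Toy conjugate model: m ~ N(0, 1), x | m ~ N(m, 1),
# so the evidence is p(x) = N(x; 0, 2) (a Gaussian with variance 2).
x = 1.3

def H(m):
    # Joint log-likelihood H(m) = ln p(x|m) + ln p(m) for the fixed observation x.
    return norm.logpdf(x, loc=m, scale=1.0) + norm.logpdf(m, loc=0.0, scale=1.0)

# v(H) = ln ∫ e^{H(m)} dm, evaluated by numerical quadrature.
v_H = np.log(quad(lambda m: np.exp(H(m)), -10, 10)[0])

# Closed-form log-evidence for comparison.
assert np.isclose(v_H, norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0)))
```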

In some sense, evaluating integrals like this one summarizes the fundamental difficulty of performing ideal Bayesian inference. Indeed, when $e^{H(m)}$ is tightly concentrated at some value of $m$, describing how $v(H)$ varies under perturbations to $H$ ends up being as hard as finding this maximizer. If $m$ indexes over Turing machines or large neural networks, we have a problem.

Suppose we have a function $G$ that we can deal with. (For example, it could encode a simpler distribution over $m$, like a mean-field approximation does in statistical physics.) What can $G$ tell us about $v(H) = \ln \int e^H$? Since we know that $v$ is a convex function, a variational derivative at $G$ will produce a lower bound on $v(H)$ in terms of some information about $v$ near $G$. We compute that
$$\left. \frac{d}{dt} \right|_{t = 0} \ln \int e^{H + t A} = \left( \int e^{H} \right)^{-1} \int A e^{H} = E_H(A),$$
where $E_H$ takes an expectation with respect to the Boltzmann distribution with density proportional to $e^{H}$. Where $d_H$ denotes a variational derivative, we could write
$$d_H v(H) = d_H \ln \int e^{H} = E_H(-). \tag{1}$$
This justifies the claim we made a moment ago that, when the integrand $e^H$ is tightly concentrated, evaluating integrals $v(G)$ for $G \approx H$ amounts to knowing its maximizer. In general, knowing how the evidence varies under infinitesimal changes to the joint log-likelihood $H$ is the same as being able to compute expectations over the posterior distribution. Indeed, the exact value of $E_H(f)$ is given by Bayes' theorem as
$$E(f(m)|x) = \frac{\int \exp(\ln f(m) + H(x, m)) \, dm}{\int \exp(H(x, m)) \, dm}.$$
From this perspective, our variational expression $(1)$ for the posterior expectation $E_H = E(-|x)$ corresponds to the fact that
$$\begin{aligned} E(f(m)|x) & \approx \ln \int \exp(f(m) + H(x, m)) \, dm - \ln \int \exp(H(x, m)) \, dm \\ & = \ln \frac{\int \exp(f(m) + H(x, m)) \, dm}{\int \exp(H(x, m)) \, dm} = \ln E(e^{f(m)}|x) \end{aligned}$$
for small $f$.
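Here is a small numerical check of $(1)$ (assuming numpy and a uniform discrete measure, both my own choices): a finite-difference derivative of $v$ in the direction $A$ agrees with the Boltzmann-weighted average of $A$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
H = rng.normal(size=n)
A = rng.normal(size=n)

def v(h):
    # Log-partition function over a uniform discrete measure.
    return np.log(np.mean(np.exp(h)))

# E_H(A): the average of A under the Boltzmann distribution with density ∝ e^H.
boltzmann_avg = np.sum(A * np.exp(H)) / np.sum(np.exp(H))

# Finite-difference approximation of (d/dt) v(H + t A) at t = 0.
t = 1e-6
finite_diff = (v(H + t * A) - v(H - t * A)) / (2 * t)

assert np.isclose(boltzmann_avg, finite_diff, atol=1e-5)
```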

Finally, notice that the variational derivative $d_H v(H) = E_H(-)$ depends on $H$ only up to a constant. That makes sense, since $v(H + c) = v(H) + c$. If we translate $H$ to be a normalized log-likelihood, meaning $v(H) = 0$, we obtain the lower bound
$$v(G) \ge E_H(G - H). \tag{2}$$
(Compared with our earlier setup, the letters have traded places: in $(2)$, $e^H$ is a normalized density that we can compute expectations against, while $G$ now plays the part of the joint log-likelihood whose evidence we want to bound.) This inequality is known in Bayesian statistics as the evidence lower bound, and its slack $v(G) - E_H(G - H) \ge 0$ is the well-known Kullback–Leibler divergence.
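To see $(2)$ and the Kullback–Leibler identity concretely, here is a discrete sketch (numpy and scipy assumed; the random $H$ and $G$ are arbitrary choices): the bound holds, and its slack equals the KL divergence computed directly.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(4)
n = 500

# H: a normalized log-density on a discrete space, so that v(H) = 0.
H = rng.normal(size=n)
H -= logsumexp(H)
G = rng.normal(size=n)

p = np.exp(H)               # the distribution we take expectations against
v_G = logsumexp(G)          # v(G) = ln ∑ e^G, the "evidence" of G
elbo = np.sum(p * (G - H))  # E_H(G - H)

assert elbo <= v_G + 1e-12  # the evidence lower bound (2)

# The slack is the KL divergence from e^H to the normalized density e^G / ∫ e^G.
q = np.exp(G - v_G)
kl = np.sum(p * (np.log(p) - np.log(q)))
assert np.isclose(v_G - elbo, kl)
```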

It is straightforward to derive the evidence lower bound directly; since $e^{H}$ integrates to $1$, we can apply Jensen's inequality to the concavity of the logarithm to get
$$\ln \int e^{G} = \ln \int e^{G - H} e^{H} \ge E_H(G - H).$$
However, it's interesting to see this inequality as resulting from the variational derivative of $v(H)$ with respect to $H$. (This is what is meant when we say that the ELBO follows from a "variational principle.") It's also very interesting to keep in mind that inequalities of the form $(2)$ characterize the function $v$. Finally, observe that our whole discussion was driven by the simple idea that "something is convex" and that differentiating convex functions is a good idea. Nothing motivates inequalities quite like convexity!
