← cgad.ski
2023-11-14

For $1 < p \le q < \infty$ such that $1/p + 1/q = 1,$ Hölder's inequality states that $\int f g \le \left( \int \lvert f \rvert^p \right)^{1/p} \left( \int \lvert g \rvert^q \right)^{1/q}$ so long as $f$ and $g$ are integrable functions (over any measure space) and all three integrals above converge. Furthermore, this inequality is tight in the sense that, for any $f,$ there exists some $g$ making it into a non-trivial equality.

When I was an undergrad, I was a little terrified of inequalities like this one. Their proofs draw on a standard "bag of tricks", but it's not always obvious why to care about the inequalities themselves. Motivation is everything for me, so if I didn't understand why to care I would usually blank on the proof. What is Hölder's inequality really telling us?

One natural interpretation concerns the geometry of the $p$-norms $\lVert f \rVert_p = \left( \int \lvert f \rvert ^p \right)^{1/p}$ for $1 < p < \infty.$ From this perspective, Hölder's inequality is giving us a family of linear lower bounds $\lVert f \rVert_p \ge \frac{\langle f, g \rangle }{\lVert g \rVert_q}$ on the $p$-norm of $f.$ This is meaningful from the point of view of convex geometry; indeed, convex homogeneous functions are precisely the ones that can be written as a supremum of linear functions, and the statement that Hölder's inequality is tight amounts to exactly such a representation for the $p$-norm: $\lVert f \rVert_p = \sup_{g \neq 0} \frac{\langle f, g \rangle}{\lVert g \rVert_q}.$ Conversely, we can rediscover Hölder's inequality by remembering that the $p$-norm is homogeneous and convex (since it's a norm) and wondering about its family of linear lower bounds.

However, last week I was doing some Bayesian statistics and noticed that Hölder's inequality has another *entirely different* interpretation in terms of convexity. If we set $a = 1/p$ and $b = 1/q,$ our pair of conjugate exponents become a pair of coefficients for a convex combination, and Hölder's inequality amounts to the statement that the partition function-style integral
$\int e^{H}$
is log-convex with respect to $H.$ Indeed, a function $v$ is log-convex exactly when
$v(ax + by) \ge v(x)^a v(y)^b$
for all convex combinations $ax + by$ in its domain, and applying Hölder's inequality with $p = 1/a$ and $q = 1/b$ to the integral of $e^{aH + bG} = e^{aH}e^{bG}$ gives
$\int e^{aH} e^{bG} \le \left( \int e^{H} \right)^a\left( \int e^{G} \right)^b.$
(As an aside, it is not true in general that
$f^{-1}\left( \int f(H(x)) \, dx \right)$
is convex as a function of the function $H$ whenever $f \colon \R \to \R$ is convex and strictly increasing. See my stackexchange question. So, the convexity of the $p$-norms and the log-partition function is not true for any simple "generic" reason.)

The simple observation that something is convex gives us access to some formidable insights. For example, suppose that we forgot the Hölder inequality but want to understand how $\lVert f \rVert_p$ behaves near a function $f.$ When a convex function is differentiable, its first-order Taylor approximation gives us a lower bound. Since the $p$-norm is convex, let's go ahead and take a variational derivative, assuming $f > 0$ and $p > 1$ for simplicity. We find that $\begin{align*} \frac{d}{dt}_{t = 0} \lVert f + t h \rVert_p & = \frac{1}{p} \left( \int f^p \right)^{\frac{1}{p} - 1} \langle p f^{p - 1}, h \rangle. \end{align*}$ Since the $p$-norm is homogeneous, the constant term of our first-order approximation ends up vanishing, leaving us with the linear lower bound $\lVert g \rVert_p \ge \left( \int f^p \right)^{\frac{1}{p} - 1} \langle f^{p - 1}, g \rangle$ with equality at $f = g.$ After replacing $f^{p - 1}$ with $f$ and fiddling around with conjugate exponents, we will get the Hölder inequality.

Now, integrals of the sort
$v(H) = \ln \int e^{H}$
come up a lot in Bayesian statistics. For example, when
$H(m) = \ln p(x|m) + \ln p(m),$
gives the joint log-likelihood of a fixed outcome $x$ and a latent parameter $m,$ $v(H)$ is the log-likelihood of the outcome $x.$ The likelihood value itself is known as the *evidence*.

In some sense, integrals like this encode the fundamental difficulty of performing ideal Bayesian inference. Indeed, when $e^H$ is tightly concentrated at some value of $m,$ evaluating nearby integrals ends up being as hard as finding this maximizer. If $m$ indexes over Turing machines—say, to approximate Solomonoff induction—we now have a problem.

What if we hypothesize that $H$ is close to some simpler function $G$ whose exponential integral we *can* evaluate? Since we know that the function $v(H) = \ln \int e^H$ is convex, a variational derivative will produce a lower bound on $v(H)$ in terms of some information about $v$ at $G.$ We compute that
$\begin{align*}
\frac{d}{dt}_{t = 0} \ln \int e^{H + t A} & = \left( \int e^{H} \right)^{-1} \int A e^{H} \\
& = E_H(A)
\end{align*}$
where $E_H$ takes an expectation with respect to the "Boltzmann distribution" whose density is proportional to $e^{H}.$ Where $d_H$ denotes a variational derivative, we could write
$d_H v(H) = d_H \ln \int e^{H} = E_H(-). \tag{1}$
This justifies the claim we made a moment ago that, when the integrand $e^H$ is tightly concentrated, evaluating integrals $v(G)$ for $G \approx H$ amounts to knowing its maximizer. In general, knowing how the evidence varies under infinitesimal changes to the joint log-likelihood $H$ is the same as being able to compute expectations over the posterior distribution. Indeed, the exact value of $E_H(f)$ is given by Bayes' theorem as
$E(f(m)|x) = \frac{\int \exp(\ln f(m) + H(x, m)) \, dm}{\int \exp(H(x, m)) \, dm}.$
From this perspective, our variational expression $(1)$ for the posterior expectation $E_H = E(-|x)$ corresponds to the fact that
$\begin{align*}
E(f(m)|x) & \approx \ln \int \exp(f(m) + H(x, m)) \, dm - \ln \int \exp(H(x, m)) \, dm \\
& = \ln \frac{\int \exp(f(m) + H(x, m)) \, dm}{\int \exp(H(x, m)) \, dm} = \ln E(e^{f(m)}|x)
\end{align*}$
for small $f.$

Finally, notice that the variational derivative $d_H v(H) = E_H(-)$ depends on $H$ only up to a constant. That makes sense, since $v(H + c) = v(H) + c.$ If we translate $H$ to be a normalized log-likelihood, meaning $v(H) = 0,$ we obtain the lower bound
$v(G) \ge E_H(G - H). \tag{2}$
This inequality is known in Bayesian statistics as the *evidence lower bound*, and its slack
$v(G) - E_H(G - H) \ge 0$
is a Kullback–Leibler divergence.

It is straightforward to derive the evidence lower bound directly; since $e^{H}$ integrates to $1,$ we can apply Jensen's inequality to the concavity of the logarithm to get $\ln \int e^{G} = \ln \int e^{G - H} e^{H} \ge E_H(G - H).$ However, it's interesting to see this inequality as resulting from the variational derivative of $v(H)$ with respect to $H.$ (This is what is meant when we say that the ELBO follows from a "variational principle.") It's also very interesting to keep in mind that inequalities of the form $(2)$ characterize the function $v.$ Finally, observe that our whole discussion was driven by the simple idea that "something is convex" and that differentiating convex functions is a good idea. Nothing motivates inequalities quite like convexity!