cgad.ski, 2023-11-14

Hölder's Inequality and the Evidence Lower Bound

For $1 < p \le q < \infty$ such that $1/p + 1/q = 1$, Hölder's inequality states that
$$\int f g \le \left( \int \lvert f \rvert^p \right)^{1/p} \left( \int \lvert g \rvert^q \right)^{1/q}$$
so long as $f$ and $g$ are measurable functions (over any measure space) and all three integrals above converge. Furthermore, this inequality is tight in the sense that, for any $f$, there exists some $g$ making it into a non-trivial equality.
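As a quick sanity check (my own addition, not part of the original statement), here is the inequality and its equality case on a finite measure space, with positive weights standing in for the measure; numpy and the particular random choices are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite measure space: n points with positive weights w_i standing in for the measure.
n = 1000
w = rng.exponential(size=n)
f = rng.normal(size=n)
g = rng.normal(size=n)

p = 3.0
q = p / (p - 1)  # conjugate exponent, so that 1/p + 1/q = 1

# Hölder: ∫ f g <= (∫ |f|^p)^(1/p) (∫ |g|^q)^(1/q)
lhs = np.sum(w * f * g)
rhs = np.sum(w * np.abs(f) ** p) ** (1 / p) * np.sum(w * np.abs(g) ** q) ** (1 / q)
assert lhs <= rhs + 1e-12

# Tightness: g = |f|^(p-1) sign(f) turns the bound into an equality.
g_star = np.abs(f) ** (p - 1) * np.sign(f)
lhs_star = np.sum(w * f * g_star)
rhs_star = np.sum(w * np.abs(f) ** p) ** (1 / p) * np.sum(w * np.abs(g_star) ** q) ** (1 / q)
assert np.isclose(lhs_star, rhs_star)
```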

When I was an undergrad, I was a little terrified of inequalities like this one. Their proofs draw on a standard "bag of tricks", but it's not always obvious why we should care about the inequalities themselves. Motivation is everything for me, so if I didn't understand why a result mattered, I would usually blank on the proof. What is Hölder's inequality really telling us?

One natural interpretation concerns the geometry of the $p$-norms
$$\lVert f \rVert_p = \left( \int \lvert f \rvert^p \right)^{1/p}$$
for $1 < p < \infty$. From this perspective, Hölder's inequality is giving us a family of linear lower bounds
$$\lVert f \rVert_p \ge \frac{\langle f, g \rangle}{\lVert g \rVert_q}$$
on the $p$-norm of $f$. This is meaningful from the point of view of convex geometry; indeed, convex homogeneous functions are precisely the ones that can be written as suprema of linear functions, and the statement that Hölder's inequality is tight amounts to exactly such a representation for the $p$-norm:
$$\lVert f \rVert_p = \sup_{g \neq 0} \frac{\langle f, g \rangle}{\lVert g \rVert_q}.$$

However, last week I was doing some Bayesian statistics and noticed that Hölder's inequality has another, entirely different interpretation in terms of convexity. If we set $a = 1/p$ and $b = 1/q$, our pair of conjugate exponents becomes a pair of coefficients for a convex combination, and Hölder's inequality amounts to the statement that the partition function-style integral $\int e^{H}$ is log-convex with respect to $H$. Indeed, a function $v$ is log-convex exactly when
$$v(ax + by) \le v(x)^a v(y)^b$$
for all convex combinations $ax + by$ in its domain, and applying Hölder's inequality with $p = 1/a$ and $q = 1/b$ to the integral of $e^{aH + bG} = e^{aH} e^{bG}$ gives
$$\int e^{aH} e^{bG} \le \left( \int e^{H} \right)^a \left( \int e^{G} \right)^b.$$

(As an aside, it is not true in general that $f^{-1}\left( \int f(H(x)) \, dx \right)$ is convex as a function of the function $H$ whenever $f \colon \mathbb{R} \to \mathbb{R}$ is convex and strictly increasing. See my stackexchange question. So, the convexity of the $p$-norms and the log-partition function does not hold for any simple "generic" reason.)
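To make the log-convexity reading concrete, here is a minimal numerical check of $v(aH + bG) \le a\, v(H) + b\, v(G)$ on a discrete measure space; numpy and the random choices below are my assumptions, not anything from the argument above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete stand-in for the measure space: v(H) = log( sum_i w_i exp(H_i) ).
n = 500
w = rng.exponential(size=n)
H = rng.normal(size=n)
G = rng.normal(size=n)

def v(h):
    return np.log(np.sum(w * np.exp(h)))

a, b = 0.3, 0.7  # a convex combination, i.e. p = 1/a and q = 1/b are conjugate
# Log-convexity of the partition function, equivalently Hölder's inequality:
assert v(a * H + b * G) <= a * v(H) + b * v(G) + 1e-12
```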

The simple observation that something is convex gives us access to some formidable insights. For example, suppose that we forgot the Hölder inequality but want to understand how $\lVert f \rVert_p$ behaves near a function $f$. When a convex function is differentiable, its first-order Taylor approximation gives us a lower bound. Since the $p$-norm is convex, let's go ahead and take a variational derivative, assuming $f > 0$ and $p > 1$ for simplicity. We find that
$$\left. \frac{d}{dt} \right|_{t = 0} \lVert f + t h \rVert_p = \frac{1}{p} \left( \int f^p \right)^{\frac{1}{p} - 1} \langle p f^{p - 1}, h \rangle.$$
Since the $p$-norm is homogeneous, the constant term of our first-order approximation ends up vanishing, leaving us with the linear lower bound
$$\lVert g \rVert_p \ge \left( \int f^p \right)^{\frac{1}{p} - 1} \langle f^{p - 1}, g \rangle$$
with equality at $f = g$. After replacing $f^{p - 1}$ with $f$ and fiddling around with conjugate exponents, we will get the Hölder inequality.
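Here is the same lower bound checked numerically (a sketch assuming numpy, a positive $f$, and a discrete measure): the linear functional coming from the derivative stays below the $p$-norm and touches it at $g = f$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
w = rng.exponential(size=n)        # measure weights
f = rng.uniform(0.1, 2.0, size=n)  # f > 0, as assumed in the derivation
p = 3.0

def norm_p(h):
    return np.sum(w * np.abs(h) ** p) ** (1 / p)

# The linear lower bound obtained from the first-order expansion at f:
#   ||g||_p >= (∫ f^p)^(1/p - 1) <f^(p-1), g>
def lower_bound(g):
    return np.sum(w * f ** p) ** (1 / p - 1) * np.sum(w * f ** (p - 1) * g)

g = rng.uniform(0.1, 2.0, size=n)
assert norm_p(g) >= lower_bound(g) - 1e-12    # holds for an arbitrary g
assert np.isclose(norm_p(f), lower_bound(f))  # equality at g = f
```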

Now, "free energy" expressions of the sort
$$v(H) = \ln \int e^{H}$$
come up a lot in Bayesian statistics. For example, when $H(m) = \ln p(x|m) + \ln p(m)$ gives the joint log-likelihood of a fixed observation $x$ and a latent parameter $m$, $v(H)$ is the log-likelihood that a distribution $p(m)$ over latent parameters assigns to $x$. The likelihood value itself is known as the evidence for the distribution $p(m)$.
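As an illustration, consider a toy conjugate Gaussian model (the model and numbers are my own choices for this sketch, and scipy is assumed): the evidence $v(H)$ computed by quadrature matches the closed form.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Toy conjugate model: m ~ N(0, 1), x | m ~ N(m, 1),
# so the evidence is p(x) = N(x; 0, 2) (a Gaussian with variance 2).
x = 1.3

def H(m):
    # Joint log-likelihood H(m) = ln p(x|m) + ln p(m) for the fixed observation x.
    return norm.logpdf(x, loc=m, scale=1.0) + norm.logpdf(m, loc=0.0, scale=1.0)

# v(H) = ln ∫ e^{H(m)} dm, evaluated by numerical quadrature.
v_H = np.log(quad(lambda m: np.exp(H(m)), -10, 10)[0])

# Closed-form log-evidence for comparison.
assert np.isclose(v_H, norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0)))
```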

In some sense, evaluating integrals like this one summarizes the fundamental difficulty of performing ideal Bayesian inference. Indeed, when $e^{H(m)}$ is tightly concentrated at some value of $m$, describing how $v(H)$ varies under perturbations to $H$ ends up being as hard as finding this maximizer. If $m$ indexes over Turing machines or large neural networks, we have a problem.

Suppose we have a function $G$ that we can deal with. (For example, it could encode a simpler distribution over $m$, like a mean-field approximation does in statistical physics.) What can $G$ tell us about $v(H) = \ln \int e^H$? Since we know that $v$ is a convex function, a variational derivative at $G$ will produce a lower bound on $v(H)$ in terms of some information about $v$ near $G$. We compute that
$$\left. \frac{d}{dt} \right|_{t = 0} \ln \int e^{H + t A} = \left( \int e^{H} \right)^{-1} \int A e^{H} = E_H(A),$$
where $E_H$ takes an expectation with respect to the Boltzmann distribution with density proportional to $e^{H}$. Where $d_H$ denotes a variational derivative, we could write
$$d_H v(H) = d_H \ln \int e^{H} = E_H(-). \tag{1}$$
This justifies the claim we made a moment ago that, when the integrand $e^H$ is tightly concentrated, evaluating integrals $v(G)$ for $G \approx H$ amounts to knowing its maximizer. In general, knowing how the evidence varies under infinitesimal changes to the joint log-likelihood $H$ is the same as being able to compute expectations over the posterior distribution. Indeed, the exact value of $E_H(f)$ is given by Bayes' theorem as
$$E(f(m)|x) = \frac{\int \exp(\ln f(m) + H(x, m)) \, dm}{\int \exp(H(x, m)) \, dm}.$$
From this perspective, our variational expression $(1)$ for the posterior expectation $E_H = E(-|x)$ corresponds to the fact that
$$\begin{aligned} E(f(m)|x) & \approx \ln \int \exp(f(m) + H(x, m)) \, dm - \ln \int \exp(H(x, m)) \, dm \\ & = \ln \frac{\int \exp(f(m) + H(x, m)) \, dm}{\int \exp(H(x, m)) \, dm} = \ln E(e^{f(m)}|x) \end{aligned}$$
for small $f$.
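Here is a small numerical check of $(1)$ (assuming numpy and a uniform discrete measure, both my own choices): a finite-difference derivative of $v$ in the direction $A$ agrees with the Boltzmann-weighted average of $A$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
H = rng.normal(size=n)
A = rng.normal(size=n)

def v(h):
    # Log-partition function over a uniform discrete measure.
    return np.log(np.mean(np.exp(h)))

# E_H(A): the average of A under the Boltzmann distribution with density ∝ e^H.
boltzmann_avg = np.sum(A * np.exp(H)) / np.sum(np.exp(H))

# Finite-difference approximation of (d/dt) v(H + t A) at t = 0.
t = 1e-6
finite_diff = (v(H + t * A) - v(H - t * A)) / (2 * t)

assert np.isclose(boltzmann_avg, finite_diff, atol=1e-5)
```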

Finally, notice that the variational derivative $d_H v(H) = E_H(-)$ depends on $H$ only up to a constant. That makes sense, since $v(H + c) = v(H) + c$. If we translate $H$ to be a normalized log-likelihood, meaning $v(H) = 0$, we obtain the lower bound
$$v(G) \ge E_H(G - H). \tag{2}$$
(Compared with our earlier setup, the letters have traded places: in $(2)$, $e^H$ is a normalized density that we can compute expectations against, while $G$ now plays the part of the joint log-likelihood whose evidence we want to bound.) This inequality is known in Bayesian statistics as the evidence lower bound, and its slack $v(G) - E_H(G - H) \ge 0$ is the well-known Kullback–Leibler divergence.
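To see $(2)$ and the Kullback–Leibler identity concretely, here is a discrete sketch (numpy and scipy assumed; the random $H$ and $G$ are arbitrary choices): the bound holds, and its slack equals the KL divergence computed directly.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(4)
n = 500

# H: a normalized log-density on a discrete space, so that v(H) = 0.
H = rng.normal(size=n)
H -= logsumexp(H)
G = rng.normal(size=n)

p = np.exp(H)               # the distribution we take expectations against
v_G = logsumexp(G)          # v(G) = ln ∑ e^G, the "evidence" of G
elbo = np.sum(p * (G - H))  # E_H(G - H)

assert elbo <= v_G + 1e-12  # the evidence lower bound (2)

# The slack is the KL divergence from e^H to the normalized density e^G / ∫ e^G.
q = np.exp(G - v_G)
kl = np.sum(p * (np.log(p) - np.log(q)))
assert np.isclose(v_G - elbo, kl)
```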

It is straightforward to derive the evidence lower bound directly; since $e^{H}$ integrates to $1$, we can apply Jensen's inequality to the concavity of the logarithm to get
$$\ln \int e^{G} = \ln \int e^{G - H} e^{H} \ge E_H(G - H).$$
However, it's interesting to see this inequality as resulting from the variational derivative of $v(H)$ with respect to $H$. (This is what is meant when we say that the ELBO follows from a "variational principle.") It's also very interesting to keep in mind that inequalities of the form $(2)$ characterize the function $v$. Finally, observe that our whole discussion was driven by the simple idea that "something is convex" and that differentiating convex functions is a good idea. Nothing motivates inequalities quite like convexity!
