2023-04-17

In my post last week on the policy gradient theorem, I mentioned that the matrix inverse has a nice derivative: where we denote $N(A) = A^{-1}$ and $dN(A, B) = \frac{d}{dt}\big|_{t = 0} N(A + t B),$ we have $dN(A, B) = -N(A) B N(A).$ After writing this, I wondered what other differential equations of this sort have solutions. For a simple example, consider $dN(A, B) = N(A)B.$ Setting $f(t) = N(tA)$ and differentiating in $t$ gives $\frac{d}{dt}f(t) = dN(tA, A) = N(t A) A = f(t) A$ which implies $N(A) = f(1) = C e^{A}$ for a choice of initial condition $f(0) = C.$ However, we know that $\frac{d}{dt}\big|_{t = 0} e^{A + t B} \neq e^A B$ in general for non-commuting $A$ and $B$. In fact, the only function $N$ solving this equation for $n \times n$ matrices with $n > 1$ turns out to be $N = 0$. How can we tell in advance when such a differential equation will have non-trivial solutions?
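The inverse-derivative formula is easy to sanity-check numerically. The following is a minimal sketch in plain Python, restricted to $2 \times 2$ matrices; the matrices `A` and `B` and the helper names are arbitrary illustrative choices of mine, not anything from the earlier post.

```python
# Finite-difference check of dN(A, B) = -N(A) B N(A) for N(A) = A^{-1}.
# A and B below are arbitrary illustrative 2x2 matrices (an assumption).

def matmul(P, Q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def shifted(A, B, t):
    """The matrix A + t B."""
    return [[A[i][j] + t * B[i][j] for j in range(2)] for i in range(2)]

def directional_derivative(N, A, B, h=1e-5):
    """Central-difference estimate of d/dt N(A + t B) at t = 0."""
    plus, minus = N(shifted(A, B, h)), N(shifted(A, B, -h))
    return [[(plus[i][j] - minus[i][j]) / (2 * h) for j in range(2)]
            for i in range(2)]

A = [[2.0, 1.0], [0.5, 3.0]]
B = [[0.0, 1.0], [-1.0, 0.5]]

numeric = directional_derivative(inverse, A, B)
N_A = inverse(A)
exact = [[-entry for entry in row] for row in matmul(matmul(N_A, B), N_A)]
max_error = max(abs(numeric[i][j] - exact[i][j])
                for i in range(2) for j in range(2))
print(max_error)  # should be tiny compared to the entries of `exact`
```

The same `directional_derivative` helper could be pointed at any other candidate $N$ to test a proposed formula for $dN(A, B)$.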

Let's consider this problem from the point of view of differential geometry. Let $f \colon \R^n \to \R^m$ be a smooth map and let $df$ be its total differential. We are curious to know when the equation $df(x) = g(f(x), x)$ admits local solutions. Clearly, making nice regularity assumptions about $g$—which we will make use of liberally in the following—solutions to this equation are unique over path-connected domains, should they exist. Indeed, taking some path $\gamma(t)$, the equation above gives us enough information to compute $\frac{d}{dt}f(\gamma(t)) = g(f(\gamma(t)), \gamma(t))(\dot \gamma(t)).$ Thus, subject to a choice for the value of $f$ at some point along $\gamma$, $f(\gamma(t))$ is determined as a unique solution to this ODE. On the other hand, it may happen that the values of $f$ we obtain by fixing its value at some point and solving ODEs along different paths are not path-independent, in which case our equation won't have a solution.

An intuition for differential geometry tells us that path-independence of this integral over a simply connected domain will come down to a system of equations involving the first partial derivatives of $g$. The nicest way to figure out exactly what these are is by using the Frobenius theorem. Let us introduce coordinates $(x_1, \dots, x_n, f_1, \dots, f_m)$ on $\R^n \times \R^m$ and define the vector fields $X_i = \frac{\partial}{\partial x_i} + g_{j,i} \frac{\partial}{\partial f_j},$ where $g_{j,i}$ denotes the component of $g$ prescribing $\partial f_j / \partial x_i$ and repeated indices are summed. It is fairly clear that a (local) solution to our equation is the same as an integral submanifold for the distribution spanned by $\{X_1, \ldots, X_n\}$. Furthermore, involutivity of our distribution in this case boils down to the equations $[X_i, X_j] = 0,$ for the simple reason that $[X_i, X_j]$ is, at each point, a linear combination of the tangent vectors $\partial / \partial f_k$, and our distribution admits no elements of this form except $0$.

These brackets are readily computed, taking a bit of care with indices of summation: $\begin{align*} [X_i, X_j] & = \left[ \frac{\partial}{\partial x_i} + g_{a,i} \frac{\partial}{\partial f_a}, \frac{\partial}{\partial x_j} + g_{b, j} \frac{\partial}{\partial f_b}\right] \\ & = \left[ \frac{\partial}{\partial x_i}, g_{b, j} \frac{\partial}{\partial f_b} \right] - \left[ \frac{\partial}{\partial x_j}, g_{a,i} \frac{\partial}{\partial f_a} \right] + \left[ g_{a,i} \frac{\partial}{\partial f_a}, g_{b,j} \frac{\partial}{\partial f_b} \right] \\ & = \left( \frac{\partial g_{k,j}}{\partial x_i} - \frac{\partial g_{k,i}}{\partial x_j} + g_{a,i} \frac{\partial g_{k,j}}{\partial f_a} - g_{a,j} \frac{\partial g_{k,i}}{\partial f_a} \right) \frac{\partial}{\partial f_k}. \end{align*}$ From this, we can (perhaps in a future post) understand something about what matrix operator differential equations can be integrated.
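As a sanity check on this formula, here is a small worked example; the particular $g$ is my own illustrative choice. Take $n = 2$ and $m = 1$, so that $g$ has the two components $g_{1,1}$ and $g_{1,2}$, and set $g_{1,1} = f_1$ and $g_{1,2} = x_1 f_1$. In other words, we ask for $\partial f_1 / \partial x_1 = f_1$ and $\partial f_1 / \partial x_2 = x_1 f_1$. Then $\begin{align*} [X_1, X_2] & = \left( \frac{\partial g_{1,2}}{\partial x_1} - \frac{\partial g_{1,1}}{\partial x_2} + g_{1,1} \frac{\partial g_{1,2}}{\partial f_1} - g_{1,2} \frac{\partial g_{1,1}}{\partial f_1} \right) \frac{\partial}{\partial f_1} \\ & = \left( f_1 - 0 + f_1 x_1 - x_1 f_1 \right) \frac{\partial}{\partial f_1} = f_1 \frac{\partial}{\partial f_1}, \end{align*}$ which vanishes only where $f_1 = 0$, so any solution must be $f_1 \equiv 0$. By contrast, the choice $g_{1,1} = g_{1,2} = f_1$ gives $[X_1, X_2] = (f_1 - f_1) \frac{\partial}{\partial f_1} = 0$, consistent with the family of solutions $f_1 = C e^{x_1 + x_2}$.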

This result is actually an exercise in Lee's *Introduction to Smooth Manifolds* in the chapter on the Frobenius theorem. However, if we forget to use the Frobenius theorem—as I did, when I considered this question a few days ago—we can also discover the utility of Lie brackets for ourselves.

Suppose for simplicity that $f$ is real-valued, and consider an equation of the simpler type $df(x) = g(x) = g_i(x) d x_i.$ (Here, $g$ is a differential form.) If a solution exists, we can recover it by integrating $g$ over paths. Furthermore, path-independence of our integral is the same as saying that it vanishes over loops.

When do our loop integrals vanish? Stokes' theorem gives the answer: when our domain is simply connected, every loop is the boundary of a disk, a disk can be partitioned into little tiny subregions, and integrals over big loops are sums of many integrals over little tiny loops. So for our equation to be integrable, we just have to check that the $2$-form telling us the integral of $g$ over little tiny loops (its exterior derivative) vanishes.

Let's recall the usual way these little-tiny-loop-integrals are computed. Work near the origin of $\R^n$ and integrate $g$ over loops $\square_{i, j}(\epsilon)$ traversing the points $(0, \epsilon e_i, \epsilon (e_i + e_j), \epsilon e_j, 0)$ for two basis vectors $e_i$ and $e_j$. We have $\begin{align*} \int_{\square_{i, j}(\epsilon)} g & = \int_0^\epsilon g_i(t e_i) + g_j(\epsilon e_i + t e_j) - g_i(t e_i + \epsilon e_j) - g_j(t e_j) \, dt. \end{align*}$ By making substitutions of the form $g_i(x + \epsilon e_j) - g_i(x) = \int_0^\epsilon \frac{\partial g_i}{\partial x_j}(x + s e_j) \, ds$ within the integral, we get the Stokes-type formula $\begin{align*} \int_{\square_{i,j}(\epsilon)} g & = \int_0^\epsilon \int_0^\epsilon \frac{\partial g_j}{\partial x_i}(t e_j + s e_i) - \frac{\partial g_i}{\partial x_j}(t e_i + s e_j) \, ds\,dt \\ & = \int_0^\epsilon \int_0^\epsilon \left(\frac{\partial g_j}{\partial x_i} - \frac{\partial g_i}{\partial x_j}\right)(t e_i + s e_j) \, ds\,dt. \end{align*}$ In particular, the approximation for small $\epsilon$ is $\int_{\square_{i,j}(\epsilon)} g = \epsilon^2 \left(\frac{\partial g_j}{\partial x_i} - \frac{\partial g_i}{\partial x_j} \right)(0) + O(\epsilon^3).$ The scalars $\partial g_j / \partial x_i - \partial g_i / \partial x_j$ are exactly the coefficients of the exterior derivative $dg$.
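The $\epsilon^2$ approximation can be checked numerically. Below is a minimal sketch in plain Python for a hypothetical $1$-form on $\R^2$, namely $g = -x_2 \, dx_1 + (x_1 + x_1 x_2) \, dx_2$, which I picked only for illustration. For this $g$ we have $\partial g_2 / \partial x_1 - \partial g_1 / \partial x_2 = 2 + x_2$, so the loop integral divided by $\epsilon^2$ should approach $2$.

```python
def g(x, y):
    """Components (g_1, g_2) of a hypothetical 1-form g = g_1 dx_1 + g_2 dx_2."""
    return (-y, x + x * y)

def integrate(func, a, b, steps=1000):
    """Midpoint rule for the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (k + 0.5) * h) for k in range(steps))

def square_loop_integral(eps):
    """Integral of g over the loop 0 -> eps e_1 -> eps (e_1 + e_2) -> eps e_2 -> 0."""
    def integrand(t):
        return (g(t, 0.0)[0]      # bottom edge, moving in the +e_1 direction
                + g(eps, t)[1]    # right edge, moving in the +e_2 direction
                - g(t, eps)[0]    # top edge, moving in the -e_1 direction
                - g(0.0, t)[1])   # left edge, moving in the -e_2 direction
    return integrate(integrand, 0.0, eps)

# (d g_2 / d x_1 - d g_1 / d x_2)(0) = (1 + x_2) + 1 evaluated at x_2 = 0
curl_at_origin = 2.0

ratios = {eps: square_loop_integral(eps) / eps**2 for eps in (1e-1, 1e-2, 1e-3)}
for eps, ratio in ratios.items():
    print(eps, ratio)  # the ratio should drift toward curl_at_origin as eps shrinks
```

The gap between each ratio and `curl_at_origin` shrinks linearly in $\epsilon$, which is what the $O(\epsilon^3)$ remainder predicts.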

Can we do the same thing for a system of equations $d f_i(x) = g_i(f(x), x) = g_{i, j}(f(x), x) d x_j?$ Path integrals of a sort can still be defined; for any path $\gamma$ parameterized on $[0, 1]$ and initial value $f_0 \in \R^m$, define $G_{\gamma}(f_0) \in \R^m$ to equal $p(1)$ where $p \colon [0, 1] \to \R^m$ solves the ODE $\begin{cases} p(0) = f_0 \\ \dot p_i(t) = g_{i,j}(p(t), \gamma(t)) \dot \gamma_j(t). \end{cases}$ Figuring out a Stokes' theorem in this situation will be considerably more confusing, since a path integral over an infinitesimal loop is now a vector field on the codomain of $f$. Nevertheless, we have the inkling that taking the limit $\lim_{ \epsilon \to 0 } \frac{1}{\epsilon^2}\left(G_{\square_{i, j}(\epsilon)}(v) - v\right)$ should give us some functions playing a role similar to the one played above by the coefficients of the exterior derivative. Unfortunately, computing this limit seems pretty hopeless without either divine intuition or hard work.

One easy way out is to forget about path integrals and instead use the symmetry of partial derivatives $\frac{\partial^2 f_k}{\partial x_i \partial x_j} = \frac{\partial^2 f_k}{\partial x_j \partial x_i}$ of a prospective solution $f$. Taking partial derivatives with respect to $x_i$ of $\frac{\partial f_k}{\partial x_j} = g_{k, j}(f(x), x)$ gives $\frac{\partial^2 f_k}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_i} g_{k, j}(f(x), x) = \frac{\partial g_{k, j}}{\partial x_i} + g_{a,i} \frac{\partial g_{k, j}}{\partial f_a},$ from which we conclude that $\frac{\partial g_{k, j}}{\partial x_i} + g_{a,i} \frac{\partial g_{k, j}}{\partial f_a} - \frac{\partial g_{k, i}}{\partial x_j} - g_{a,j} \frac{\partial g_{k, i}}{\partial f_a} = 0.$ Actually, these functions are exactly what we would find by computing the approximation to $G_{\square_{i, j}(\epsilon)}(v)$, as we will see next. The "divine intuition" we will need is to relate the flow of vector fields to the Lie bracket.
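To connect this back to the matrix equation $dN(A, B) = N(A) B$ from the introduction, here is a sketch of the same symmetry argument in coordinate-free form. This is my own reading of how the condition applies, assuming $N$ is twice differentiable. Differentiating $dN(A, B_2) = N(A) B_2$ in the direction $B_1$ gives $d^2 N(A)(B_1, B_2) = dN(A, B_1) B_2 = N(A) B_1 B_2.$ Symmetry of the second derivative in $(B_1, B_2)$ then forces $N(A) (B_1 B_2 - B_2 B_1) = 0$ for all $B_1$ and $B_2$. Writing $E_{ij}$ for the matrix with a $1$ in position $(i, j)$ and zeros elsewhere, we have $[E_{ij}, E_{jj}] = E_{ij}$ whenever $i \neq j$, so $N(A) E_{ij} = 0$. Since $N(A) E_{ij}$ is the matrix whose $j$-th column is the $i$-th column of $N(A)$, and for $n > 1$ every $i$ admits some $j \neq i$, every column of $N(A)$ vanishes. This agrees with the claim that only $N = 0$ solves the equation when $n > 1$.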

One of the many fun ways you get Lie brackets to show up is by developing a commutator product of formal exponential series in two non-commuting variables $X$ and $Y$: $e^{\epsilon X} e^{\epsilon Y} e^{-\epsilon X} e^{-\epsilon Y} = I + [X, Y] \epsilon^2 + O(\epsilon^3).$ The operator sending vector fields to their flows can also be dealt with formally as an exponential map. Specifically, when $\phi_{\epsilon X}(p)$ denotes the image of $p \in \R^n$ under the flow of $X$ for time $\epsilon$, the estimate $\phi_{-\epsilon Y} \circ \phi_{-\epsilon X} \circ \phi_{\epsilon Y} \circ \phi_{\epsilon X}(p) = p + \epsilon^2 [X, Y](p) + O(\epsilon^3)$ can be proven from the formal calculation above by looking at things from the right angle. The key observation is that, if $p(t)$ is an integral curve of $X$, then for any smooth function $f$ we have $\frac{d}{dt} f(p(t)) = df(X(p(t))) = X(f)(p(t)).$ Iterating then gives $\left(\frac{d^k}{dt^k}\right)_{t = 0} f(p(t)) = X^k(f)(p_0),$ where $p_0 = p(0)$. Using the suggestive notation $e^{\epsilon X}(f) = f \circ \phi_{\epsilon X}$, we conclude that $\sum_{k = 0}^\infty \frac{\epsilon^k}{k!} X^k(f)$ gives the right power series for $e^{\epsilon X}$. More precisely, we mean that evaluating this series at any point $p$ encodes the higher derivatives of $e^{\epsilon X}(f)(p) = f(\phi_{\epsilon X}(p))$ with respect to $\epsilon$. Furthermore, a bit of thought reveals that this correspondence from families of smooth maps to formal power series of differential operators (really, asymptotic series) is an antihomomorphism for composition. This curious observation simplifies the derivation of many series approximations involving compositions of flows; for example, $\begin{align*} f(\phi_{\epsilon X}(\phi_{\epsilon Y}(p))) & = (e^{\epsilon Y} e^{\epsilon X})(f)(p) \\ & = \left( 1 + (X + Y)\epsilon + \left(\frac{X^2 + Y^2}{2} + YX \right) \epsilon^2 + \dots\right)(f)(p). \end{align*}$

With this result in hand, let's return to our equation $d f_i(x) = g_i(f(x), x) = g_{i, j}(f(x), x) d x_j.$ The vector fields $X_i = \frac{\partial}{\partial x_i} + g_{j,i} \frac{\partial}{\partial f_j}$ that we defined above have another use: their flows give "path integrals" along coordinate axes. In particular, writing $v$ for the projection of a point or tangent vector of $\R^n \times \R^m$ onto its $f$-coordinates, $\begin{align*} G_{\square_{i, j}(\epsilon)}(v_0) & = v(\phi_{-\epsilon X_j} \circ \phi_{-\epsilon X_i} \circ \phi_{\epsilon X_j} \circ \phi_{\epsilon X_i}(0, v_0)) \\ & = v_0 + \epsilon^2 v([X_i, X_j](0, v_0)) + O(\epsilon^3). \end{align*}$ For me, this is a nice way to see why Lie brackets express the integrability conditions of our differential equation so well.
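This estimate can also be tested numerically. The sketch below, in plain Python, uses a hypothetical scalar system on $\R^2$ that I chose for illustration: $\partial f / \partial x_1 = f$ and $\partial f / \partial x_2 = x_1 f$. Plugging this $g$ into the bracket formula gives the coefficient $f$ for $[X_1, X_2]$, so transporting $v_0$ around $\square_{1,2}(\epsilon)$ should change it by roughly $\epsilon^2 v_0$. Solving the four edge ODEs by hand gives the exact value $v_0 e^{\epsilon^2}$, which the code uses as a second check.

```python
import math

def g(f, x):
    """Hypothetical right-hand side (g_1, g_2) of df = g_1 dx_1 + g_2 dx_2."""
    return (f, x[0] * f)

def transport(f0, start, end, steps=200):
    """RK4 solution of p' = g(p, gamma) . gamma' along the straight segment
    gamma(t) = start + t (end - start), t in [0, 1], starting from p(0) = f0."""
    dx = (end[0] - start[0], end[1] - start[1])

    def rhs(t, p):
        x = (start[0] + t * dx[0], start[1] + t * dx[1])
        g1, g2 = g(p, x)
        return g1 * dx[0] + g2 * dx[1]

    h = 1.0 / steps
    p = f0
    for k in range(steps):
        t = k * h
        k1 = rhs(t, p)
        k2 = rhs(t + h / 2, p + h / 2 * k1)
        k3 = rhs(t + h / 2, p + h / 2 * k2)
        k4 = rhs(t + h, p + h * k3)
        p += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return p

def loop_transport(f0, eps):
    """Transport f0 around the square 0 -> eps e_1 -> eps (e_1 + e_2) -> eps e_2 -> 0."""
    corners = [(0.0, 0.0), (eps, 0.0), (eps, eps), (0.0, eps), (0.0, 0.0)]
    p = f0
    for a, b in zip(corners, corners[1:]):
        p = transport(p, a, b)
    return p

v0 = 1.5
eps = 1e-2
ratio = (loop_transport(v0, eps) - v0) / eps**2
print(ratio)  # should be close to the bracket coefficient at (0, v0), which is v0
```

Swapping in $g_2 = f$ makes the bracket vanish, and the same script should then report a ratio near zero.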