← cgad.ski 2024-02-29

Where Is Noether's Principle in Machine Learning?

This post accompanies a poster to be presented at the MLSS 2024.

Physics likes optimization! Subject to its boundary conditions, the time evolution of a physical system is a critical point for a quantity called an action. This point of view sets the stage for Noether's principle, a remarkable correspondence between continuous invariances of the action and conservation laws of the system.

In machine learning, we often deal with discrete "processes" whose control parameters are chosen to minimize some quantity. For example, we can see a deep residual network as a process where the role of "time" is played by depth. We may ask:

  1. Does Noether's theorem apply to these processes?
  2. Can we find meaningful conserved quantities?

Our answers: "yes," and "not sure!"

Noether's Principle in Physics

In 1630, Fermat observed that the trajectory a beam of light takes through a lens is a path of least time. Formally, where v(q)v(q) gives the velocity of light at a point q,q, a trajectory q(t)q(t) taken by a beam of light between two points q(t0)q(t_0) and q(t1)q(t_1) will be a stationary point for S[q]=t0t11v(q)q˙dt,S[q] = \int_{t_0}^{t_1} \frac{1}{v(q)} \lVert \dot q \rVert \, dt, in the sense that ddϵS[q+ϵh]=0\frac{d}{d \epsilon} S[q + \epsilon h] = 0 for any "perturbation" hh verifying h(t0)=h(t1)=0.h(t_0) = h(t_1) = 0. Furthermore, this condition turns out to characterize the paths that light can take.

In fact, all fundamental physical theories can be written in the form δS=0\delta S = 0 for a suitable "action" S,S, where δ\delta denotes a (variational) derivative with respect to the process trajectory. Such an equation is called a stationary action principle.

Given a physical system, a function F(q,q˙,t,)F(q, \dot q, t, \dots) of the system state is called a conserved quantity when its value is constant over any given physical trajectory. For dynamics expressed by a stationary action principle, Noether's principle gives a remarkable correspondence between conserved quantities and certain invariances of the action S.S.

For the purposes of this article, we will not give more details on Noether's principle. The interested reader is invited to seek out Chapter 4 of Arnold's Mathematical Methods of Classical Mechanics for a concise description from the point of view of Lagrangian mechanics. (For a much more complete reference, including the more subtle "off-shell" version applicable to local gauge transformations, see Chapter 5 of Olver's Application of Lie Groups to Differential Equations.) However, as a guide for the imagination, we illustrate two examples.

In physics, the famous two-body problem is described by the action S=12(m1q˙12+m2q˙22)m1m2Gq1q2dt,S = \int \frac{1}{2} (m_1 \dot q_1^2 + m_2 \dot q_2^2) - \frac{m_1 m_2 G}{\lVert q_1 - q_2 \rVert} \, dt, where q1q_1 and q2q_2 are the positions of our two bodies and m1m_1 and m2m_2 are their masses. The integrand does not change if the trajectories q1(t)q_1(t) and q2(t)q_2(t) are translated or rotated, so these transformations are invariants for S.S. Applying Noether's theorem gives us three conserved quantities—one for each degree of freedom in our group of transformations—which turn out to be horizontal, vertical, and angular momentum.

In the illustration above, we consider two bodies with equal mass. Conservation of linear momentum means that the sum of the arrows is constant, while conservation of angular momentum means that the sum of highlighted areas is constant.

For a less familiar example of a conservation law, consider a ray of light passing through a rotationally symmetric lens. Rotating a trajectory about the center of the lens does not affect the time needed to traverse it. What conserved quantity does this symmetry produce?

At each instant, let pp be a vector pointing in the direction of q˙\dot q with norm equal to 1/v(q).1/v(q). (We've drawn this vector in the widget above.) This vector can be seen as the gradient of SS with respect to q;q; it gives the marginal price we would pay to move qq in a certain direction at a particular instant. Noether's principle tells us that p(qc)p \wedge (q - c) is conserved, where cc is the center of the lens. In our illustration, this conserved quantity is represented by the area of the triangle.

Noether's Principle in ML: A Toy Example

In machine learning, we routinely deal with processes whose control parameters are chosen to minimize some quantity. For example, say we are trying to solve a supervised learning problem by choosing parameters (θ1,,θN)(\theta_1, \dots, \theta_N) so that a composition F=φN(θN)φ2(θ2)φ1(θ1)F = \varphi_N(\theta_N) \circ \dots \circ \varphi_2(\theta_2) \circ \varphi_1(\theta_1) minimizes some expected loss E[L(F(x),y)]E[L(F(x), y)] with respect to some random variables xx and y.y.

Let's view the sequence of random variables x0=xx1=φ1(θ1)(x0)xN=φN(θN)(xN1)\begin{align*} x_0 &= x \\ x_1 &= \varphi_1(\theta_1)(x_0) \\ & \vdots \\ x_N &= \varphi_N(\theta_N)(x_{N - 1}) \end{align*} as the states of a discrete-time "process." After optimization, the trajectory (x0,,xN)(x_0, \dots, x_N) is approximately a critical point for a loss function. As we saw above, physical trajectories are also characterized as critical points of certain functions. Can we think about our trajectory of intermediate values xix_i from a physical point of view?

Of course, several things are different. Most obviously, "time" is now discrete rather than continuous. Furthermore, our trajectory of intermediate values is constrained in a different way. In physics, we specified boundary conditions and allowed the intermediate states of the process to vary, but in machine learning we specify an initial condition and constrain the trajectory (x0,,xN)(x_0, \dots, x_N) to be driven by some parameters θi.\theta_i. Fortunately, Noether's principle continues to apply!

To see how, let's consider a toy example that's easy to visualize. First, we'll define a family of "deformations" of the plane, as depicted below. (Click the screen or drag the sldier to change the parameters of the deformation.)

Next, we'll set up some simple optimization problems. In each case, we'll try to find a sequence of small deformations that work together to send some "clusters" to their corresponding "destinations." This is easy to do by initializing randomly and optimizing our parameters with gradient descent. In the following widget, a composition of 50 maps is able to swap the positions of three clusters.

Now, observe that our family F\Fc of deformations is invariant under conjugation by translations and rotations. This means that, where QQ is a translation or rotation, we have an equality F={QfQ1fF}\Fc = \{ Q \circ f \circ Q^{-1} \mid f \in \Fc \} of sets. In other words, our family F\Fc "looks the same" from a translated or rotated coordinate system. Moreover, conjugating a map ff by a transformation QQ near the identity can be realized by making a small change to the parameters that define f.f. As it turns out, this is the right notion of invariance to apply Noether's principle. What conserved quantity do we get?

For each data point, let (gi)i=1N(g_i)_{i = 1}^N be the backpropagated gradients of the loss function with respect to the intermediate positions (xi)i=1N(x_i)_{i = 1}^N that this data point takes through our "process." Applying a certain discrete Noether's principle, detailed at the end of this post, will turn invariance of F\Fc under translations into conservation of the expected value E[gi]E[g_i] from one layer to the next. (You can display the individual vectors gig_i in the widget above.) Invariance under rotation gives a conserved quantity reminiscent of angular momentum, namely E[gixi].E[g_i \wedge x_i].

In this particular example, it turns out that E[gi]E[g_i] is also approximately conserved (and nearly zero) over each individual cluster. In the next widget, we add some examples where the per-cluster gradient is not conserved. (Each cluster now has its average gradient drawn with a single arrow. For simplicity, we are not visualizing "angular gradient.")

In the physical world, interactions are often accompanied by transfers of conserved quantities. Think, for example, of particles in a gas exchanging momentum and energy. Conversely, if two systems are not interacting with one another, then conservation laws will hold for each system in isolation. In mechanics, this is known as Newton's first law.

In our first toy example, the subgoals of bringing each cluster to its respective destination could be achieved without compromise between clusters, meaning that each cluster acted like an independent subsystem. This lack of interaction explains the conservation of gradient within each cluster. (In other words, we can apply Newton's first law.) In our next examples, "transfers" of average gradient indicate "interactions" between clusters, meaning that the trajectory of our process is no longer a stationary point for each subgoal considered in isolation. On the other hand, we can sometimes identify subset of clusters over which gradient is approximately conserved.

In general, if the layers of a deep model have meaningful "Noether equivariances," then we can build conserved quantities and perhaps study a physically-inspired notion of "interaction." However, I don't know if this idea works in practice! Can you think of interesting equivariances that we might find on real-world machine learning models? (Keep in mind that they might only be valid near the actual parameters of the model.)

Proof of a Discrete Noether's Principle

In the above, we've been intentionally vague about the details of Noether's principle and how it applied to our toy example. We end this post with a general proof of our "discrete" Noether's principle.

We'll focus on a single step of an "optimized process." Let φ ⁣:Θ×XY\varphi \colon \Theta \times X \to Y describe some family of functions parameterized over Θ,\Theta, and define the action S=gx,xgy,φθ(x)yS = \langle g_x, x \rangle - \langle g_y, \varphi_\theta(x) - y \rangle where xx and yy are understood to be random variables, gxg_x and gyg_y are their associated loss gradients, and inner products are taken in expectation. In the typical context of supervised learning, φ\varphi will parameterize a layer of a deep model, and gx,x\langle g_x, x \rangle (for example) stands for a sum 1ni=1ngx(i),x(i)\frac{1}{n} \sum_{i = 1}^n \langle g_x(i), x(i) \rangle where ii indexes over a dataset and (x(i),gx(i))(x(i), g_x(i)) are activation and gradient vectors for the iith data point.

The "forward pass" y=φθ(x)y = \varphi_\theta(x) and "backward pass" gx=(φ/x)Tgyg_x = (\partial \varphi / \partial x)^T g_y are encoded by the stationarity of SS with respect to gyg_y and x,x, since Sgy=φθ(x)y,Sx=gx(φx)Tgy.\frac{\partial S}{\partial g_y} = \varphi_\theta(x) - y, \quad \frac{\partial S}{\partial x} = g_x - \left( \frac{\partial \varphi}{\partial x}\right)^T g_y. Similarly, first-order optimality of the parameter θ\theta can be expressed as Sθ=0.\frac{\partial S}{\partial \theta} = 0. In practice, this will hold only approximately, so our conservation law will also be approximate. We ignore this detail for simplicity.

Now, suppose that (x,y,θ)(x, y, \theta) are made to depend on a real-valued parameter ϵ\epsilon and that ϵgy,φθ(x)y=0.\begin{align*} \frac{\partial}{\partial \epsilon} \langle g_y, \varphi_\theta(x) - y \rangle = 0.\tag{1} \end{align*} For example, when we have a Lie group GG acting on X,X, Θ\Theta and YY such that φgθ(gx)=gφθ(x),\varphi_{g \cdot \theta}(g \cdot x) = g \cdot \varphi_\theta(x), then any one-parameter subgroup of GG gives rise to such a variation of (x,y,θ).(x, y, \theta). This is the case for our toy example, where GG is the group SE(2)\operatorname{SE}(2) of translations and rotations of the plane. I'm calling this kind of symmetry a Noether equivariance.

Let's compute S/ϵ.\partial S / \partial \epsilon. Applying (1),(1), we find Sϵ=gx,xϵ.\frac{\partial S}{\partial \epsilon} = \left\langle g_x, \frac{\partial x}{\partial \epsilon} \right\rangle. On the other hand, if SS is stationary with respect to xx and θ,\theta, then S/ϵ\partial S / \partial \epsilon can be explained entirely by the variation of y,y, which is Sϵ=Syyϵ=gy,yϵ.\frac{\partial S}{\partial \epsilon} = \frac{\partial S}{\partial y} \frac{\partial y}{\partial \epsilon} = \left\langle g_y, \frac{\partial y}{\partial \epsilon} \right\rangle. Overall, we conclude that gx,xϵ=gy,yϵ.\left\langle g_x, \frac{\partial x}{\partial \epsilon} \right\rangle = \left\langle g_y, \frac{\partial y}{\partial \epsilon} \right\rangle. This is quite analogous to the conservation law ddtp,qϵ=0\frac{d}{dt} \left\langle p, \frac{\partial q}{\partial \epsilon} \right\rangle = 0 that Noether's principle implies for a Hamiltonian system invariant under a family of canonical transformations.

One simple kind of Noether equivariance is a relationship like φgθ(x)=gφθ(x),\varphi_{g \cdot \theta}(x) = g \cdot\varphi_\theta(x), meaning that our family of maps is expressive enough to be closed under the action of GG on its outputs. In this case, since the action on xx is trivial, x/ϵ=0\partial x / \partial \epsilon = 0 and the "conservation law" associated with any one-parameter subgroup of GG simply reads 0=gy,yϵ.0 = \left\langle g_y, \frac{\partial y}{\partial \epsilon} \right\rangle. Indeed, this inner product is exactly the derivative of the loss with respect to the transformation of yy by a given one-parameter subgroup of G.G. If variations to θ\theta can produce these transformations and θ\theta is chosen optimally, the inner product must be 0.0. (Informally, this explains why the average gradients in our toy example are approximately 0.0.)

The same idea applies to a group acting only on xx and θ,\theta, in which case we get gx,xϵ=0.\left\langle g_x, \frac{\partial x}{\partial \epsilon} \right\rangle = 0. For example, if xRnx \in \R^n is the input to a linear layer, we get a "Noether equivariance" in the form of an action of GL(n)\GL(n) on xx and the layer parameters. Letting x/ϵ\partial x / \partial \epsilon range over all linear functions of xx in the equation above—which we get from the action of one-parameter subgroups of GL(n)\GL(n)—shows that the tensor product gxxg_x \otimes x will vanish in expectation if the loss is stationary with respect to the parameters of the linear layer. The general principle we proved above can be understood as the "two-sided" extension of these kinds of observations.

← cgad.ski