This post accompanies a poster presented at MLSS 2024.
Physics likes optimization! Subject to its boundary conditions, the time evolution of a physical system is a critical point for a quantity called an action. This point of view sets the stage for Noether's principle, a remarkable correspondence between continuous invariances of the action and conservation laws of the system.
In machine learning, we often deal with discrete "processes" whose control parameters are chosen to minimize some quantity. For example, we can see a deep residual network as a process where the role of "time" is played by depth. We may ask:
Does Noether's theorem apply to these processes?
Can we find meaningful conserved quantities?
Our answers: "yes," and "not sure!"
In 1662, Fermat observed that the trajectory a beam of light takes through a lens is a path of least time. Formally, where $v(x)$ gives the velocity of light at a point $x$, a trajectory $\gamma$, parameterized on $[0,1]$ and taken by a beam of light between two points $A = \gamma(0)$ and $B = \gamma(1)$, will be a stationary point for the travel time $$T(\gamma) = \int_0^1 \frac{\|\dot\gamma(t)\|}{v(\gamma(t))}\,dt,$$ in the sense that $$\frac{d}{d\epsilon}\bigg|_{\epsilon=0} T(\gamma + \epsilon\,\eta) = 0$$ for any "perturbation" $\eta$ verifying $\eta(0) = \eta(1) = 0$. Furthermore, this condition turns out to characterize the paths that light can take.
In fact, all fundamental physical theories can be written in the form $\frac{\delta S}{\delta \gamma} = 0$ for a suitable "action" $S$, where $\frac{\delta}{\delta \gamma}$ denotes a (variational) derivative with respect to the process trajectory $\gamma$. Such an equation is called a stationary action principle.
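For instance, a single particle of mass $m$ moving in a potential $V$ obeys a stationary action principle with
$$S[x] = \int_{t_0}^{t_1} \Big( \tfrac{1}{2}\, m\, \|\dot x(t)\|^2 - V\big(x(t)\big) \Big)\, dt, \qquad \frac{\delta S}{\delta x} = -\,m\,\ddot x - \nabla V(x),$$
so the stationarity condition $\delta S / \delta x = 0$ unpacks to Newton's second law $m\,\ddot x = -\nabla V(x)$.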
Given a physical system, a function of the system state is called a conserved quantity when its value is constant over any given physical trajectory. For dynamics expressed by a stationary action principle, Noether's principle gives a remarkable correspondence between conserved quantities and certain invariances of the action $S$.
For the purposes of this article, we will not give more details on Noether's principle. The interested reader is invited to seek out Chapter 4 of Arnold's Mathematical Methods of Classical Mechanics for a concise description from the point of view of Lagrangian mechanics. (For a much more complete reference, including the more subtle "off-shell" version applicable to local gauge transformations, see Chapter 5 of Olver's Applications of Lie Groups to Differential Equations.) However, as a guide for the imagination, we illustrate two examples.
In physics, the famous two-body problem is described by the action $$S = \int \Big( \tfrac{1}{2} m_1 \|\dot x_1\|^2 + \tfrac{1}{2} m_2 \|\dot x_2\|^2 + \frac{G\, m_1 m_2}{\|x_1 - x_2\|} \Big)\, dt,$$ where $x_1$ and $x_2$ are the positions of our two bodies and $m_1$ and $m_2$ are their masses. The integrand does not change if the trajectories $x_1$ and $x_2$ are simultaneously translated or rotated, so these transformations are invariances of $S$. Applying Noether's theorem gives us three conserved quantities—one for each degree of freedom in our group of transformations—which turn out to be horizontal, vertical, and angular momentum.
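To spell out one of these conservation laws, write $L$ for the integrand of the action above and consider translating both bodies by $\epsilon v$ for a fixed vector $v$. Since $L$ is unchanged, Noether's theorem attaches to this one-parameter symmetry the conserved quantity
$$\Big\langle \frac{\partial L}{\partial \dot x_1},\, v \Big\rangle + \Big\langle \frac{\partial L}{\partial \dot x_2},\, v \Big\rangle \;=\; \big\langle\, m_1 \dot x_1 + m_2 \dot x_2,\; v \,\big\rangle.$$
Since this holds for every $v$, the total linear momentum $m_1 \dot x_1 + m_2 \dot x_2$ is conserved; the rotational symmetry produces angular momentum in the same way.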
In the illustration above, we consider two bodies with equal mass. Conservation of linear momentum means that the sum of the arrows is constant, while conservation of angular momentum means that the sum of highlighted areas is constant.
For a less familiar example of a conservation law, consider a ray of light passing through a rotationally symmetric lens. Rotating a trajectory about the center of the lens does not affect the time needed to traverse it. What conserved quantity does this symmetry produce?
At each instant $t$, let $p(t)$ be a vector pointing in the direction of $\dot\gamma(t)$ with norm equal to $1/v(\gamma(t))$. (We've drawn this vector in the widget above.) This vector can be seen as the gradient of the travel time $T$ with respect to the ray's current position $\gamma(t)$: it gives the marginal price we would pay to move in a certain direction at a particular instant. Noether's principle tells us that $(\gamma(t) - c) \times p(t)$ is conserved, where $c$ is the center of the lens. In our illustration, this conserved quantity is represented by the area of the triangle.
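This can also be checked directly, without invoking Noether's principle, from the standard ray equation $\frac{d}{ds}\big(n\,\gamma'\big) = \nabla n$, where $s$ is arclength, $\gamma' = d\gamma/ds$, and $n = 1/v$ is the refractive index (in units where the vacuum speed of light is $1$). With $p = n\,\gamma'$ as above, rotational symmetry of the lens means $\nabla n$ is parallel to $\gamma - c$, so
$$\frac{d}{ds}\Big[\,(\gamma - c) \times p\,\Big] \;=\; \gamma' \times \big(n\,\gamma'\big) \;+\; (\gamma - c) \times \nabla n(\gamma) \;=\; 0.$$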
In machine learning, we routinely deal with processes whose control parameters are chosen to minimize some quantity. For example, say we are trying to solve a supervised learning problem by choosing parameters $\theta_1, \dots, \theta_L$ so that a composition $f_{\theta_L} \circ \cdots \circ f_{\theta_1}$ minimizes some expected loss $\mathbb{E}\big[\ell\big(f_{\theta_L} \circ \cdots \circ f_{\theta_1}(X),\, Y\big)\big]$ with respect to some random variables $X$ and $Y$.
Let's view the sequence of random variables $X_0 = X,\ X_1 = f_{\theta_1}(X_0),\ \dots,\ X_L = f_{\theta_L}(X_{L-1})$ as the states of a discrete-time "process." After optimization, the trajectory $(X_0, X_1, \dots, X_L)$ is approximately a critical point for a loss function. As we saw above, physical trajectories are also characterized as critical points of certain functions. Can we think about our trajectory of intermediate values from a physical point of view?
Of course, several things are different. Most obviously, "time" is now discrete rather than continuous. Furthermore, our trajectory of intermediate values is constrained in a different way. In physics, we specified boundary conditions and allowed the intermediate states of the process to vary, but in machine learning we specify an initial condition and constrain the trajectory to be driven by some parameters $\theta_1, \dots, \theta_L$. Fortunately, Noether's principle continues to apply!
To see how, let's consider a toy example that's easy to visualize. First, we'll define a family of "deformations" of the plane, as depicted below. (Click the screen or drag the slider to change the parameters of the deformation.)
Next, we'll set up some simple optimization problems. In each case, we'll try to find a sequence of small deformations that work together to send some "clusters" to their corresponding "destinations." This is easy to do by initializing the parameters randomly and optimizing them with gradient descent. In the following widget, a composition of 50 maps is able to swap the positions of three clusters.
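To make this concrete, here is a minimal JAX sketch of a setup of this kind. The deformation family, cluster layout, and hyperparameters are stand-ins chosen for illustration (the widget's own family is not reproduced here); the family $f_\theta(x) = x + a\,e^{-\|x - \mu\|^2 / s^2}$ is used because, like the one above, it is closed under conjugation by translations and rotations.

```python
# Sketch of the toy setup in JAX. The deformation family below is a stand-in for
# the one in the widget: a localized "push" of the plane, closed under
# conjugation by translations and rotations.
import jax
import jax.numpy as jnp

N_LAYERS, N_STEPS, LR = 50, 3000, 0.05

def deform(theta, x):
    # f_theta(x) = x + a * exp(-|x - mu|^2 / s^2), with theta = (a, mu, log_s)
    w = jnp.exp(-jnp.sum((x - theta["mu"]) ** 2, -1, keepdims=True)
                / jnp.exp(2 * theta["log_s"]))
    return x + theta["a"] * w

def forward(params, x):
    for theta in params:
        x = deform(theta, x)
    return x

def loss(params, x, target):
    return jnp.mean(jnp.sum((forward(params, x) - target) ** 2, -1))

# Three clusters, each sent to the next cluster's location (a cyclic swap).
key, knoise = jax.random.split(jax.random.PRNGKey(0))
centers = jnp.array([[0.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
labels = jnp.repeat(jnp.arange(3), 64)
x0 = centers[labels] + 0.1 * jax.random.normal(knoise, (192, 2))
target = x0 + centers[(labels + 1) % 3] - centers[labels]

# Random initialization of the 50 deformations, then plain gradient descent.
params = [{"a": 0.05 * jax.random.normal(k, (2,)),
           "mu": 1.5 * jax.random.normal(jax.random.fold_in(k, 1), (2,)),
           "log_s": jnp.zeros(())}
          for k in jax.random.split(key, N_LAYERS)]

grad_fn = jax.jit(jax.grad(loss))
for _ in range(N_STEPS):
    params = jax.tree_util.tree_map(lambda p, g: p - LR * g,
                                    params, grad_fn(params, x0, target))
print("final loss:", loss(params, x0, target))
```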
Now, observe that our family of deformations is invariant under conjugation by translations and rotations. This means that, where $T$ is a translation or rotation, we have an equality of sets $\{\, T \circ f_\theta \circ T^{-1} \,\} = \{\, f_\theta \,\}$ as $\theta$ ranges over all parameter values. In other words, our family "looks the same" from a translated or rotated coordinate system. Moreover, conjugating a map $f_\theta$ by a transformation $T$ near the identity can be realized by making a small change to the parameters $\theta$ that define it. As it turns out, this is the right notion of invariance to apply Noether's principle. What conserved quantity do we get?
For each data point, let $g_0, g_1, \dots, g_L$ be the backpropagated gradients of the loss function with respect to the intermediate positions $x_0, x_1, \dots, x_L$ that this data point takes through our "process." Applying a certain discrete Noether's principle, detailed at the end of this post, will turn invariance of the family under translations into conservation of the expected value $\mathbb{E}[g_i]$ from one layer to the next. (You can display the individual vectors $g_i$ in the widget above.) Invariance under rotation gives a conserved quantity reminiscent of angular momentum, namely $\mathbb{E}[x_i \times g_i]$.
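Continuing the sketch above (and reusing its `deform`, `params`, `x0`, and `target`), one can compute the backpropagated gradients at every intermediate layer and check both quantities numerically; once the parameters are near a stationary point, the printed values should be roughly constant across layers.

```python
# Gradients of the loss with respect to each intermediate position, and the two
# (approximately) conserved quantities: E[g_i] and the 2D "angular" term E[x_i x g_i].
def tail_loss(x_i, tail_params):
    for theta in tail_params:
        x_i = deform(theta, x_i)
    return jnp.mean(jnp.sum((x_i - target) ** 2, -1))

x, xs, gs = x0, [], []
for i in range(len(params) + 1):
    xs.append(x)
    gs.append(jax.grad(tail_loss)(x, params[i:]))  # gradient w.r.t. the i-th intermediate positions
    if i < len(params):
        x = deform(params[i], x)

for i, (x, g) in enumerate(zip(xs, gs)):
    mean_g = g.mean(axis=0)                                    # "linear momentum"
    angular = jnp.mean(x[:, 0] * g[:, 1] - x[:, 1] * g[:, 0])  # "angular momentum"
    print(f"layer {i:2d}   E[g] = {mean_g}   E[x × g] = {float(angular):.6f}")
```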
In this particular example, it turns out that $\mathbb{E}[g_i]$ is also approximately conserved (and nearly zero) over each individual cluster. In the next widget, we add some examples where the per-cluster gradient is not conserved. (Each cluster now has its average gradient drawn with a single arrow. For simplicity, we are not visualizing the "angular gradient.")
In the physical world, interactions are often accompanied by transfers of conserved quantities. Think, for example, of particles in a gas exchanging momentum and energy. Conversely, if two systems are not interacting with one another, then conservation laws will hold for each system in isolation. In mechanics, this is known as Newton's first law.
In our first toy example, the subgoals of bringing each cluster to its respective destination could be achieved without compromise between clusters, meaning that each cluster acted like an independent subsystem. This lack of interaction explains the conservation of gradient within each cluster. (In other words, we can apply Newton's first law.) In our next examples, "transfers" of average gradient indicate "interactions" between clusters, meaning that the trajectory of our process is no longer a stationary point for each subgoal considered in isolation. On the other hand, we can sometimes identify subsets of clusters over which the gradient is approximately conserved.
In general, if the layers of a deep model have meaningful "Noether equivariances," then we can build conserved quantities and perhaps study a physically-inspired notion of "interaction." However, I don't know if this idea works in practice! Can you think of interesting equivariances that we might find on real-world machine learning models? (Keep in mind that they might only be valid near the actual parameters of the model.)
In the above, we've been intentionally vague about the details of Noether's principle and how it applied to our toy example. We end this post with a general proof of our "discrete" Noether's principle.
We'll focus on a single step of an "optimized process." Let $f_\theta$ describe some family of functions parameterized over $\theta$, and define the action $$A \;=\; \big\langle g_y,\; y - f_\theta(x) \big\rangle \;+\; \big\langle g_x,\; x \big\rangle,$$ where $x$ and $y$ are understood to be random variables, $g_x$ and $g_y$ are their associated loss gradients, and inner products are taken in expectation. In the typical context of supervised learning, $\theta$ will parameterize a layer of a deep model, and (for example) $\langle g_x, x\rangle$ stands for a sum $\frac{1}{N}\sum_{i=1}^{N} \langle g_x^{(i)}, x^{(i)}\rangle$, where $i$ indexes over a dataset and $x^{(i)}, g_x^{(i)}$ are activation and gradient vectors for the $i$th data point.
The "forward pass" and "backward pass" are encoded by the stationarity of $A$ with respect to $g_y$ and $x$, since $$\frac{\partial A}{\partial g_y} = y - f_\theta(x) \qquad\text{and}\qquad \frac{\partial A}{\partial x} = g_x - \Big(\frac{\partial f_\theta}{\partial x}\Big)^{\!\top} g_y.$$ Similarly, first-order optimality of the parameter $\theta$ can be expressed as $$\frac{\partial A}{\partial \theta} = -\Big(\frac{\partial f_\theta}{\partial \theta}\Big)^{\!\top} g_y = 0.$$ In practice, this will hold only approximately, so our conservation law will also be approximate. We ignore this detail for simplicity.
Now, suppose that $x$, $y$, and $\theta$ are made to depend on a real-valued parameter $\epsilon$ (with the original values at $\epsilon = 0$) and that $f_{\theta_\epsilon}(x_\epsilon) = y_\epsilon$ for every $\epsilon$. For example, when we have a Lie group $G$ acting on $x$, $y$, and $\theta$ such that $f_{T \cdot \theta}(T \cdot x) = T \cdot f_\theta(x)$ for every $T \in G$, then any one-parameter subgroup of $G$ gives rise to such a variation of $(x, y, \theta)$. This is the case for our toy example, where $G$ is the group of translations and rotations of the plane. I'm calling this kind of symmetry a Noether equivariance.
Let's compute $\frac{d}{d\epsilon}\big|_{\epsilon=0} A(x_\epsilon, y_\epsilon, \theta_\epsilon)$. Applying the relation $f_{\theta_\epsilon}(x_\epsilon) = y_\epsilon$, we find $$A(x_\epsilon, y_\epsilon, \theta_\epsilon) = \big\langle g_y,\; y_\epsilon - f_{\theta_\epsilon}(x_\epsilon)\big\rangle + \big\langle g_x,\; x_\epsilon\big\rangle = \big\langle g_x,\; x_\epsilon\big\rangle, \qquad\text{so}\qquad \frac{dA}{d\epsilon}\Big|_{\epsilon=0} = \Big\langle g_x,\; \frac{dx_\epsilon}{d\epsilon}\Big|_{\epsilon=0}\Big\rangle.$$ On the other hand, if $A$ is stationary with respect to $x$ and $\theta$, then $\frac{dA}{d\epsilon}$ can be explained entirely by the variation of $y$, which is $$\frac{dA}{d\epsilon}\Big|_{\epsilon=0} = \Big\langle \frac{\partial A}{\partial y},\; \frac{dy_\epsilon}{d\epsilon}\Big|_{\epsilon=0}\Big\rangle = \Big\langle g_y,\; \frac{dy_\epsilon}{d\epsilon}\Big|_{\epsilon=0}\Big\rangle.$$ Overall, we conclude that $$\Big\langle g_x,\; \frac{dx_\epsilon}{d\epsilon}\Big|_{\epsilon=0}\Big\rangle = \Big\langle g_y,\; \frac{dy_\epsilon}{d\epsilon}\Big|_{\epsilon=0}\Big\rangle,$$ with inner products taken in expectation: the pairing of the loss gradient with the infinitesimal group action is the same at the input and at the output of the layer. This is quite analogous to the conservation law that Noether's principle implies for a Hamiltonian system invariant under a family of canonical transformations.
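To connect this back to the toy example: for a one-parameter family of translations we have $\frac{dx_\epsilon}{d\epsilon} = \frac{dy_\epsilon}{d\epsilon} = u$ for a fixed vector $u$, and for rotations about the origin $\frac{dx_\epsilon}{d\epsilon} = Jx$ and $\frac{dy_\epsilon}{d\epsilon} = Jy$, with $J$ the generator of rotations (a $90°$ rotation of the plane). The conservation law then reads
$$\big\langle \mathbb{E}[g_x],\, u \big\rangle = \big\langle \mathbb{E}[g_y],\, u \big\rangle
\qquad\text{and}\qquad
\mathbb{E}\big[\langle g_x,\, J x\rangle\big] = \mathbb{E}\big[\langle g_y,\, J y\rangle\big],$$
i.e. the expected gradient and the expected "angular" pairing $\mathbb{E}[x \times g]$ are unchanged from one layer to the next.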
One simple kind of Noether equivariance is a relationship like $\{\, T \circ f_\theta \,\} = \{\, f_\theta \,\}$, meaning that our family of maps is expressive enough to be closed under the action of a group $G$ on its outputs. In this case, $\frac{dx_\epsilon}{d\epsilon} = 0$ since the action on $x$ is trivial, and the "conservation law" associated with any one-parameter subgroup of $G$ simply reads $$\Big\langle g_y,\; \frac{dy_\epsilon}{d\epsilon}\Big|_{\epsilon=0}\Big\rangle = 0.$$ Indeed, this inner product is exactly the derivative of the loss with respect to the transformation of $y$ by a given one-parameter subgroup of $G$. If variations to $\theta$ can produce these transformations and $\theta$ is chosen optimally, the inner product must be $0$. (Informally, this explains why the average gradients in our toy example are approximately $0$.)
The same idea applies to a group acting only on $x$ and $\theta$, in which case we get $\big\langle g_x, \frac{dx_\epsilon}{d\epsilon}\big|_{\epsilon=0}\big\rangle = 0$. For example, if $x$ is the input to a linear layer, we get a "Noether equivariance" in the form of an action of $GL(n)$ on $x$ and the layer parameters: replacing $x$ by $Mx$ and the weight matrix $W$ by $WM^{-1}$ leaves the layer's output unchanged. Letting $\frac{dx_\epsilon}{d\epsilon}$ range over all linear functions of $x$ in the equation above—which we get from the action of one-parameter subgroups of $GL(n)$—shows that the tensor product $g_x \otimes x$ will vanish in expectation if the loss is stationary with respect to the parameters of the linear layer. The general principle we proved above can be understood as the "two-sided" extension of these kinds of observations.
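Here is a small self-contained numerical check of this last observation, using a stand-in two-layer model and dataset rather than anything from the post: after running gradient descent to near-stationarity, the outer product $\mathbb{E}[g_x \otimes x]$ at the input of the first linear layer is compared with the size of that layer's weight gradient.

```python
# Check: at a stationary point of the loss, the outer product E[g_x x^T] between a
# linear layer's input x and the backpropagated gradient g_x at that input vanishes.
import jax
import jax.numpy as jnp

k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
X = jax.random.normal(k1, (512, 3))                    # inputs to the first linear layer
Y = jnp.sin(X @ jax.random.normal(k2, (3, 2)))         # arbitrary regression targets

params = {"W1": 0.5 * jax.random.normal(k3, (3, 8)), "b1": jnp.zeros(8),
          "W2": 0.5 * jax.random.normal(k4, (8, 2)), "b2": jnp.zeros(2)}

def loss(params, x, y):
    h = jnp.tanh(x @ params["W1"] + params["b1"])      # first (linear + tanh) layer
    return jnp.mean(jnp.sum((h @ params["W2"] + params["b2"] - y) ** 2, -1))

grad_fn = jax.jit(jax.grad(loss))
for _ in range(50_000):                                # plain gradient descent to near-stationarity
    params = jax.tree_util.tree_map(lambda p, g: p - 0.05 * g,
                                    params, grad_fn(params, X, Y))

g_x = jax.grad(loss, argnums=1)(params, X, Y)          # backpropagated gradient at the layer's input
print("||dL/dW1|| :", jnp.linalg.norm(grad_fn(params, X, Y)["W1"]))
print("E[g_x ⊗ x]:\n", g_x.T @ X)                      # ~0 once dL/dW1 ~ 0
```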