Suppose that we sample a large number $N$ of independent random functions $f_1, \dots, f_N$ from a certain distribution and propose to solve a regression problem by choosing a linear combination $f_\theta = \frac{1}{\sqrt{N}} \sum_{i=1}^N \theta_i f_i$. For large $N$, adjusting the coefficients $\theta_i$ to fit some fixed constraints of the form $f_\theta(x_j) = y_j$ amounts to solving a highly underdetermined linear system, meaning that a high-dimensional space $S$ of coefficient vectors $\theta$ fits our constraints perfectly. So, choosing one element of $S$ requires some additional decision-making. To use the picturesque idea of a "loss landscape" over parameter space, our problem will have a ridge of equally performing parameters rather than just a single optimal peak.
Now, we make a very strange proposal. What if we simply initialize $\theta_i = 1$ for all $i$ and proceed by minimizing some loss function using gradient descent? If all goes well, we should end up with an element of $S$. Of course, many elements of $S$ give very bad models. To see this, it's enough to remember that we can expect a linear combination of $N$ generic random functions to fit any $N$ data points, so if we have $n < N$ data points, there exist models in $S$ that perfectly interpolate any adversarial selection of $N - n$ additional data points! Does gradient descent tend to make a "good" choice?
Let's test this empirically. In the widget below, I've chosen the functions $f_i$ by sampling trajectories of a Wiener process, also known as Brownian noise. Click anywhere to introduce data points $(x_j, y_j)$ and click the play button in the top right to run gradient descent for the squared loss $L(\theta) = \tfrac{1}{2} \sum_j (f_\theta(x_j) - y_j)^2$.
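If you'd rather tinker outside the browser, here is a rough NumPy sketch of the same experiment. The grid resolution, the number of random functions, the data points, and the learning rate below are arbitrary choices of mine rather than the widget's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretize [0, 1] and sample N independent Wiener trajectories f_i as cumulative
# sums of Gaussian increments, so that Cov(f_i(x), f_i(x')) = min(x, x').
M, N = 200, 500                      # grid resolution, number of random functions
xs = np.linspace(0, 1, M + 1)[1:]    # drop x = 0, where the process is pinned to 0
dx = xs[1] - xs[0]
F = np.cumsum(rng.normal(0, np.sqrt(dx), size=(N, M)), axis=1)   # F[i] = f_i on the grid

# Model: f_theta = (1 / sqrt(N)) * sum_i theta_i * f_i, initialized with theta_i = 1,
# so that f_theta at initialization is itself distributed as a Wiener process.
theta = np.ones(N)

# A few made-up data points (x_j, y_j); in the widget you would click these in.
x_data = np.array([0.2, 0.5, 0.8])
y_data = np.array([0.3, -0.4, 0.5])
idx = np.searchsorted(xs, x_data)    # grid indices approximating the inputs x_j

def f_theta(theta):
    return F.T @ theta / np.sqrt(N)

# Gradient descent on the squared loss L = (1/2) * sum_j (f_theta(x_j) - y_j)^2.
lr = 0.05
for _ in range(3000):
    err = f_theta(theta)[idx] - y_data            # residuals at the data points
    theta -= lr * (F[:, idx] / np.sqrt(N)) @ err  # dL/dtheta

print("fitted values:", f_theta(theta)[idx])      # should be close to y_data
```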
Interestingly, the functions we obtain are not that bad. They seem to concentrate around piecewise linear interpolations of our data. In fact, in the limit of many random functions, it turns out that running gradient descent to convergence has a meaningful statistical interpretation. Specifically, if we view the Wiener process as a prior, then running gradient descent to convergence samples from the posterior distribution conditioned on our data points. Since many optimal solutions to our minimization problem are meaningless, it is not possible to explain this fact if we see gradient descent as "just some optimization method." What explains its relative success?
As we will show in this post, our intriguing Bayesian interpretation can be explained by the relationship between the behavior of gradient descent steps, the statistical properties of our random functions $f_i$, and our initialization. In particular, it does not depend on the loss function (so long as it leads gradient descent to converge to an exact interpolation), but does depend significantly on our choice of parameters at initialization. Our analysis will rely on a "tangent kernel" of the sort introduced in the Neural Tangent Kernel paper by Jacot et al. Specifically, viewing gradient descent as a process occurring in the function space of our regression problem, we will find that its dynamics can be described in terms of a certain kernel function, which in this case is just the covariance kernel of the Wiener process, $k(x, x') = \min(x, x')$.
Of course, there are much easier ways to sample posteriors for low-dimensional Gaussian processes. Nevertheless, it's interesting to notice a relationship between Bayesian inference and gradient descent methods at large, since the latter tend to apply to situations where direct estimation of a posterior distribution is not practical. Furthermore, the results of Jacot et al. suggest that kernel-based interpretations may also hold for large, non-trivial neural networks. To the extent that this is true, we can use the kernel interpretation to reason about many sorts of neural network phenomena, including the benefits of early stopping, the existence of "implicit regularization", and the fact that overparameterization often increases performance despite the apparent risk of overfitting.
In this post, we will focus exclusively on the toy problem introduced above. Our discussion (which admittedly takes a bit of a scenic route) is divided into three sections.
First, we will discover how the covariance kernel of the process is related to the dynamics of gradient descent. (This is a toy case from Jacot's paper.)
Next, we recall the theory of reproducing kernel Hilbert spaces and show how the kernel-based behavior of gradient descent is related to regularization.
Finally, we recall some special properties of Gaussian processes and explain why regularization is related to the Bayesian interpretation of our trained model. (For more details, see Chapter 6 of Gaussian Processes for Machine Learning.)
Let's begin by considering the effect that a single step of gradient descent has on our function $f_\theta$. In general, the differential of a loss can be written as a sum of differentials $dL = \sum_j \frac{\partial L}{\partial f_\theta(x_j)} \, df_\theta(x_j)$, where $f_\theta(x_j)$ is the evaluation of $f_\theta$ at an input $x_j$, so by linearity it is enough for us to understand how $f_\theta$ "responds" to differentials of the form $c \, df_\theta(x)$.
In response to $c \, df_\theta(x)$, the parameters are assigned the differentials $c \, \frac{\partial f_\theta(x)}{\partial \theta_i} = \frac{c}{\sqrt{N}} f_i(x)$. So, gradient descent will adjust $\theta_i$ in proportion to the value of $f_i(x)$. In terms of $f_\theta$, we find that the value at another input $x'$ will change in proportion to $\sum_i \frac{\partial f_\theta(x')}{\partial \theta_i} \frac{\partial f_\theta(x)}{\partial \theta_i} = \frac{1}{N} \sum_i f_i(x) \, f_i(x')$. Note that this expression is independent of the coefficients $\theta_i$. This means that the gradient descent step we apply to $f_\theta$ depends only on our learning rate and the differential of the loss at $x$. In other words, we can view gradient descent as a process happening in the function space of our regression problem.
For large $N$, we get the approximation $\frac{1}{N} \sum_i f_i(x) \, f_i(x') \approx \mathbb{E}\left[ f(x) \, f(x') \right]$, where $f$ is a fresh sample from our distribution of random functions. This last expression is familiar in the study of Gaussian processes; it is called the covariance kernel of $f$ and denoted $k(x, x')$. For the Wiener process, the covariance kernel takes the form $k(x, x') = \min(x, x')$. In the following, we'll assume $N$ is large enough for us to make the approximation confidently. Then, we conclude that a request to change $f_\theta(x)$ will cause gradient descent to push $f_\theta$ in the direction of the function $k(x, \cdot) = \min(x, \cdot)$.
You can get a visual sense of this behavior in the widget below. As above, I've generated $N$ trials of the Wiener process to use as my functions $f_i$. You can choose the request point $x$ by clicking on the graph, and your browser will compute the corresponding response $\frac{1}{N} \sum_i f_i(x) \, f_i(\cdot)$. For comparison, I've also drawn the prediction $k(x, \cdot) = \min(x, \cdot)$.
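For a numerical counterpart to the widget, the following sketch (with arbitrary grid and sample sizes of my choosing) compares the empirical average $\frac{1}{N} \sum_i f_i(x) f_i(x')$ against the Wiener covariance $\min(x, x')$ on a grid.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample N Wiener trajectories on a grid over (0, 1].
M, N = 200, 2000
xs = np.linspace(0, 1, M + 1)[1:]
dx = xs[1] - xs[0]
F = np.cumsum(rng.normal(0, np.sqrt(dx), size=(N, M)), axis=1)

# Empirical tangent kernel (1/N) * sum_i f_i(x) f_i(x') on the grid...
K_emp = F.T @ F / N

# ...versus the Wiener covariance kernel k(x, x') = min(x, x').
K_true = np.minimum.outer(xs, xs)

print("max abs deviation:", np.abs(K_emp - K_true).max())   # shrinks as N grows
```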
This is already a significant conclusion. In particular, it means that every step of gradient descent modifies $f_\theta$ by a linear combination of the functions $k(x_j, \cdot)$, where $x_j$ ranges over the inputs in our training set. Since the linear span of these functions is a certain space of piecewise affine functions, if we initialize $\theta_i = 0$ for all $i$ (so that $f_\theta = 0$) and run gradient descent to convergence with any reasonable loss function, we should approximately converge to a piecewise affine interpolation of our data points. We've run this experiment in the widget below.
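Here is a sketch of that experiment in NumPy, again with arbitrary settings of my own: we train from the zero initialization and compare the result against the exact kernel interpolant $\sum_j \alpha_j \min(x_j, \cdot)$ with $\alpha = K^{-1} y$, which is piecewise affine.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same setup as before: N Wiener trajectories on a grid over (0, 1].
M, N = 200, 2000
xs = np.linspace(0, 1, M + 1)[1:]
dx = xs[1] - xs[0]
F = np.cumsum(rng.normal(0, np.sqrt(dx), size=(N, M)), axis=1)

x_data = np.array([0.2, 0.5, 0.8])
y_data = np.array([0.3, -0.4, 0.5])
idx = np.searchsorted(xs, x_data)

# Zero initialization: f_theta = 0, so every update stays in the span of the k(x_j, .).
theta = np.zeros(N)
Fd = F[:, idx] / np.sqrt(N)          # derivatives of f_theta at the training inputs
lr = 0.05
for _ in range(5000):
    err = Fd.T @ theta - y_data      # residuals f_theta(x_j) - y_j
    theta -= lr * Fd @ err
trained = F.T @ theta / np.sqrt(N)

# Limiting kernel interpolant: sum_j alpha_j * min(x_j, x) with alpha = K^{-1} y,
# which is a piecewise affine function of x.
K = np.minimum.outer(x_data, x_data)
alpha = np.linalg.solve(K, y_data)
interp = np.minimum.outer(xs, x_data) @ alpha

print("max deviation from the piecewise affine interpolant:", np.abs(trained - interp).max())
```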
This new model has less variance than before. In fact, in the large-$N$ limit, its behavior will be exactly deterministic! In comparison, our previous model will always exhibit variance due to its random initialization $f_\theta = \frac{1}{\sqrt{N}} \sum_i f_i$, even for large $N$. In other words, when given a finite training set, gradient descent cannot entirely "forget" its initialization, even when run to convergence.
Another important conclusion is that, when we optimize least squares with gradient descent, the evolution of $f_\theta$ is linear, in the sense of approximately obeying a linear ODE. Indeed, for data points $(x_j, y_j)$, our loss differential will be $dL = \sum_j (f_\theta(x_j) - y_j) \, df_\theta(x_j)$. So, if we view $f_\theta$ as evolving under the flow of gradient descent with respect to a continuous time parameter $t$, we have $\frac{\partial f_\theta}{\partial t}(x) = -\eta \sum_j (f_\theta(x_j) - y_j) \, k(x_j, x)$, where the right-hand side is a linear function of the empirical error vector $e = (f_\theta(x_j) - y_j)_j$. Restricting this equation to the evaluation of $f_\theta$ over the input points $x_j$, we find that the error vector solves the ODE $\frac{de}{dt} = -\eta K e$, where $K$ is the matrix $(k(x_i, x_j))_{i,j}$. Since this matrix is positive-definite (it is a covariance matrix), we conclude that $e$ will converge to $0$ over the training process if our learning rate is sufficiently small. Furthermore, knowing the eigenvalues of $K$ lets us understand the nature of our convergence: gradient descent will "correct" the error along the components of the eigenbasis for $K$ with the largest eigenvalues first, and take longer to correct components with smaller eigenvalues.
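As a small illustration (with an arbitrary choice of training inputs, initial error, and step size), we can discretize the flow $\frac{de}{dt} = -\eta K e$ and check that the error components in the eigenbasis of $K$ decay like $e^{-\eta \lambda t}$, fastest along the largest eigenvalues.

```python
import numpy as np

# Kernel matrix of the Wiener process at three (arbitrary) training inputs.
x_data = np.array([0.2, 0.5, 0.8])
K = np.minimum.outer(x_data, x_data)
eigvals, eigvecs = np.linalg.eigh(K)

# Discretize the gradient flow de/dt = -eta * K * e, starting from an arbitrary error.
eta, dt, steps = 1.0, 0.01, 2000
e0 = np.array([1.0, -1.0, 1.0])
e = e0.copy()
for _ in range(steps):
    e = e - dt * eta * (K @ e)

# In the eigenbasis of K, each error component decays like exp(-eta * lambda * t),
# so the components with the largest eigenvalues are corrected first.
t = dt * steps
print("eigenvalues:          ", eigvals)
print("remaining components: ", eigvecs.T @ e)
print("predicted exp decay:  ", (eigvecs.T @ e0) * np.exp(-eta * eigvals * t))
```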
Now, we could have chosen a distribution of random functions with a different covariance kernel $k(x, x')$. Here the functions $k(x, \cdot) = \min(x, \cdot)$ were easy to interpret, but in general, what does it mean to fit a data set using a linear combination of functions like these? One interesting perspective comes from the idea of regularization, which we discuss next.
Consider a Hilbert space $H$ equipped with bounded projections $\pi_x : H \to \mathbb{R}$ for each $x$ in some index set $X$, and suppose that we want to find an element $h \in H$ that minimizes some loss function depending on the values $\pi_x(h)$ for $x$ belonging to some collection $D \subseteq X$. (Note that elements of $H$ can be viewed as functions from $X$ to $\mathbb{R}$ by viewing $\pi_x$ as the evaluation at $x$.) If this problem is underdetermined (which necessarily will happen if $H$ is infinite-dimensional and our collection $D$ is finite), then we may ask for an element that both minimizes our loss and has minimal norm in $H$. In machine learning, this is called regularization.
Let's write $\pi_D$ for the product of the projections $\pi_x$ over $x \in D$. Because $h$ is chosen with minimal norm, its norm cannot be made smaller by adjusting it by an element of $\ker \pi_D$, so $h$ is orthogonal to $\ker \pi_D$. But since the maps $\pi_x$ are continuous, they can be represented by vectors $k_x \in H$ in the sense that $\pi_x(g) = \langle k_x, g \rangle$ for every $g \in H$. (This is the Riesz representation theorem.) Since $\ker \pi_D$ can be described as the orthogonal complement to the set $\{ k_x : x \in D \}$, the orthogonal complement to $\ker \pi_D$ is exactly the closure of the span of the vectors $k_x$. We conclude that any regularized solution to our loss function is a (limit of) linear combinations of these vectors.
What are the projections of the "representative elements" $k_x$? By our own definition, we have $\pi_{x'}(k_x) = \langle k_{x'}, k_x \rangle$ for any other index $x'$. This last expression is a positive semidefinite kernel, which we will denote $k(x, x') = \langle k_x, k_{x'} \rangle$. In other words, the norm on $H$ and the projections $\pi_x$ work together to produce a kernel function whose partial evaluations $k(x, \cdot)$ help us solve optimization problems regularized by the norm of $H$.
In the literature, a Hilbert space equipped with bounded projections indexed over a set is called a reproducing kernel Hilbert space (or RKHS). In fact, we can also go in the other direction: every positive definite kernel on is "reproduced" by some RKHS, which also turns out to be unique in a certain sense. This is known as the Moore-Aronszajn theorem.
What RKHS corresponds to our kernel $k(x, x') = \min(x, x')$? In general, determining the RKHS of a kernel is not entirely straightforward. In fact, notice that for positive definite kernels over a finite set, the inner product for the RKHS, expressed in the dual basis for our projections, turns out to be given by the inverse of the matrix encoded by our kernel. Indeed, where $k_x$ are representatives of the projections and $\{ e_x \}$ is a dual basis verifying $\pi_{x'}(e_x) = \delta_{x x'}$, we find that $\langle e_x, e_{x'} \rangle = (K^{-1})_{x x'}$, where $K = (k(x, x'))_{x, x'}$ is the kernel matrix. So, interpreting a RKHS norm in terms of projections of elements requires solving some sort of inverse problem.
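A quick numerical sanity check of this claim, using the Wiener kernel on three arbitrary points: expressing the dual basis in terms of the representatives and computing its Gram matrix recovers $K^{-1}$, so a function with value vector $v$ has squared RKHS norm $v^\top K^{-1} v$.

```python
import numpy as np

# Wiener kernel on a small finite index set.
x = np.array([0.2, 0.5, 0.8])
K = np.minimum.outer(x, x)           # Gram matrix <k_x, k_x'> = k(x, x')
K_inv = np.linalg.inv(K)

# Express the dual basis in terms of the representatives: e_x = sum_x' (K^{-1})_{x x'} k_x'.
# Then pi_y(e_x) = sum_x' (K^{-1})_{x x'} k(x', y) = (K^{-1} K)_{x y} = delta_{x y}.
A = K_inv                            # rows = coefficients of each e_x in the k-basis

# Gram matrix of the dual basis: <e_x, e_y> = (A K A^T)_{x y} = (K^{-1})_{x y}.
print(np.allclose(A @ K @ A.T, K_inv))   # True

# Consequently, a function with value vector v has squared RKHS norm v^T K^{-1} v.
v = np.array([0.3, -0.4, 0.5])
print(v @ K_inv @ v)
```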
The RKHS for a centered Gaussian process $f$ can be viewed as an isometric embedding of the observables $f(x)$ with respect to the $L^2$ norm for the process measure. Specifically, if we define $k_x = f(x)$ as an element of $L^2$, then clearly $\langle k_x, k_{x'} \rangle = \mathbb{E}[f(x) f(x')] = k(x, x')$. Indeed, the space of observables of a Gaussian process is already a RKHS for its covariance kernel, if we take the projections to be the maps $\pi_x : Z \mapsto \mathbb{E}[Z \, f(x)]$. However, we would like to view the RKHS more directly as a space of functions.
We may begin by observing, then, that observables of the Wiener process can be isometrically mapped into $L^2([0, 1])$ by sending $f(x)$ to the indicator function $\mathbf{1}_{[0, x]}$. Under this perspective, our projections become $\pi_x : g \mapsto \int_0^x g(t) \, dt$. Ultimately, we are led to view the RKHS of the Wiener process as the Sobolev space of absolutely continuous functions $h$ such that $h(0) = 0$ and such that the norm $\lVert h \rVert^2 = \int h'(t)^2 \, dt$ is finite. In fact, solving the interpolation problem $h(x_j) = y_j$ regularized by this norm results in the piecewise affine interpolations we observed in the widget above.
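The following sketch (with made-up data points) checks both claims numerically: the representer-theorem solution $\sum_j \alpha_j \min(x_j, \cdot)$ with $\alpha = K^{-1} y$ is the piecewise linear interpolant of our data (extended by a constant past the last point), and its Sobolev norm $\int h'(t)^2 \, dt$ agrees with $y^\top K^{-1} y$.

```python
import numpy as np

# Made-up data points to interpolate; the constraint h(0) = 0 is built into the space.
x_data = np.array([0.2, 0.5, 0.8])
y_data = np.array([0.3, -0.4, 0.5])

# Representer-theorem solution: h = sum_j alpha_j * min(x_j, .) with alpha = K^{-1} y.
K = np.minimum.outer(x_data, x_data)
alpha = np.linalg.solve(K, y_data)
xs = np.linspace(0, 1, 1001)
h = np.minimum.outer(xs, x_data) @ alpha

# Straight-line interpolation through (0, 0) and the data points; np.interp extends it
# by a constant past the last point, matching the minimum-norm solution, whose
# derivative vanishes there.
h_pl = np.interp(xs, np.concatenate(([0.0], x_data)), np.concatenate(([0.0], y_data)))
print("max difference:", np.abs(h - h_pl).max())        # essentially zero

# The squared Sobolev norm int h'(t)^2 dt of the interpolant agrees with y^T K^{-1} y.
norm_sq = np.sum(np.diff(h) ** 2 / np.diff(xs))
print(norm_sq, y_data @ np.linalg.solve(K, y_data))
```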
So far, we have shown that the relationship between gradient descent and kernel functions gives our training procedure a distinct flavor of implicit regularization. We did not have a penalty function in mind when we set up our problem, but our distribution of random functions ended up making our gradient updates interpretable in terms of an RKHS for an associated kernel function. In the last section of this post, we address how this fact is related to the statistical idea of a conditional distribution for a Gaussian process.
When $X$ and $Y$ are jointly Gaussian distributed, we know that the remainder $X - \mathbb{E}[X \mid Y]$ of the conditional expectation is independent from $Y$. So, we can decompose $X$ into two components, $X = (X - \mathbb{E}[X \mid Y]) + \mathbb{E}[X \mid Y]$, the first being a Gaussian variable independent from $Y$ and the second being $Y$-measurable. This clarifies the nature of the conditional distribution of $X$ on $Y$: it will have constant variance equal to the variance of the remainder $X - \mathbb{E}[X \mid Y]$, and mean equal to a linear function of $Y$. In particular, if we want to sample the conditional distribution of $X$ given $Y = y$, we could take $(X - \mathbb{E}[X \mid Y]) + \mathbb{E}[X \mid Y = y]$, where the apparently nonsensical conditional expectation on the RHS should be interpreted as the evaluation at $y$ of the conditional expectation viewed as a function of $Y$. Keep in mind that this is a very special property of Gaussian distributions; in general, the distribution of the remainder conditional on $Y$ will depend on $Y$, and so we won't be able to sample the conditional distribution under another "counterfactual" value $Y = y$ simply by translating a sample of the remainder.
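Here is a minimal sketch of this translation trick for a single pair of jointly Gaussian variables with an arbitrary covariance of my choosing; the empirical mean and variance of the translated samples match the analytic conditional distribution.

```python
import numpy as np

rng = np.random.default_rng(3)

# A jointly Gaussian pair (X, Y) with an arbitrary covariance matrix.
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])
samples = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
X, Y = samples[:, 0], samples[:, 1]

# For jointly Gaussian variables, E[X | Y] = (Cov(X, Y) / Var(Y)) * Y.
a = cov[0, 1] / cov[1, 1]
remainder = X - a * Y                # independent of Y

# Sample the conditional distribution of X given Y = y by translating the remainder.
y = 1.5
X_cond = remainder + a * y

# Compare against the analytic conditional law N(a * y, Var(X) - a * Cov(X, Y)).
print("empirical mean / var:", X_cond.mean(), X_cond.var())
print("analytic  mean / var:", a * y, cov[0, 0] - a * cov[0, 1])
```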
Now, consider a random function $f$ drawn from a Gaussian distribution and let $F = (f(x_1), \dots, f(x_n))$ give the values of our trajectory on a finite set of inputs. If we want to produce a sample from the distribution of $f$ conditional on some data $F = y$, we can take $f - \mathbb{E}[f \mid F] + \mathbb{E}[f \mid F = y]$. But, as it turns out, the conditional expectation $\mathbb{E}[f \mid F = y]$ will be exactly the function in the RKHS of our process that solves the constraint $f(x_j) = y_j$ regularized by the RKHS norm! Since gradient descent started at the prior sample $f$ converges to $f$ plus the minimum-norm interpolant of the residuals $y - F$, and that interpolant is linear in its data, the trained function is exactly $f - \mathbb{E}[f \mid F] + \mathbb{E}[f \mid F = y]$. This explains the Bayesian interpretation of our toy model.
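To make this concrete, the sketch below simulates the sample-then-optimize procedure in its large-$N$, exact-kernel limit (adding the closed-form correction $k(\cdot, x_D) K^{-1} (y - f_0(x_D))$ to each prior draw rather than literally running gradient descent) and compares the empirical mean and variance of the results against the analytic Gaussian process posterior. The grid, data points, and number of draws are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(4)

# Grid over (0, 1], training inputs on the grid, and training targets.
xs = np.linspace(0.01, 1, 100)
idx = np.array([19, 49, 79])
x_data = xs[idx]                                  # roughly 0.2, 0.5, 0.8
y_data = np.array([0.3, -0.4, 0.5])

K = np.minimum.outer(x_data, x_data)              # Wiener kernel at the training inputs
K_inv = np.linalg.inv(K)
k_star = np.minimum.outer(xs, x_data)             # k(x, x_j) for every grid point x

# Prior draws f0 from the Wiener process, as cumulative sums of Gaussian increments.
n_draws = 20_000
dx = np.diff(np.concatenate(([0.0], xs)))
f0 = np.cumsum(rng.normal(0.0, np.sqrt(dx), size=(n_draws, xs.size)), axis=1)

# Sample-then-optimize in closed form: each trained function is the prior draw plus the
# kernel correction k(., x_D) K^{-1} (y - f0(x_D)) that gradient descent converges to.
trained = f0 + (y_data - f0[:, idx]) @ K_inv @ k_star.T

# Analytic GP posterior: mean k_* K^{-1} y, variance k(x, x) - k_* K^{-1} k_*^T.
post_mean = k_star @ (K_inv @ y_data)
post_var = xs - np.einsum('ij,jk,ik->i', k_star, K_inv, k_star)

print("max mean error:", np.abs(trained.mean(axis=0) - post_mean).max())
print("max var  error:", np.abs(trained.var(axis=0) - post_var).max())
```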
One way to understand this is just to write out the expression for $\mathbb{E}[f(x) \mid F]$ at a given value of $x$. We know that this will be the linear function of $F$ uniquely determined by the requirement that the remainder $f(x) - \mathbb{E}[f(x) \mid F]$ be uncorrelated with $F$. Where $K$ is the covariance matrix of $F$ and $k_x = (k(x, x_j))_j$ gives the covariance of $f(x)$ with the components of $F$, this equation can be written as $K a = k_x$ and $\mathbb{E}[f(x) \mid F] = a^\top F = k_x^\top K^{-1} F$. We conclude that the function $x \mapsto \mathbb{E}[f(x) \mid F = y]$ is a linear combination of the functions $k(x_j, \cdot)$ (the coordinates of the vector $k_x$) with constant coefficients $K^{-1} y$, determined by the constraint that it should agree with $y$ at the points $x_j$. But, as we saw above, this is the same as the solution to the problem of interpolating some constraints regularized by the RKHS norm of our process.
To see this connection more directly, remember that the mean of a Gaussian distribution coincides with the mode (the point of highest probability density under linear coordinates). So, for example, $\mathbb{E}[f(x) \mid F = y]$ is exactly the value $v$ that minimizes the quadratic form $(v, y)^\top \Sigma^{-1} (v, y)$, where $\Sigma^{-1}$ is the inverse of the covariance matrix for the joint Gaussian vector $(f(x), F)$. But from the previous section, we know that $\Sigma^{-1}$ expresses the inner product of the RKHS derived from the covariance kernel of $(f(x), F)$ in the dual basis for the projections. So in fact we are asking for the value of $f(x)$ that satisfies the constraint $F = y$ and is regularized by the RKHS norm corresponding to a restriction of the covariance kernel of our process, which by the representer theorem will be a linear combination of restrictions of the functions $k(x_j, \cdot)$.
Abstractly, whenever we have a positive definite kernel $k$ on a set $X$ with RKHS $H$ and a finite subset $D \subseteq X$, we get a natural projection onto the RKHS $H_D$ for the kernel restricted to $D$, given by $h \mapsto (\pi_x(h))_{x \in D}$. Given a vector $v \in \mathbb{R}^D$, what is the norm of $v$ in $H_D$? Since this projection is an isometry over the span of the elements $k_x$ for $x \in D$, we can view $\lVert v \rVert_{H_D}$ as the minimum possible norm for an element $h \in H$ solving the equations $\pi_x(h) = v_x$ for $x \in D$. In particular, solving a regularized problem over $H$ that depends only on the projections $\pi_x$ for $x \in D$ and then restricting the solution to $D$ is the same as restricting to $D$ and solving the regularized problem with respect to the norm on $H_D$.
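As a numerical illustration of this compatibility, we can use a larger finite set $D'$ as a stand-in for the ambient space: the $H_D$ norm of a value vector $v$ coincides with the minimum $H_{D'}$ norm over all extensions of $v$ to $D'$, computed below in closed form via a Schur complement. The particular sets and values are arbitrary choices of mine.

```python
import numpy as np

# Wiener kernel on a finite set D', with a subset D standing in for the inclusion of H_D in H.
xD_prime = np.array([0.1, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95])
in_D = np.array([False, True, False, True, False, True, False])
xD = xD_prime[in_D]                               # 0.2, 0.5, 0.8

K_big = np.minimum.outer(xD_prime, xD_prime)
K_small = np.minimum.outer(xD, xD)

# Norm of a value vector v in H_D ...
v = np.array([0.3, -0.4, 0.5])
norm_D = v @ np.linalg.solve(K_small, v)

# ... versus the minimum H_{D'} norm over all extensions of v to D', obtained by
# minimizing the quadratic form over the free coordinates (a Schur complement).
P = np.linalg.inv(K_big)
A = P[np.ix_(in_D, in_D)]
B = P[np.ix_(in_D, ~in_D)]
C = P[np.ix_(~in_D, ~in_D)]
norm_min_ext = v @ (A - B @ np.linalg.solve(C, B.T)) @ v

print(norm_D, norm_min_ext)                       # the two norms agree
```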
As a final remark, note that we can informally imagine the RKHS of a Gaussian process as specifying the "energy" of the process in a statistical mechanics sense; although the RKHS norm is not defined on the same space of functions in which the process takes its values, we get the energy for the joint distribution of any finite projection $F = (f(x))_{x \in D}$ as a function of its value $v$ by solving $E(v) = \min \{ \lVert h \rVert_H^2 : \pi_x(h) = v_x \text{ for } x \in D \} = v^\top K_D^{-1} v$, so that the density of $F$ is proportional to $e^{-E(v)/2}$. This is the most satisfactory way that I've found to connect the interpretation of kernel functions in terms of regularization with their interpretation in terms of conditional expectations of a Gaussian process.