cgad.ski
2023-10-26

Suppose that we sample a large number $N$ of independent random functions $f_i \colon \R \to \R$ from a certain distribution $\Fc$ and propose to solve a regression problem by choosing a linear combination
$\bar{f} = \sum_i \lambda_i f_i.$
For large $N,$ adjusting the coefficients $\lambda_i$ to fit some fixed constraints of the form $\bar{f}(t_i) = y_i$ amounts to solving a highly underdetermined linear system, meaning that a high-dimensional space $\Lambda$ of vectors $(\lambda_1, \dots, \lambda_N)$ fit our constraints perfectly. So, choosing one element of $\Lambda$ requires some additional decision-making. To use the picturesque idea of a
"loss landscape" over parameter space, our problem will have a *ridge* of equally performing parameters rather than just a single optimal *peak*.

Now, we make a very strange proposal. What if we simply initialize $\lambda_i = 1/\sqrt{N}$ for all $i$ and proceed by minimizing some loss function using gradient descent? If all goes well, we should end up with an element of $\Lambda.$ Of course, many elements of $\Lambda$ give very bad models. To see this, it's enough to remember that we can expect a linear combination of $N$ random functions to fit *any* $N$ data points, so if we have $m$ data points, there exist models in $\Lambda$ that perfectly interpolate any adversarial selection of $N - m$ additional data points! Does gradient descent tend to make a "good" choice?

Let's test this empirically. In the widget below, I've chosen functions $f_i$ by sampling $200$ trajectories of a Wiener process, also known as Brownian noise. Click anywhere to introduce data points and click the play button in the top right to run gradient descent for the squared loss $\sum_i (\bar{f}(t_i) - y_i)^2.$
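If you're reading without the widget, here's a minimal sketch of the same experiment in Python. The grid, seed, data points, learning rate, and step count are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample N Wiener trajectories on a grid by accumulating independent Gaussian
# increments; these play the role of the random functions f_i.
N, grid_size = 200, 512
ts = np.linspace(0, 1, grid_size)
dt = ts[1] - ts[0]
F = np.hstack([np.zeros((N, 1)),
               np.cumsum(rng.normal(0, np.sqrt(dt), (N, grid_size - 1)), axis=1)])

# Hypothetical data points (t_i, y_i) standing in for clicks on the widget.
train_idx = np.array([102, 256, 409])
train_t, train_y = ts[train_idx], np.array([0.3, -0.2, 0.4])

# Initialize lambda_i = 1/sqrt(N) and run gradient descent on the squared loss.
lam = np.full(N, 1 / np.sqrt(N))
lr = 1e-3
for _ in range(20000):
    resid = lam @ F[:, train_idx] - train_y   # errors at the data points
    lam -= lr * 2 * F[:, train_idx] @ resid   # gradient of the squared loss

f_bar = lam @ F
print(np.max(np.abs(f_bar[train_idx] - train_y)))   # ~ 0: constraints are met
```

The trained `f_bar` fits the constraints essentially exactly, while off the data points it keeps the rough, Brownian character of its initialization.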

Interestingly, the functions we obtain are not that bad. They seem to concentrate around piecewise linear interpolations of our data. In fact, in the limit $N \to \infty$ of many random functions, it turns out that running gradient descent to convergence has a meaningful statistical interpretation. Specifically, if we view the Wiener process $\Fc$ as a prior, then **running gradient descent to convergence samples from the posterior for our data points.** Since many optimal solutions to our minimization problem are meaningless, it is not possible to explain this fact if we see gradient descent as "just some optimization method." What explains its relative success?

As we will show in this post, our intriguing Bayesian interpretation can be explained by the relationship between the behavior of gradient descent steps, the statistical properties of our random functions $f_i$, and our initialization. In particular, it does *not* depend on the loss function—so long as it leads gradient descent to converge to an exact interpolation—but *does* depend significantly on our choice of parameters at initialization. Our analysis will rely on a "tangent kernel" of the sort introduced in the *Neural Tangent Kernel* paper by Jacot et al. Specifically, viewing gradient descent as a process occurring in the function space of our regression problem, we will find that its dynamics can be described in terms of a certain kernel function, which in this case is just the kernel function of the process $\Fc.$

Of course, there are much easier ways to sample posteriors for low-dimensional Gaussian processes. Nevertheless, it's interesting to notice a relationship between Bayesian inference and gradient descent methods at large, since the latter tend to apply to situations where direct estimation of a posterior distribution is not practical. Furthermore, the results of Jacot suggest that kernel-based interpretations may also hold for large, non-trivial neural networks. To the extent that this is true, we can use the kernel interpretation to reason about many sorts of neural network phenomena, including the benefits of early stopping, the existence of "implicit regularization", and the fact that overparameterization often *increases* performance despite the apparent risk of overfitting.

In this post, we will focus exclusively on the toy problem introduced above. Our discussion (which admittedly takes a bit of a scenic route) is divided into three sections.

- First, we will discover how the covariance kernel of the process $\Fc$ is related to the dynamics of gradient descent. (This is a toy case from Jacot's paper.)
- Next, we recall the theory of reproducing kernel Hilbert spaces and show how the kernel-based behavior of gradient descent is related to regularization.
- Finally, we recall some special properties of Gaussian processes and explain why regularization is related to the Bayesian interpretation of our trained model. (For more details, see Chapter 6 of Gaussian Processes for Machine Learning.)

Let's begin by considering the effect that a single step of gradient descent has on our function $\bar{f}.$ In general, the differential of a loss can be written as a sum of differentials $d \pi_t$ where $\pi_t$ is the evaluation of $\bar{f}$ at an input $t,$ so by linearity it is enough for us to understand how $\bar{f}$ "responds" to differentials of this form.

In response to $d \pi_t,$ the parameters $\lambda_i$ are assigned differentials $\frac{\partial \pi_t}{\partial \lambda_i} = f_i(t).$ So, gradient descent will increase $\lambda_i$ in proportion to the value of $f_i(t).$ In terms of $\bar{f},$ we find that the value $\bar{f}(s)$ at another input $s$ will increase in proportion to $\Delta_t(s) = \sum_{i = 1}^N f_i(s) f_i(t). \tag{1}$ Note that this expression is independent of the coefficients $\lambda_i.$ This means that the gradient descent step we apply to $\bar{f}$ depends only on our learning rate and the differential of the loss at $\bar{f}.$ In other words, we can view gradient descent as a process happening in the function space of our regression problem.

For large $N$ we get the approximation $\frac{1}{N} \Delta_t(s) \approx E_{f \sim \Fc}[f(s) f(t)]. \tag{2}$ This last expression is familiar in the study of Gaussian processes; it is called the covariance kernel of $\Fc$ and denoted $K(s, t).$ For the Wiener process, the covariance kernel takes the form $K(s, t) = \min(s, t).$ In the following, we'll assume $N$ is large enough for us to make the approximation $(2)$ confidently. Then, we conclude that a request of $d \pi_t$ will cause gradient descent to push $\bar{f}$ in the direction of the function $K(-, t).$
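Approximation $(2)$ is easy to check numerically. The sketch below compares the empirical sum $\frac{1}{N}\Delta_t(s)$ against $\min(s, t)$ over a grid; the grid size, sample count, and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample N Wiener trajectories on a grid...
N, grid_size = 5000, 100
ts = np.linspace(0, 1, grid_size)
dt = ts[1] - ts[0]
F = np.hstack([np.zeros((N, 1)),
               np.cumsum(rng.normal(0, np.sqrt(dt), (N, grid_size - 1)), axis=1)])

# ...and compare the empirical tangent kernel (1/N) sum_i f_i(s) f_i(t)
# with the covariance kernel K(s, t) = min(s, t) at every pair of grid points.
empirical = F.T @ F / N
exact = np.minimum.outer(ts, ts)
print(np.max(np.abs(empirical - exact)))   # shrinks like 1/sqrt(N)
```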

You can get a visual sense of this behavior in the widget below. As above, I've generated $200$ trials of the Wiener process to use as my functions $f_i.$ You can choose the request $d \pi_t$ by clicking on the graph, and your browser will compute the corresponding response $\Delta_t/N.$ For comparison I've also drawn the prediction $K(-, t).$

This is already a significant conclusion. In particular, it means that every step of gradient descent modifies $\bar{f}$ by a linear combination of the functions $K(-, t_i),$ where $t_i$ ranges over the inputs in our training set. Since the linear span of these functions is a certain space of piecewise affine functions, if we initialize $\lambda_i = 0$ and run gradient descent to convergence with *any* reasonable loss function, we should approximately converge to a piecewise affine interpolation of our data points. We've run this experiment in the widget below.
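Here's a sketch of that experiment in code: starting from $\lambda_i = 0$ and running gradient descent on the squared loss, the trained $\bar{f}$ should approach the piecewise affine interpolation of the data, pinned to $0$ at $t = 0$ and constant after the last data point. (The data set, grid, and hyperparameters below are again arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(2)

# Wiener trajectories f_i on a grid, as before, but with a larger N so the
# empirical kernel is close to min(s, t).
N, grid_size = 20000, 256
ts = np.linspace(0, 1, grid_size)
dt = ts[1] - ts[0]
F = np.hstack([np.zeros((N, 1)),
               np.cumsum(rng.normal(0, np.sqrt(dt), (N, grid_size - 1)), axis=1)])

train_idx = np.array([64, 128, 192])         # hypothetical data set
train_t, train_y = ts[train_idx], np.array([0.5, -0.3, 0.2])

lam = np.zeros(N)                            # zero initialization
lr = 2e-5
for _ in range(5000):
    resid = lam @ F[:, train_idx] - train_y
    lam -= lr * 2 * F[:, train_idx] @ resid  # gradient of the squared loss

f_bar = lam @ F

# Piecewise affine interpolant through (0, 0) and the data, constant at the end.
piecewise = np.interp(ts, np.concatenate([[0], train_t]),
                      np.concatenate([[0], train_y]))
print(np.max(np.abs(f_bar - piecewise)))     # small, shrinking as N grows
```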

This new model has less variance than before. In fact, in the large $N$ limit, its behavior will be exactly deterministic! In comparison, our previous model will *always* exhibit variance due to initialization of the functions $f_i$ even for large $N.$ In other words, when given a finite training set, gradient descent cannot entirely "forget" its initialization even when run to convergence.

Another important conclusion is that, when we optimize least squares with gradient descent, the evolution of $\bar{f}$ is linear in the sense of approximately obeying a linear ODE. Indeed, for data points $(t_i, y_i)$ our loss differential will be $\frac{1}{2} d \sum_i (\bar{f}(t_i) - y_i)^2 = \sum_i d \pi_{t_i} (\bar{f}(t_i) - y_i).$ So, if we view $\bar{f}$ as evolving under the flow of gradient descent with respect to a continuous parameter $\tau,$ we have $\frac{d \bar{f}}{d \tau} = \sum_i K(-, t_i) (y_i - \bar{f}(t_i)),$ where the right-hand side is a linear function of the empirical error vector $h = (y_i - \bar{f}(t_i)).$ Restricting this equation to the evaluation of $\bar{f}$ over the input points $t_i,$ we find that the error vector $h$ solves the ODE $\frac{d h}{d \tau} = -K h$ where $K$ is the matrix $[K(t_i, t_j)].$ Since this matrix is positive-definite—it is the covariance matrix of the process at the distinct inputs $t_i$—we conclude that $h$ will converge to $0$ over the training process if our learning rate is sufficiently small. Furthermore, knowing the eigenvalues of $K$ lets us understand the nature of our convergence; gradient descent will "correct" the error $h$ along the components of the eigenbasis for $K$ with largest eigenvalues first, and take longer to correct components with smaller eigenvalues.
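We can verify this picture numerically by comparing discrete gradient descent steps on $h$ with the exact solution $h(\tau) = e^{-\tau K} h_0$ of the flow, computed in the eigenbasis of $K.$ The inputs, initial error, and step size below are arbitrary:

```python
import numpy as np

# Error dynamics dh/dτ = -K h for the kernel matrix K = [min(t_i, t_j)].
train_t = np.array([0.2, 0.5, 0.9])
K = np.minimum.outer(train_t, train_t)
eigvals, eigvecs = np.linalg.eigh(K)

h0 = np.array([1.0, -0.5, 0.8])     # an arbitrary initial error vector
tau = 3.0

# Exact solution of the flow: h(τ) = exp(-τK) h0, via the eigenbasis of K.
# Components along large-eigenvalue eigenvectors decay fastest.
h_tau = eigvecs @ (np.exp(-tau * eigvals) * (eigvecs.T @ h0))

# Discrete gradient descent with a small step size approximates the flow:
# eta * steps = tau.
eta, steps = 1e-3, 3000
h = h0.copy()
for _ in range(steps):
    h = h - eta * (K @ h)
print(np.max(np.abs(h - h_tau)))    # small for small eta
```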

Now, we could have chosen a distribution $\Fc$ of random functions with a different covariance kernel $K.$ Here the functions $K(-, t)$ were easy to interpret, but in general, what does it mean to fit a data set using a linear combination of functions like these? One interesting perspective comes from the idea of *regularization*, which we discuss next.

Consider a Hilbert space $H$ equipped with bounded projections $\pi_t \colon H \to \R$ for each $t \in \R,$ and suppose that we want to find an element $v \in H$ that minimizes some loss function depending on the values $\pi_{t}(v)$ for $t$ belonging to some collection $\{t_i\}.$ (Note that elements of $H$ can be viewed as functions from $\R$ to $\R$ by viewing $\pi_t$ as the evaluation at $t.$) If this problem is underdetermined—which necessarily will happen if $H$ is infinite-dimensional and our collection $\{t_i\}$ is finite—then we may ask for an element $v$ that both minimizes our loss and has minimal norm in $H.$ In machine learning, this is called regularization.

Let's write $\Pi$ for the product of the projections $\pi_{t_i}.$ Because $v$ is chosen with minimal norm, it cannot be made smaller by adjusting it by an element of $\ker \Pi,$ so $v$ is orthogonal to $\ker \Pi.$ But since the maps $\pi_{t}$ are continuous, they can be represented by vectors $K_{t}$ in the sense that $\pi_{t}(-) = \langle K_{t}, - \rangle.$ (This is the Riesz representation theorem.) Since $\ker \Pi$ can be described as the orthogonal complement to the set $\{K_{t_i}\},$ the orthogonal complement to $\ker \Pi$ is exactly the closure of the span of the vectors $K_{t_i}.$ We conclude that any regularized solution to our loss function is a (limit of) linear combinations of these vectors.

What are the projections of the "representative elements" $K_{t}$? By definition, we have $\pi_s(K_t) = \langle K_s, K_t \rangle$ for any $s \in \R.$ This last expression is a positive semidefinite kernel, which we will denote $K(s, t).$ In other words, the norm on $H$ and the projections $\pi_t$ work together to produce a kernel function $K$ whose partial evaluations $K(-, t)$ help us solve optimization problems regularized by the norm of $H.$

In the literature, a Hilbert space equipped with bounded projections indexed over a set $I$ is called a reproducing kernel Hilbert space (or RKHS). In fact, we can also go in the other direction: every positive definite kernel on $I$ is "reproduced" by some RKHS, which also turns out to be unique in a certain sense. This is known as the Moore-Aronszajn theorem.

What RKHS corresponds to our kernel $K(s, t) = \min(s, t)$? In general, determining the RKHS of a kernel is not entirely straightforward. In fact, notice that for positive definite kernels over a finite set $I,$ the inner product for the RKHS expressed in the dual basis for our projections turns out to be the *inverse* of the matrix encoded by our kernel. Indeed, where $K_i$ are representatives of the projections $\pi_i$ and $e_i$ is a dual basis satisfying $\pi_i(e_j) = \delta_{i, j},$ we find that
$\sum_j \langle e_i, e_j \rangle K(j, k) = \sum_j \langle e_i, e_j \rangle \pi_j(K_k) = \langle e_i, K_k \rangle = \pi_k(e_i) = \delta_{i, k},$
using the expansion $K_k = \sum_j \pi_j(K_k) e_j$ in the middle step.
So, interpreting a RKHS norm in terms of projections of elements requires solving some sort of inverse problem.
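For a concrete check of this finite-dimensional claim, we can work in coordinates over the basis $\{K_i\},$ where inner products are computed by the Gram matrix $K$ itself and the dual vectors $e_i$ have coefficient vectors given by the columns of $K^{-1}.$ The kernel matrix below is just a random positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(3)

# A generic positive definite kernel matrix on a finite set of size n.
n = 4
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)

K_inv = np.linalg.inv(K)
E = K_inv              # columns: coefficients of the dual vectors e_i

# Duality check: pi_i(e_j) = <K_i, e_j> = (e_i)^T K (K_inv e_j) = delta_ij.
proj = K @ E
print(np.allclose(proj, np.eye(n)))          # True

# Gram matrix of the dual basis: <e_i, e_j> = (K_inv e_i)^T K (K_inv e_j),
# which is exactly the inverse kernel matrix.
gram_dual = E.T @ K @ E
print(np.allclose(gram_dual, K_inv))         # True
```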

The RKHS for a centered Gaussian process $(X_t)$ can be viewed as an isometric embedding of the observables $X_t$ with respect to the $L^2$ norm for the process measure $\Fc.$ Specifically, if we define $f(X_t) = K_t,$ then clearly $\langle X_s, X_t \rangle_{L^2(\Fc)} = \langle f(X_s), f(X_t) \rangle_\text{RKHS}.$ Indeed, the space of observables of a Gaussian process is already a RKHS for its covariance kernel, if we take the projections to be the maps $\langle X_t, - \rangle.$ However, we would like to view the RKHS more directly as a space of functions.

We may begin by observing that observables of the Wiener process can be isometrically mapped into $L^2 [0, \infty)$ by sending $X_t$ to the indicator function $K_t(s) = \begin{cases} 1 : s \le t \\ 0 : \text{otherwise.} \end{cases}$ Under this perspective, our projections become $\pi_t(f) = \langle K_t, f \rangle = \int_0^t f(s) \, ds.$ Ultimately, we are led to view the RKHS of the Wiener process as the Sobolev space of absolutely continuous functions $f \colon [0, \infty) \to \R$ such that $f(0) = 0$ and such that the norm $\lVert f \rVert = \left( \int_0^\infty (f'(t))^2 \, dt \right)^{1/2}$ is finite. In fact, solving the regularized interpolation problem $\begin{align*} \textbf{minimize} & \quad \int_0^\infty (f'(t))^2 \, dt \\ \textbf{subject to} & \quad f(0) = 0, \; f(t_i) = y_i \; \text{for all } i \end{align*}$ results in the piecewise affine interpolations we observed in the widget above.
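By the representer argument from the previous section, this minimization reduces to a small linear system: the solution is $f(t) = \sum_i \alpha_i \min(t, t_i)$ with $\alpha = M^{-1} y$ for $M = [\min(t_i, t_j)].$ A sketch with arbitrary data:

```python
import numpy as np

# Solve: minimize the integral of (f')^2 subject to f(0) = 0, f(t_i) = y_i.
# By the representer argument, f(t) = sum_i alpha_i * min(t, t_i), with the
# coefficients chosen to satisfy the constraints.
train_t = np.array([0.2, 0.5, 0.8])          # hypothetical data
train_y = np.array([0.3, -0.2, 0.4])

M = np.minimum.outer(train_t, train_t)
alpha = np.linalg.solve(M, train_y)

ts = np.linspace(0, 1, 501)
f = np.minimum.outer(ts, train_t) @ alpha

# The result is the piecewise affine interpolant through (0, 0) and the data,
# constant after the last data point.
expected = np.interp(ts, np.concatenate([[0], train_t]),
                     np.concatenate([[0], train_y]))
print(np.max(np.abs(f - expected)))          # ~ 0
```

Both functions are piecewise affine with kinks only at the $t_i,$ vanish at $0,$ and agree at the data points, so they coincide exactly.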

So far, we have shown that its relationship with kernel functions gives gradient descent a distinct flavor of *implicit regularization*. We did not have a penalty function in mind when we set up our problem, but our distribution of random functions ended up making our gradient updates interpretable in terms of a RKHS for an associated kernel function. In the last section of this post, we address how this fact is related to the statistical idea of a conditional distribution for a Gaussian process.

When $X$ and $Y$ are jointly Gaussian distributed, we know that the remainder $Y - E(Y|X)$ of the conditional expectation is independent from $X.$ So, we can decompose $Y$ into two components $Y = (Y - E(Y|X)) + E(Y|X),$ the first being a Gaussian variable independent from $X$ and the second being $X$-measurable. This clarifies the nature of the conditional distribution of $Y$ on $X = x$: it will have constant variance equal to the variance of $Y - E(Y|X)$ and mean equal to $E(Y|X),$ a linear function of $X.$ In particular, if we want to sample the conditional distribution of $Y$ given $X = x,$ we could take $Y + E(Y|X = x) - E(Y|X) = Y + E(Y|X = x - X),$ where the apparently nonsensical conditional expectation on the RHS should be interpreted as the evaluation of the conditional expectation $E(Y|X),$ viewed as a function of $X,$ at $x - X.$ Keep in mind that this is a very special property of Gaussian distributions; in general, the distribution of the remainder $Y - E(Y|X)$ conditional on $X$ will depend on $X,$ and so we won't be able to sample the conditional distribution under another "counterfactual" value $X = x$ simply by translating a sample of the remainder.
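Here is a quick numerical check of this translation trick for a single centered jointly Gaussian pair, with an arbitrary covariance matrix and counterfactual value $x$:

```python
import numpy as np

rng = np.random.default_rng(4)

# For centered jointly Gaussian (X, Y), the translated sample
# Y' = Y + E(Y|X = x - X) should follow the law of Y conditional on X = x.
# Here E(Y|X) = (Cov(X, Y) / Var(X)) X.
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])       # covariance of (X, Y), an arbitrary choice
a = cov[0, 1] / cov[0, 0]          # slope of the conditional expectation
x = 1.5                            # counterfactual value of X

n = 200_000
X, Y = rng.multivariate_normal([0, 0], cov, size=n).T
Y_shift = Y + a * (x - X)          # translated samples

# Exact conditional law: mean a*x, variance Var(Y) - Cov(X, Y)^2 / Var(X).
print(Y_shift.mean(), a * x)                                  # approx. equal
print(Y_shift.var(), cov[1, 1] - cov[0, 1]**2 / cov[0, 0])    # approx. equal
```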

Now, consider a random function $f$ drawn from a Gaussian distribution $\Fc$ and let $\Pi$ give the values of our trajectory on a finite set of inputs. If we want to produce a sample from the distribution of $f$ conditional on some data $\Pi = \pi,$ we can take $f + E(f | \Pi = \pi - \Pi).$ But, as it turns out, the conditional expectation $E(f|\Pi = \pi^*)$ will be exactly the function in the RKHS of our process that solves the constraint $\Pi = \pi^*$ regularized by the RKHS norm! This explains the Bayesian interpretation of our toy model.
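Putting the pieces together, we can sample the Wiener-process posterior by translating prior trajectories. The data, grid, and sample count below are arbitrary; the grid is chosen so the $t_i$ fall exactly on grid points.

```python
import numpy as np

rng = np.random.default_rng(5)

grid_size = 401
ts = np.linspace(0, 1, grid_size)
dt = ts[1] - ts[0]
train_idx = np.array([80, 200, 320])          # t_i = 0.2, 0.5, 0.8
train_t, train_y = ts[train_idx], np.array([0.3, -0.2, 0.4])

K = np.minimum.outer(train_t, train_t)        # covariance matrix of Pi
V = np.minimum.outer(ts, train_t)             # Cov(f(t), Pi) for every grid t

def posterior_sample():
    # Draw a prior trajectory f with f(0) = 0, then add the conditional
    # expectation E(f | Pi = pi - Pi(f)): a kernel interpolant of the residual.
    f = np.concatenate([[0], np.cumsum(rng.normal(0, np.sqrt(dt), grid_size - 1))])
    resid = train_y - f[train_idx]
    return f + V @ np.linalg.solve(K, resid)

samples = np.array([posterior_sample() for _ in range(2000)])

# Every sample interpolates the data exactly...
print(np.max(np.abs(samples[:, train_idx] - train_y)))   # ~ 0
# ...and between data points the spread matches the Brownian bridge variance.
mid = 140                                     # t = 0.35, between 0.2 and 0.5
bridge_var = (0.35 - 0.2) * (0.5 - 0.35) / (0.5 - 0.2)
print(samples[:, mid].var(), bridge_var)      # approx. equal
```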

One way to understand this is just to write out the expression for $E(f(t)|\Pi)$ at a given value of $t.$ We know that this will be the linear function $\lambda \Pi$ of $\Pi$ uniquely determined by the equation $\Cov(f(t) - \lambda \Pi, \Pi) = 0.$ Where $K = [K(t_i, t_j)]$ is the covariance matrix of $\Pi$ and $v = [\Cov(f(t), f(t_i))]$ gives the covariance of $f(t)$ with the components $f(t_i)$ of $\Pi,$ this equation can be written as $v - \lambda K = 0,$ and $E(f(t) | \Pi) = v K^{-1} \Pi.$ We conclude that the function $t \mapsto E(f(t)|\Pi)$ is a linear combination of the functions $K(t, t_i)$—the coordinates of the vector $v$—with constant coefficients, determined by the constraints that $E(f(t)|\Pi)$ should agree with $\Pi$ at the points $t = t_i.$ But, as we saw above, this is the same as the solution to the problem of interpolating some constraints $f(t_i) = y_i$ regularized by the RKHS norm of our process.

To see this connection more directly, remember that the mean of a Gaussian distribution coincides with the mode—the point of highest probability density under linear coordinates. So, for example, $E(f(t)|\Pi = \pi^*)$ is exactly the value of $f(t)$ that maximizes $\ln p(f(t), \pi^*) = C - \frac{1}{2} \begin{bmatrix} f(t) & \pi^*(1) & \dots & \pi^*(m) \end{bmatrix} K^{-1} \begin{bmatrix} f(t) \\ \pi^*(1) \\ \vdots \\ \pi^*(m) \end{bmatrix},$ where $K^{-1}$ is the inverse of the covariance matrix for $(f(t), \Pi).$ But from the previous section we know that $K^{-1}$ expresses the inner product of the RKHS derived from the covariance kernel of $(f(t), \Pi)$ in the dual basis for the projections. So in fact we are asking for the value of $(f(t), \Pi)$ that satisfies the constraint $\Pi = \pi^*$ and is regularized by the RKHS norm corresponding to a restriction of the covariance kernel of our process, which by the representer theorem will be a linear combination of restrictions of functions $K(-, t).$

Abstractly, whenever we have a positive definite kernel $K \colon I \times I \to \R$ with RKHS $H$ and a finite subset $J \subseteq I,$ we get a natural projection $P \colon H \to H_J$ onto the RKHS for the kernel restricted to $J \times J$ given by $P(v) = \sum_j \pi_j(v) K_j.$ Given a vector $v \in H,$ what is the norm of $P(v)$ in $H_J$? Since $P$ is an isometry over the span of the elements $K_j,$ we can view $\lVert P(v) \rVert_{H_J}$ as the minimum possible norm for an element $w \in H$ solving the equation $P(w) = P(v).$ In particular, solving a regularized problem over $H$ that depends on projections $\pi_j$ for $j \in J$ and then restricting the solution to $H_J$ is the same as restricting to $H_J$ and solving the regularized problem with respect to the norm on $H_J.$

As a final remark, note that we can informally imagine the RKHS of a Gaussian process as specifying the "energy" of the process in a statistical mechanics sense; although the norm of the RKHS is not defined over the same function space that the process takes values, we get the energy for the joint distribution of any finite projection $(f(t_1), \dots, f(t_m))$ as a function of $(y_1, \dots, y_m)$ by solving $\begin{align*} \textbf{minimize} & \quad \lVert f \rVert_H \\ \textbf{subject to} & \quad f(t_i) = y_i. \end{align*}$ This is the most satisfactory way that I've found to connect the interpretation of kernel functions in terms of regularization with their interpretation in terms of conditional expectations of a Gaussian process.