2023-04-13

A *Markov decision process* is, from a certain point of view, just a Markov process whose transition probabilities are *controllable*. More specifically, they depend on what "action" we decide to take as a function of the system's state. The overall decision function is called a *policy*. From this point of view, a natural question to ask is: how can we differentiate a measure of success at achieving a goal with respect to our policy? If we can answer this question, we have at least one way to optimize our decision-making: empirically estimate the gradient of that measure with respect to our policy and hill-climb along it.

In reinforcement learning, the custom is to define *reward* as a function of action/state pairs, and then use expected reward (integrated over time in some way) as an objective to maximize. However, it is more or less equivalent to define reward as a function of only state. Thus, where $A$ is the kernel of the process—we will assume we have only finitely many states and thus treat $A$ as a matrix—we may think of the standard *time-discounted expected reward* as an expression
$J(A) = \sum_{k = 0}^\infty \gamma^k \langle v , A^k s_0 \rangle$
where $\gamma \in (0, 1)$ is a time discount factor, $v$ is a vector giving state-wise "rewards", and $s_0$ is the initial distribution of the process. Thanks to our state-wise reward formulation, this expression simplifies considerably:
$\sum_{k = 0}^\infty \gamma^k \langle v , A^k s_0 \rangle = \left\langle v, \sum_{k = 0}^\infty \gamma^k A^k s_0 \right\rangle = \langle v, (I - \gamma A)^{-1} s_0\rangle.$
(This series converges because $\gamma A$ is a contraction.) Differentiating with respect to $A$ is now just a question of differentiating the matrix inverse.
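As a quick sanity check, here is a NumPy sketch comparing a truncated version of the series against the closed form $\langle v, (I - \gamma A)^{-1} s_0 \rangle$. The particular kernel, rewards, and discount factor are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Column-stochastic transition matrix: A[i, j] = P(next state i | current state j),
# so A acts on distributions s as A @ s.
A = rng.random((n, n))
A /= A.sum(axis=0)

gamma = 0.9
v = rng.random(n)        # state-wise rewards
s0 = np.full(n, 1 / n)   # uniform initial distribution

# Truncated series: sum_k gamma^k <v, A^k s0>
s, total = s0.copy(), 0.0
for k in range(1000):
    total += gamma**k * (v @ s)
    s = A @ s

# Closed form: <v, (I - gamma A)^{-1} s0>
closed = v @ np.linalg.solve(np.eye(n) - gamma * A, s0)
print(np.isclose(total, closed))  # → True
```

The truncation error is on the order of $\gamma^{1000}$, far below floating-point precision, so the two agree to machine accuracy.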

Maybe you've seen the derivative of the matrix inverse before. Certainly you've seen the derivative of the inverse for $1 \times 1$ matrices: $\frac{d}{dx}x^{-1} = -x^{-2}.$ The same idea works in any commutative ring. However, non-commutativity of matrix multiplication gives things a slight twist.

It's worth mentioning that we can proceed directly by throwing a formal infinitesimal into the power series for $(I - A)^{-1}$: $\begin{align*} (I - (A + \epsilon B))^{-1} & = \sum_{t = 0}^{\infty} (A + \epsilon B)^t = (I - A)^{-1} + \epsilon \sum_{t = 0}^\infty \sum_{k = 0}^{t - 1} A^k B A^{t - k - 1}\\ & = (I - A)^{-1} + \epsilon \sum_{i, j \ge 0} A^i B A^j \\ & = (I - A)^{-1} + \epsilon (I - A)^{-1} B (I - A)^{-1}. \end{align*}$ (The formal infinitesimal approach is still the only way I'm comfortable differentiating noncommutative series like this.)

An even simpler approach is to define $F(A, -)$ to be the derivative of $A^{-1}$ with respect to $A$, in the sense that $F(A, B) = \frac{d}{dt}\Big|_{t = 0}(A + t B)^{-1}.$ Differentiating the relation $I = A A^{-1}$ with respect to $A$ in the direction of $B$ gives $0 = B A^{-1} + AF(A, B),$ whence $F(A, B) = -A^{-1} B A^{-1}$ in agreement with the above.

Yet another approach—perhaps the most obvious in retrospect—is to note that the matrix inverse has a simple derivative at the identity, and then use the fact that $(-)^{-1}$ is an antihomomorphism to transport this information: $\begin{align*} (A + \epsilon B)^{-1} & = (A (I + \epsilon A^{-1} B))^{-1} \\ & = (I + \epsilon A^{-1} B)^{-1} A^{-1} \\ & = (I - \epsilon A^{-1} B) A^{-1}. \end{align*}$ Whichever way you choose, all this comes straight from the toolbox of matrix algebra.
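Whichever derivation you prefer, the formula $F(A, B) = -A^{-1} B A^{-1}$ is easy to check numerically. A small NumPy sketch, with an arbitrary invertible $A$ and direction $B$ chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = np.eye(n) + 0.1 * rng.random((n, n))  # a perturbation of I, safely invertible
B = rng.random((n, n))                    # an arbitrary direction

Ainv = np.linalg.inv(A)
analytic = -Ainv @ B @ Ainv  # F(A, B) = -A^{-1} B A^{-1}

# Forward finite difference of t -> (A + t B)^{-1} at t = 0
t = 1e-6
numeric = (np.linalg.inv(A + t * B) - Ainv) / t

print(np.allclose(analytic, numeric, atol=1e-4))  # → True
```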

Let's abbreviate $N(A) = (I - \gamma A)^{-1}$, so that $J(A) = \langle v, N(A) s_0 \rangle$. Using $\nabla$ to denote a gradient with respect to our policy—which we are assuming affects $A$ but not $v$ or $s_0$—we get $\nabla J(A) = \nabla \langle v, N(A) s_0 \rangle = \gamma \langle v, N(A) (\nabla A) N(A) s_0 \rangle.$ What is the meaning of $\langle v, N(A) -\rangle = \langle N(A)^T v, - \rangle$? The coordinates of this covector tell us the expected discounted reward we would get if the process were started at a given state. What about $N(A) s_0$? This is the expected state vector—again, weighted by an exponential discount over time. All that remains is to determine $\nabla A$, which should be a simple function of our change in policy. Compare this expression with the policy gradient theorem from a reference like Sutton and Barto's book on reinforcement learning, and you'll find essentially the same thing.
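This gradient formula can also be checked against a finite difference. A hedged NumPy sketch, with a made-up kernel and an arbitrary perturbation matrix standing in for $\nabla A$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.random((n, n))
A /= A.sum(axis=0)       # column-stochastic kernel
gamma = 0.9
v = rng.random(n)        # state-wise rewards
s0 = np.full(n, 1 / n)   # uniform initial distribution

def J(A):
    # J(A) = <v, N(A) s0> with N(A) = (I - gamma A)^{-1}
    return v @ np.linalg.solve(np.eye(n) - gamma * A, s0)

N = np.linalg.inv(np.eye(n) - gamma * A)
dA = rng.standard_normal((n, n))  # arbitrary direction standing in for the policy's effect on A
analytic = gamma * (v @ N @ dA @ (N @ s0))  # gamma <v, N (dA) N s0>

# Central difference of t -> J(A + t dA) at t = 0
t = 1e-5
numeric = (J(A + t * dA) - J(A - t * dA)) / (2 * t)
print(np.allclose(analytic, numeric, atol=1e-4))  # → True
```

Note that $J$ is perfectly well defined on non-stochastic matrices near $A$, so perturbing $A$ in an arbitrary direction is legitimate for checking the derivative.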