2023-04-13

The Policy Gradient Theorem

A Markov decision process is, from a certain point of view, just a Markov process whose transition probabilities are controllable. More specifically, they depend on what "action" we decide to take as a function of the system's state. The overall decision function is called a policy. From this point of view, a natural question to ask is: how can we differentiate a measure of success at achieving a goal with respect to our policy? If we can answer this question, we have at least one way to optimize our decision-making: empirically estimate the gradient of our measure of success with respect to the policy and hill-climb along it.

In reinforcement learning, the custom is to define reward as a function of action/state pairs, and then use expected reward (integrated over time in some way) as an objective to maximize. However, it is more or less equivalent to define reward as a function of only state. Thus, where $A$ is the kernel of the process (we will assume we have only finitely many states and thus treat $A$ as a matrix), we may think of the standard time-discounted expected reward as an expression

$$J(A) = \sum_{k = 0}^\infty \gamma^k \langle v, A^k s_0 \rangle$$

where $\gamma \in (0, 1)$ is a time discount factor, $v$ is a vector giving state-wise "rewards", and $s_0$ is the initial distribution of the process. Thanks to our state-wise reward formulation, this expression simplifies considerably:

$$\sum_{k = 0}^\infty \gamma^k \langle v, A^k s_0 \rangle = \left\langle v, \sum_{k = 0}^\infty \gamma^k A^k s_0 \right\rangle = \langle v, (I - \gamma A)^{-1} s_0 \rangle.$$

(This series converges because $\gamma A$ is a contraction.) Differentiating with respect to $A$ is now just a question of differentiating the matrix inverse.
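To make the closed form concrete, here is a small numpy sketch (the state count, rewards, and discount below are arbitrary toy choices) comparing a truncated version of the series against $\langle v, (I - \gamma A)^{-1} s_0 \rangle$ on a random chain.

```python
# Numerical sanity check of J(A) = <v, (I - γA)^{-1} s_0> on a toy chain.
# All of the numbers here (state count, rewards, discount) are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 4            # number of states
gamma = 0.9      # time discount factor in (0, 1)

# Column-stochastic transition matrix: A[i, j] = P(next state i | current state j),
# so that A @ s pushes a distribution s forward one step, matching A^k s_0 above.
A = rng.random((n, n))
A /= A.sum(axis=0, keepdims=True)

v = rng.random(n)             # state-wise rewards
s0 = np.full(n, 1.0 / n)      # initial distribution over states

# Truncated series sum_k γ^k <v, A^k s_0>.
J_series, s = 0.0, s0.copy()
for k in range(500):
    J_series += gamma**k * (v @ s)
    s = A @ s

# Closed form <v, (I - γA)^{-1} s_0>.
J_closed = v @ np.linalg.solve(np.eye(n) - gamma * A, s0)

print(J_series, J_closed)     # agree up to the truncation error
```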

Maybe you've seen the derivative of the matrix inverse before. Certainly you've seen the derivative of the inverse for $1 \times 1$ matrices:

$$\frac{d}{dx} x^{-1} = -x^{-2}.$$

The same idea works in any commutative ring. However, non-commutativity of matrix multiplication gives things a slight twist.

It's worth mentioning that we can proceed directly by throwing a formal infinitesimal into the power series for $(I - A)^{-1}$:

$$\begin{align*}
(I - (A + \epsilon B))^{-1} & = \sum_{t = 0}^{\infty} (A + \epsilon B)^t = (I - A)^{-1} + \epsilon \sum_{t = 0}^\infty \sum_{k = 0}^{t - 1} A^k B A^{t - k - 1} \\
& = (I - A)^{-1} + \epsilon \sum_{i, j \ge 0} A^i B A^j \\
& = (I - A)^{-1} + \epsilon (I - A)^{-1} B (I - A)^{-1}.
\end{align*}$$

(The formal infinitesimal approach is still the only way I'm comfortable differentiating noncommutative series like this.) An even simpler approach is to define $F(A, -)$ to be the derivative of $A^{-1}$ with respect to $A$, in the sense that

$$F(A, B) = \frac{d}{dt}\bigg|_{t = 0} (A + t B)^{-1}.$$

Differentiating the relation $I = A A^{-1}$ with respect to $A$ in the direction of $B$ gives

$$0 = B A^{-1} + A F(A, B),$$

whence $F(A, B) = -A^{-1} B A^{-1}$, in agreement with the above. Yet another approach (perhaps the most obvious in retrospect) is to note that the matrix inverse has a simple derivative at the identity, and then use the fact that $(-)^{-1}$ is an antihomomorphism to transport this information:

$$\begin{align*}
(A + \epsilon B)^{-1} & = (A (I + \epsilon A^{-1} B))^{-1} \\
& = (I + \epsilon A^{-1} B)^{-1} A^{-1} \\
& = (I - \epsilon A^{-1} B) A^{-1}.
\end{align*}$$

Whichever way you choose, all this comes straight from the toolbox of matrix algebra.
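If you'd rather trust the computer than the algebra, here's a quick finite-difference check of $F(A, B) = -A^{-1} B A^{-1}$ on arbitrary random matrices.

```python
# Finite-difference check that the derivative of the matrix inverse in the
# direction B is F(A, B) = -A^{-1} B A^{-1}. The matrices are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.random((n, n)) + n * np.eye(n)   # diagonally dominant, so safely invertible
B = rng.random((n, n))                   # an arbitrary direction

A_inv = np.linalg.inv(A)
F_exact = -A_inv @ B @ A_inv             # the formula derived above

t = 1e-6                                 # central difference in the direction B
F_numeric = (np.linalg.inv(A + t * B) - np.linalg.inv(A - t * B)) / (2 * t)

print(np.max(np.abs(F_exact - F_numeric)))   # close to zero
```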

Let's abbreviate $N(A) = (I - \gamma A)^{-1}$, so that $J(A) = \langle v, N(A) s_0 \rangle$. Using $\nabla$ to denote a gradient with respect to our policy (which we are assuming affects $A$ but not $v$ or $s_0$), we get

$$\nabla J(A) = \nabla \langle v, N(A) s_0 \rangle = \gamma \langle v, N(A) (\nabla A) N(A) s_0 \rangle.$$

What is the meaning of $\langle v, N(A)\, - \rangle = \langle N(A)^T v, - \rangle$? The coordinates of this covector tell us the expected discounted reward we would get if the process were started at a given state. What about $N(A) s_0$? This is the expected state vector, again weighted by an exponential discount over time. All that remains is to determine $\nabla A$, which should be a simple function of our change in policy. Compare this expression with the policy gradient theorem from a reference like Sutton's book on reinforcement learning, and you'll find essentially the same thing.
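Written out entrywise, the directional derivative above says that the gradient of $J$ with respect to the entries of $A$ is the rank-one matrix $\gamma\, (N(A)^T v)(N(A) s_0)^T$. Here's a small numpy check of that against finite differences, reusing the kind of arbitrary toy data from before. (The perturbed matrices are no longer exactly stochastic, which doesn't matter for checking a derivative.)

```python
# Check that ∂J/∂A_{ij} = γ (N^T v)_i (N s_0)_j, i.e. grad_A J = γ (N^T v)(N s_0)^T,
# where N = (I - γA)^{-1}. Toy data as before; all numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 4, 0.9
A = rng.random((n, n))
A /= A.sum(axis=0, keepdims=True)        # column-stochastic transition matrix
v = rng.random(n)                        # state-wise rewards
s0 = np.full(n, 1.0 / n)                 # initial distribution

def J(A):
    # J(A) = <v, (I - γA)^{-1} s_0>
    return v @ np.linalg.solve(np.eye(n) - gamma * A, s0)

N = np.linalg.inv(np.eye(n) - gamma * A)
grad_exact = gamma * np.outer(N.T @ v, N @ s0)

# Central finite differences, one entry of A at a time.
h = 1e-6
grad_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = h
        grad_fd[i, j] = (J(A + E) - J(A - E)) / (2 * h)

print(np.max(np.abs(grad_exact - grad_fd)))   # close to zero
```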
