A Markov decision process is, from a certain point of view, just a Markov process whose transition probabilities are controllable. More specifically, they depend on what "action" we decide to take as a function of the system's state. The overall decision function is called a policy. From this point of view, a natural question to ask is: how can we differentiate a measure of success at achieving a goal with respect to our policy? If we can answer this question, we have at least one way to optimize our decision-making: empirically estimate the gradient of that measure of success with respect to the policy and hill-climb along it.
In reinforcement learning, the custom is to define reward as a function of action/state pairs, and then use expected reward (integrated over time in some way) as an objective to maximize. However, it is more or less equivalent to define reward as a function of state only. Thus, where $P$ is the kernel of the process—we will assume we have only finitely many states and thus treat $P$ as a matrix—we may think of the standard time-discounted expected reward as an expression

$$J = \mathop{\mathbb{E}}_{s_0 \sim \mu}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t)\right],$$

where $\gamma \in [0,1)$ is a time discount factor, $r$ is a covector giving state-wise "rewards", and $\mu$ is the initial distribution of the process. Thanks to our state-wise reward formulation, this expression simplifies considerably: the time-$t$ term is just $r$ paired against the time-$t$ state distribution $P^t \mu$, so

$$J = \sum_{t=0}^{\infty} \gamma^t\, r P^t \mu = r\,(I - \gamma P)^{-1}\,\mu.$$

(This series converges because $\gamma P$ is a contraction.) Differentiating $J$ with respect to $P$ is now just a question of differentiating the matrix inverse.
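As a concrete sanity check, here is a small NumPy sketch of my own (an arbitrary three-state chain, nothing more) comparing the closed form $r\,(I - \gamma P)^{-1}\,\mu$ against a truncated version of the series above.

```python
import numpy as np

# A minimal numerical check of the closed form J = r (I - gamma P)^{-1} mu.
# The names (P, r, mu, gamma) follow the text; the particular chain below
# is an arbitrary three-state example chosen purely for illustration.

rng = np.random.default_rng(0)

n = 3                                   # number of states
P = rng.random((n, n))
P /= P.sum(axis=0, keepdims=True)       # column-stochastic: column s is the distribution of the next state
r = rng.random(n)                       # state-wise rewards, used as a row covector
mu = np.full(n, 1.0 / n)                # uniform initial distribution (column vector)
gamma = 0.9

# Closed form: J = r (I - gamma P)^{-1} mu
M = np.linalg.inv(np.eye(n) - gamma * P)
J_closed = r @ M @ mu

# Truncated Neumann series: J ~ sum_t gamma^t r P^t mu
J_series = sum(gamma**t * r @ np.linalg.matrix_power(P, t) @ mu for t in range(500))

print(J_closed, J_series)               # agree up to truncation error
```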
Maybe you've seen the derivative of the matrix inverse before. Certainly you've seen the derivative of the inverse for $1 \times 1$ matrices:

$$d\left(x^{-1}\right) = -x^{-2}\,dx.$$

The same idea works in any commutative ring. However, non-commutativity of matrix multiplication gives things a slight twist.
It's worth mentioning that we can proceed directly by throwing a formal infinitesimal $\epsilon$ (with $\epsilon^2 = 0$) into the power series for the inverse:

$$\left(I - (X + \epsilon Y)\right)^{-1} = \sum_{t \ge 0} (X + \epsilon Y)^t = \sum_{t \ge 0} X^t + \epsilon \sum_{i,\,j \ge 0} X^i\, Y\, X^j = (I - X)^{-1} + \epsilon\,(I - X)^{-1}\, Y\, (I - X)^{-1}.$$

(The formal infinitesimal approach is still the only way I'm comfortable differentiating noncommutative series like this.) An even simpler approach is to define $D_A(B)$ to be the derivative of the inverse at $A$ in the direction $B$, in the sense that

$$(A + \epsilon B)^{-1} = A^{-1} + \epsilon\, D_A(B) + O(\epsilon^2);$$

differentiating the relation $A A^{-1} = I$ with respect to $A$ in the direction of $B$ gives

$$B\, A^{-1} + A\, D_A(B) = 0,$$

from whence

$$D_A(B) = -A^{-1} B A^{-1},$$

in agreement with the above. Yet another approach—perhaps the most obvious in retrospect—is to note that the matrix inverse has a simple derivative at the identity ($(I + \epsilon X)^{-1} = I - \epsilon X$ to first order), and then use that inversion is an antihomomorphism to transport this information:

$$(A + \epsilon B)^{-1} = \left(A\,(I + \epsilon A^{-1} B)\right)^{-1} = \left(I + \epsilon A^{-1} B\right)^{-1} A^{-1} = \left(I - \epsilon A^{-1} B\right) A^{-1} = A^{-1} - \epsilon\, A^{-1} B A^{-1}.$$

Whichever way you choose, all this comes straight from the toolbox of matrix algebra.
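For the skeptical, here is a quick finite-difference sketch (again just an illustrative example of mine, with an arbitrary invertible $A$ and direction $B$) confirming $D_A(B) = -A^{-1} B A^{-1}$ numerically.

```python
import numpy as np

# A finite-difference sanity check of the identity
#   d(A^{-1})[B] = -A^{-1} B A^{-1},
# using an arbitrary invertible matrix A and an arbitrary direction B.

rng = np.random.default_rng(1)
n = 4
A = rng.random((n, n)) + n * np.eye(n)   # diagonally dominant, hence invertible
B = rng.random((n, n))                   # direction of perturbation
eps = 1e-6

numerical = (np.linalg.inv(A + eps * B) - np.linalg.inv(A)) / eps
analytic = -np.linalg.inv(A) @ B @ np.linalg.inv(A)

print(np.max(np.abs(numerical - analytic)))   # small, on the order of eps
```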
Let's abbreviate $M = (I - \gamma P)^{-1}$, so that $J = r M \mu$. Using $\nabla$ to denote a gradient with respect to our policy—which we are assuming affects $P$ but not $r$ or $\mu$—we get

$$\nabla J = r\,(\nabla M)\,\mu = \gamma\,(r M)\,(\nabla P)\,(M \mu).$$

What is the meaning of $r M$? The coordinates of this covector tell us the expected discounted reward we would get if the process were started at a given state. What about $M \mu$? This is the expected state vector—again, weighted by an exponential discount over time. All that remains is to determine $\nabla P$, which should be a simple function of our change in policy. Compare this expression with the policy gradient theorem from a reference like Sutton's book on reinforcement learning, and you'll find essentially the same thing.
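To make the formula concrete, here is a rough NumPy sketch of my own: a toy two-action MDP with a softmax policy, in which the directional derivative computed as $\gamma\,(r M)(\nabla P)(M \mu)$ is checked against a finite difference of $J$. The two-action setup, the parameterization $\theta$, and the helper names are assumptions made for illustration, not anything fixed by the discussion above.

```python
import numpy as np

# Sketch: check  dJ = gamma * (r M) (dP) (M mu)  against a finite difference
# of J, for a small MDP with a softmax policy. Everything here (the two-action
# MDP, theta, the helper names) is an illustrative choice of mine.

rng = np.random.default_rng(2)
n_states, n_actions = 3, 2
gamma = 0.9

# One column-stochastic transition matrix per action: P_a[a, s', s] = Pr(s' | s, a)
P_a = rng.random((n_actions, n_states, n_states))
P_a /= P_a.sum(axis=1, keepdims=True)

r = rng.random(n_states)                 # reward covector
mu = np.full(n_states, 1.0 / n_states)   # initial distribution

def policy(theta):
    """Softmax over actions, independently for each state: pi[a, s]."""
    z = np.exp(theta - theta.max(axis=0, keepdims=True))
    return z / z.sum(axis=0, keepdims=True)

def transition(theta):
    """P[s', s] = sum_a pi(a | s) P_a[a, s', s]."""
    return np.einsum('as,ats->ts', policy(theta), P_a)

def J(theta):
    P = transition(theta)
    M = np.linalg.inv(np.eye(n_states) - gamma * P)
    return r @ M @ mu

theta = rng.random((n_actions, n_states))
dtheta = rng.random((n_actions, n_states))   # an arbitrary direction in policy space

# Analytic directional derivative via dJ = gamma * (r M) dP (M mu)
pi = policy(theta)
dpi = pi * (dtheta - (pi * dtheta).sum(axis=0, keepdims=True))   # softmax Jacobian applied to dtheta
dP = np.einsum('as,ats->ts', dpi, P_a)                           # P is linear in pi
M = np.linalg.inv(np.eye(n_states) - gamma * transition(theta))
dJ_analytic = gamma * (r @ M) @ dP @ (M @ mu)

# Central finite-difference check
eps = 1e-6
dJ_numeric = (J(theta + eps * dtheta) - J(theta - eps * dtheta)) / (2 * eps)

print(dJ_analytic, dJ_numeric)   # should closely agree
```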