Last updated May 24, 2017

Place for me to store notes on reinforcement learning, with a focus on the details of the derivations.

This is a draft, and will never be more than a draft. Intended as a reference for me, may not be legible to other people. Only posting publicly in case other people find it useful.

TODO: Add paper links for where I first encountered these proofs, relevant papers, etc.

Table of Contents


Let $x$ be a random variable with known p.d.f. $p_\theta(x)$. (From here on, the distribution $p_\theta$ will be omitted from expectations. In general, it will be specified the first time, and will be implicit all other times.) Let $J(\theta)$ be some cost function defined as

$$J(\theta) = \mathbb{E}_{x \sim p_\theta}\left[f(x)\right]$$

for some arbitrary function $f$ that does not depend on $\theta$.

We want to compute $\nabla_\theta J(\theta)$. Using the log derivative trick $\nabla_\theta p_\theta(x) = p_\theta(x) \nabla_\theta \log p_\theta(x)$, we can change the gradient of the expectation to the expectation of the gradient.

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}\left[f(x)\right] = \mathbb{E}_{x \sim p_\theta}\left[f(x)\, \nabla_\theta \log p_\theta(x)\right]$$

Now that this is an expectation, we can get an approximate gradient by sampling $x \sim p_\theta$.
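As a sanity check, here's a minimal numeric sketch of this estimator. The setup is mine, not from the text: a 1-D Gaussian with learnable mean, and $f(x) = x^2$, chosen so the true gradient is known analytically.

```python
import numpy as np

rng = np.random.default_rng(0)

# p_theta = N(theta, 1), f(x) = x^2.
# Analytically, E[f(x)] = theta^2 + 1, so the true gradient is 2 * theta.
theta = 1.5
samples = rng.normal(loc=theta, scale=1.0, size=500_000)

# Score function of a unit-variance Gaussian: d/dtheta log p(x) = x - theta.
score = samples - theta
grad_estimate = np.mean(samples ** 2 * score)

print(grad_estimate)  # close to 2 * theta = 3.0
```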

So far, none of the derivation is RL specific. Let’s now suppose the random variable is trajectories $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$, generated by policy $\pi_\theta$ in an MDP. Let’s then have $f(\tau) = \sum_t r(s_t, a_t)$ be the reward over the trajectory. $J(\theta)$ is now the expected reward of the policy, and the gradient gives a way to update the policy to increase reward.

Once we assume we’re in an MDP, we get some magic cancellations. The log-probability of a trajectory factorizes as

$$\log p_\theta(\tau) = \log p(s_0) + \sum_t \left( \log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t) \right)$$

Note that the dynamics ($p(s_0)$ and $p(s_{t+1}|s_t, a_t)$) do not depend on $\theta$, so when we compute the gradient, they get cancelled out, leaving

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[ \left( \sum_t r(s_t, a_t) \right) \left( \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \right) \right]$$

Expectation of Score Function is Zero

$$\mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\right] = \int p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx = \int \nabla_\theta p_\theta(x)\, dx = \nabla_\theta \int p_\theta(x)\, dx = \nabla_\theta 1 = 0$$
Any time we can move everything except the score function out of the expectation, we can cancel that term.

The equality above is true for any distribution, but the distribution $\pi_\theta(a_t|s_t)$ is especially helpful for MDPs. By the Markov property, $\pi_\theta(a_t \mid s_0, a_0, \ldots, s_t)$ is the same as $\pi_\theta(a_t \mid s_t)$.

REINFORCE Variance Reduction

Now that we’re married to the MDP framework, there are ways to quickly reduce variance without adding any bias.

Observation 1: When we take an action at timestep $t$, it can only affect the rewards from timesteps $t$ and onward. Assuming finite-horizon episodes of length $T$, let

$$R_t = \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$$

Note $R_0 = f(\tau)$, the total reward of the trajectory.

We’ll use similar notation to denote trajectories: $\tau_{i:j}$ indicates the states, actions, and rewards from time $i$ to time $j$.

Slightly reorder the gradient as the sum of expectations

$$\nabla_\theta J(\theta) = \sum_t \mathbb{E}_{\tau}\left[ R_0\, \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

Then split $R_0$ into the sum of rewards before and after time $t$: $R_0 = (R_0 - R_t) + R_t$.

We can move $R_0 - R_t$ out of the first expectation, since none of those rewards depends on $a_t$. By the earlier lemma, that term vanishes, which gives

$$\nabla_\theta J(\theta) = \sum_t \mathbb{E}_{\tau}\left[ R_t\, \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

(Can show this formally by writing out the conditional expectations, but it gets messy fast.)

Right away, this lets us drop about half the terms from the gradient (from roughly $T^2$ to roughly $T^2/2$.)

Observation 2: We can subtract a baseline function without biasing the gradient, as long as the baseline has certain properties.

Recall that the expectation of the score function is zero.

Let $b(s_t)$ be a baseline function that depends only on $s_t$. Since the expectation is conditioned on $s_t$, we get

$$\mathbb{E}_{a_t \sim \pi_\theta}\left[ b(s_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t) \right] = b(s_t)\, \mathbb{E}_{a_t \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \right] = 0$$

Because of this, we can subtract a state-dependent baseline from the cumulative rewards without changing the expectation.

In practice, people usually try to estimate the value function $V^\pi(s_t)$ and use that as their baseline. Intuitively, this “centers” the scaling of $\nabla_\theta \log \pi_\theta(a_t|s_t)$. If we estimated $V^\pi$ exactly, the scaling $R_t - V^\pi(s_t)$ is positive if and only if the reward for taking action $a_t$ is better than the average.
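Here's a small sketch of the variance reduction. The setup is mine: a single state, a 1-D Gaussian policy, and a reward with a large constant offset, so the numbers are easy to verify by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-state example: Gaussian policy a ~ N(theta, 1), reward r(a) = 10 + a.
# True gradient of E[r(a)] w.r.t. theta is 1 for any theta.
theta = 0.0
actions = rng.normal(loc=theta, scale=1.0, size=(2000, 200))
rewards = 10.0 + actions
score = actions - theta  # d/dtheta log N(a; theta, 1)

baseline = 10.0  # roughly E[r(a)] for this policy, i.e. the value function

# Per-batch gradient estimates, with and without the baseline.
grad_no_baseline = np.mean(rewards * score, axis=1)
grad_baseline = np.mean((rewards - baseline) * score, axis=1)

print(grad_no_baseline.mean(), grad_baseline.mean())  # both near 1.0
print(grad_no_baseline.std() > grad_baseline.std())   # True: lower variance
```

Both estimators have the same expectation, but the un-baselined one has its variance inflated by the constant offset in the reward.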


Bellman Operators

Operators map functions to functions. We can think of each application of an operator as one step of an optimization.

Use $\mathcal{T}$ to denote operators.

Proof of Convergence For Tabular Problems

Let $\mathcal{T}^\pi$ be the Bellman operator for policy $\pi$. Define this as

$$(\mathcal{T}^\pi Q)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot|s,a),\, a' \sim \pi}\left[ Q(s', a') \right]$$

We now prove that if we repeatedly apply $\mathcal{T}^\pi$ to any initial $Q$, we converge to $Q^\pi$.

First, show that $Q^\pi$ is a fixed point: $\mathcal{T}^\pi Q^\pi = Q^\pi$. This follows from the definition of $Q^\pi$.

Now, prove $\mathcal{T}^\pi$ is a contraction. An operator $\mathcal{T}$ is a contraction if

$$d(\mathcal{T} Q_1, \mathcal{T} Q_2) \le c\, d(Q_1, Q_2)$$

for some $c < 1$ and some distance function $d$.

Why do we want to prove this? By the Banach fixed-point theorem, for any contraction,

  • Repeatedly applying the contraction converges to a fixed point.
  • That fixed point is unique.

We use max norm as the distance function: $d(Q_1, Q_2) = \|Q_1 - Q_2\|_\infty = \sup_{s,a} |Q_1(s,a) - Q_2(s,a)|$. In this case, the supremum is over state-action pairs $(s,a)$. For any $(s,a)$,

$$\left| (\mathcal{T}^\pi Q_1)(s,a) - (\mathcal{T}^\pi Q_2)(s,a) \right| = \gamma \left| \mathbb{E}_{s', a'}\left[ Q_1(s',a') - Q_2(s',a') \right] \right| \le \gamma\, \|Q_1 - Q_2\|_\infty$$

Therefore, $\|\mathcal{T}^\pi Q_1 - \mathcal{T}^\pi Q_2\|_\infty \le \gamma\, \|Q_1 - Q_2\|_\infty$, and since $\gamma < 1$, operator $\mathcal{T}^\pi$ is a contraction.

We’ve now shown the Bellman operator $\mathcal{T}^\pi$ converges to $Q^\pi$. Let $\pi^*$ be the optimal policy. What does the Bellman operator for $\pi^*$ look like?

The optimal policy will take the action that maximizes the Q-value, which converts the expectation over $a'$ into a max.

$$(\mathcal{T}^* Q)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\left[ \max_{a'} Q(s', a') \right]$$

This recovers the 1-step return that appears all over the place. Since this is a special case of a Bellman operator, by the earlier argument this converges to $Q^*$.
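The contraction argument can be checked numerically. Below is a sketch on a randomly generated tabular MDP (sizes and seed are arbitrary choices of mine); successive max-norm gaps shrink by at least a factor of $\gamma$, so the iterates converge.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP: 5 states, 3 actions, random dynamics and rewards.
n_s, n_a, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, s']
R = rng.uniform(0, 1, size=(n_s, n_a))            # r(s, a)

def bellman_opt(Q):
    """One application of the Bellman optimality operator T*."""
    return R + gamma * P @ Q.max(axis=1)

# Repeatedly apply T* from an arbitrary starting Q.
Q = np.zeros((n_s, n_a))
gaps = []
for _ in range(200):
    Q_next = bellman_opt(Q)
    gaps.append(np.abs(Q_next - Q).max())
    Q = Q_next

# Each max-norm gap is at most gamma times the previous one,
# so iteration converges to the unique fixed point Q*.
print(gaps[1] / gaps[0], gaps[-1])
```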

In practice, we can never apply the Bellman operator exactly, except for very small problems where we can easily iterate over all $(s,a)$ pairs. Instead, we have parametrized Q-functions $Q_\theta$, then take gradient steps to minimize the TD-error

$$\left( Q_\theta(s,a) - \left( r(s,a) + \gamma \max_{a'} Q_\theta(s', a') \right) \right)^2$$

With parametrized Q-functions, we are no longer guaranteed to converge.

TODO: add soft Q-Learning.

Natural Policy Gradient

Natural policy gradient is a special case of natural gradient, so let’s explain that first.

Suppose we have some objective function $J(\theta)$ that’s differentiable. When minimizing $J$, why do we step in direction $-\nabla_\theta J$? It’s because the direction of steepest descent is the negative gradient. Quick argument follows.

The direction of steepest descent is the $d$ that solves

$$\min_{d\,:\,\|d\| = \epsilon} J(\theta + d)$$

for sufficiently small $\epsilon$.

As $\epsilon \to 0$, the first order Taylor approximation gets more accurate, giving

$$J(\theta + d) \approx J(\theta) + \nabla_\theta J^\top d$$

We can alternatively write the dot product as

$$\nabla_\theta J^\top d = \|\nabla_\theta J\|\, \|d\| \cos\phi$$

where $\phi$ is the angle between the two vectors. Since the gradient norm is constant, this is minimized when $\cos\phi = -1$ and $\|d\| = \epsilon$. This gives step direction $d = -\epsilon\, \nabla_\theta J / \|\nabla_\theta J\|$, and update $\theta' = \theta - \alpha \nabla_\theta J$ for some learning rate $\alpha$.

In this derivation, we implicitly assumed $\|d\|$ was defined as the Euclidean norm. We could alternatively define the norm as

$$\|d\|_A^2 = d^\top A d$$

where $A$ is some symmetric positive definite matrix. The Euclidean norm is a special case of this, where $A$ is the identity matrix.

(Why is it sufficient to consider symmetric positive definite matrices? We care specifically about the quadratic form $d^\top A d$. For any non-symmetric $A$, you get the same function if you use $\frac{1}{2}(A + A^\top)$ instead, so for any quadratic form $d^\top A d$, there exists some equivalent symmetric matrix.)

Suppose we required the step direction to satisfy $\|d\|_A = \epsilon$ instead. Now what’s the optimal step direction? Note that different directions have different maximum Euclidean norms under this constraint, so the direction aligned with $-\nabla_\theta J$ may no longer be optimal.

(Linear algebra interlude because I keep forgetting basic linear algebra.)

Lemma: Every positive definite matrix $A$ is invertible.

Proof: Suppose not. Then there exist $x_1 \neq x_2$ such that $A x_1 = A x_2$. Equivalently, $Ax = 0$ for some non-zero vector $x = x_1 - x_2$. But then $x^\top A x = 0$, contradicting the positive definite definition, so $A$ must be invertible. $\square$

To get the step direction, we solve

$$\min_{d\,:\,d^\top A d = \epsilon^2} \nabla_\theta J^\top d$$

Solve this constrained optimization problem through Lagrange multipliers. To simplify notation, let $g = \nabla_\theta J$. At the solution $d^*$, we must have

$$\nabla_d \left( g^\top d + \lambda \left( d^\top A d - \epsilon^2 \right) \right) = 0$$

Which gives

$$g + 2\lambda A d^* = 0$$

Since $A$ is invertible, this has solution

$$d^* = -\frac{1}{2\lambda} A^{-1} g$$

So up to a scaling constant, the optimal step direction is $-A^{-1} g$.
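A quick numeric sanity check of this solution, with a random metric and gradient of my choosing: among steps of equal $A$-norm, the $-A^{-1}g$ direction decreases the first-order objective at least as much as the rescaled plain gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: pick a random SPD metric A and gradient g.
n = 4
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)   # symmetric positive definite metric
g = rng.normal(size=n)        # gradient of J at the current point

def a_norm(d):
    return np.sqrt(d @ A @ d)

eps = 0.01
# Candidate 1: steepest-descent direction, rescaled to A-norm eps.
d_plain = -g / a_norm(g) * eps
# Candidate 2: the -A^{-1} g direction from the Lagrangian solution.
d_nat = -np.linalg.solve(A, g)
d_nat = d_nat / a_norm(d_nat) * eps

# To first order, J changes by g . d; the natural direction wins.
print(g @ d_nat <= g @ d_plain)  # True
```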

In regular gradient descent, we always take small steps in parameter space, defined by the Euclidean distance in parameter space. Natural gradient argues that we should instead take small steps in the space of probability distributions. We can do this by measuring distance with the Fisher information matrix $F = \mathbb{E}_{x \sim p_\theta}\left[ \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top \right]$, i.e. setting $A = F$.

(I’m pretty sure this argument is wrong in some subtle way, but it’s approximately correct.)

Natural policy gradient is the idea of natural gradient, applied to the policy gradient from REINFORCE. At a high level it’s actually pretty simple.

  1. Decide on some learning rate $\alpha$.
  2. Approximate the inverse Fisher $F^{-1}$ from the data.
  3. Compute the gradient $g$ REINFORCE would have given.
  4. Update with $\theta' = \theta + \alpha F^{-1} g$.

In practice, you need to use conjugate gradients to compute $F^{-1} g$ efficiently - computing $F^{-1}$ explicitly takes $O(n^3)$ time, where $n$ is the number of parameters, and conjugate gradients let you approximate $F^{-1} g$ in linear time.
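A minimal sketch of that trick: hand-rolled conjugate gradient that only ever touches the matrix through matrix-vector products. The small explicit matrix here is purely illustrative; in a real implementation the product would be a Fisher-vector product computed from samples and automatic differentiation, and $F$ would never be formed.

```python
import numpy as np

rng = np.random.default_rng(0)

def conjugate_gradient(mvp, g, iters=50, tol=1e-10):
    """Solve F x = g given only the matrix-vector product mvp(v) = F v."""
    x = np.zeros_like(g)
    r = g - mvp(x)
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        Fp = mvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# Hypothetical small SPD "Fisher" matrix for testing.
n = 10
M = rng.normal(size=(n, n))
F = M @ M.T + np.eye(n)
g = rng.normal(size=n)

x = conjugate_gradient(lambda v: F @ v, g)
print(np.linalg.norm(F @ x - g))  # tiny residual
```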

(This actually gets you 80% of the way to Trust Region Policy Optimization, since at the end of its derivation it reduces to natural policy gradient with an adaptive learning rate to take the largest step it can within the trust region.)

Equivalent Formulations of Policies and Reward

(Adapted from a mix of Generative Adversarial Imitation Learning and Trust-Region Policy Optimization.)

State Visitation / Occupancy Measure

The state visitation frequency $\rho_\pi(s)$ is defined as the discounted sum of probabilities of visiting a given state.

$$\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t\, P(s_t = s \mid \pi)$$

Note that up to a constant factor, you can think of this as a probability distribution over states.

More precisely, multiplying by $(1 - \gamma)$ gives a probability distribution, since $\sum_{t=0}^{\infty} \gamma^t = \frac{1}{1 - \gamma}$.

The occupancy measure $\rho_\pi(s, a)$ is defined similarly, except it’s the discounted sum of probabilities of visiting state-action pairs $(s, a)$, instead of just states $s$.

(I know this is an abuse of notation. I’m sorry.)

The expected reward of policy $\pi$ is

$$\eta(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_t \gamma^t r(s_t, a_t) \right]$$

which can be interpreted as taking the average reward over all trajectories. But alternatively, we can count how often we visit state-action pairs $(s, a)$, and sum over that:

$$\eta(\pi) = \sum_{s,a} \rho_\pi(s, a)\, r(s, a)$$

Argument that works for me - imagine a giant grid of all $(s, a)$ pairs. Whenever you execute an episode, when you encounter $(s_t, a_t)$ at time $t$, add $\gamma^t r(s_t, a_t)$ to that square in the grid. The sum of values in the grid is the reward of the episode. The sum of values in the average grid is the expected reward of $\pi$. But equivalently, we could ask, “What value is in cell $(s, a)$ in the average grid?” And it turns out that cell will have value $\rho_\pi(s, a)\, r(s, a)$.
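The grid argument can be checked on a tiny random MDP (sizes, seed, and fixed start state are arbitrary choices of mine): the Monte-Carlo average grid should match the analytic occupancy measure computed from the transition matrix via a geometric series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny hypothetical MDP: 3 states, 2 actions, fixed policy, finite horizon
# as an approximation of the infinite discounted sum.
n_s, n_a, gamma, horizon = 3, 2, 0.9, 60
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, s']
R = rng.uniform(0, 1, size=(n_s, n_a))            # r(s, a)
pi = rng.dirichlet(np.ones(n_a), size=n_s)        # pi[s, a]

n_episodes = 2000
grid = np.zeros((n_s, n_a))  # average of gamma^t * r(s_t, a_t) per cell
for _ in range(n_episodes):
    s = 0
    for t in range(horizon):
        a = rng.choice(n_a, p=pi[s])
        grid[s, a] += gamma ** t * R[s, a] / n_episodes
        s = rng.choice(n_s, p=P[s, a])

# Analytic occupancy measure: rho(s) via (I - gamma * P_pi)^{-1},
# then rho(s, a) = rho(s) * pi(a|s).
P_pi = np.einsum('sa,sap->sp', pi, P)
d0 = np.zeros(n_s)
d0[0] = 1.0
rho_s = d0 @ np.linalg.inv(np.eye(n_s) - gamma * P_pi)
rho_sa = rho_s[:, None] * pi

print(grid.sum(), (rho_sa * R).sum())  # both estimate the expected return
```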

TODO: add proof of bijection between occupancy measures $\rho_\pi$ and policies $\pi$.

Why bother formulating the problem this way? It gives another way to think about the learning problem. In the trajectory view, you want to update your policy such that episodes that give good reward happen more often. In the state-action view, you want to update such that you visit good $(s, a)$ pairs more often. It can be easier to make arguments in one view than in the other.

Advantage With Respect to Another Policy

The expected return of $\pi'$ can be defined by its expected advantage over $\pi$, which is handy for thinking about the difference between policies.

$$\eta(\pi') = \eta(\pi) + \mathbb{E}_{\tau \sim \pi'}\left[ \sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t) \right]$$

To show this, expand the advantage as $A^\pi(s_t, a_t) = r(s_t, a_t) + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$ inside the sum:

$$\mathbb{E}_{\tau \sim \pi'}\left[ \sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t) \right] = \mathbb{E}_{\tau \sim \pi'}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) + \sum_{t=0}^{\infty} \gamma^{t+1} V^\pi(s_{t+1}) - \sum_{t=0}^{\infty} \gamma^t V^\pi(s_t) \right]$$

At this point, note that the last two summations are identical, except the first starts counting from $t = 1$ and the other starts counting from $t = 0$. This lets us cancel all terms except for $-V^\pi(s_0)$ in the second summation.

Now note that because the initial state is independent of the policy, the expectation of the remaining term changes to $\mathbb{E}_{s_0}\left[ V^\pi(s_0) \right] = \eta(\pi)$. The expectation of the discounted reward sum is $\eta(\pi')$ by definition. Therefore,

$$\mathbb{E}_{\tau \sim \pi'}\left[ \sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t) \right] = \eta(\pi') - \eta(\pi)$$

completing the proof.


Q-Prop

In REINFORCE, you can subtract a state-dependent baseline without biasing the gradient. The Q-Prop paper gives a way to subtract an action-dependent baseline.

The argument is based on control variates, a variance reduction technique. The Wikipedia page is pretty good.

Suppose we want to estimate $\mathbb{E}[f(x)]$. We can do so by sampling from $p(x)$. Now, let $g(x)$ be another function, whose expectation $\mathbb{E}[g(x)]$ is known / can be computed analytically. Let $F = f(x)$ and $G = g(x)$. Then

$$F^* = F + c\left( G - \mathbb{E}[g(x)] \right)$$

also has expectation $\mathbb{E}[f(x)]$ for any $c$. If $F$ and $G$ are correlated, a good choice of $c$ can make the variance of $F^*$ lower than the variance of $F$.
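A minimal sketch of a control variate at work. The choices of $f$, $g$, and $c$ here are mine, picked so the true values are known: $f(x) = e^x$ on $x \sim \mathrm{Uniform}(0,1)$, with the control variate $g(x) = x$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(x)] for f(x) = exp(x), x ~ Uniform(0, 1). True value: e - 1.
# Control variate g(x) = x, with known expectation E[g] = 0.5.
x = rng.uniform(0, 1, size=(1000, 100))
f = np.exp(x)
g = x

c = -1.7  # hand-picked; the optimal choice is c = -Cov(f, g) / Var(g)
f_star = f + c * (g - 0.5)

est_plain = f.mean(axis=1)   # per-batch plain estimates
est_cv = f_star.mean(axis=1) # per-batch control-variate estimates
print(est_plain.mean(), est_cv.mean())  # both near e - 1 = 1.718...
print(est_cv.std() < est_plain.std())   # True: variance went down
```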

To apply control variates to RL, recall that we showed the policy gradient was

$$\nabla_\theta J(\theta) = \sum_t \mathbb{E}_{\tau}\left[ R_t\, \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

We want to subtract an action-dependent baseline $b(s_t, a_t)$. Next, we make an observation from the MuProp paper: first-order Taylor expansions make nice control variates. We’ll see how the linear function makes the math nice. (Or at least, nicer. It’s still pretty bashy.)

Let $f(s_t, a_t)$ be some arbitrary function. Since each expectation is conditioned on $s_t$, it suffices to consider the Taylor expansion with respect to just $a_t$. Let $\bar{a}_t$ be some arbitrary action, depending only on $s_t$. The 1st-order Taylor expansion around $\bar{a}_t$ is

$$\bar{f}(s_t, a_t) = f(s_t, \bar{a}_t) + \nabla_a f(s_t, a)\big|_{a = \bar{a}_t} \cdot (a_t - \bar{a}_t)$$

We plan to compute

$$\sum_t \mathbb{E}_{\tau}\left[ \left( R_t - \bar{f}(s_t, a_t) \right) \nabla_\theta \log \pi_\theta(a_t|s_t) \right] + \sum_t \mathbb{E}_{\tau}\left[ \bar{f}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

This is trivially equal to the original gradient, which leaves analytically computing the 2nd term. Think of this as the offset we need to re-add to the gradient to make the gradient estimator unbiased again. Substituting the Taylor expansion, the 2nd term splits into three expectations, scaled by $f(s_t, \bar{a}_t)$, by $\nabla_a f(s_t, a)\big|_{a = \bar{a}_t} \cdot a_t$, and by $\nabla_a f(s_t, a)\big|_{a = \bar{a}_t} \cdot \bar{a}_t$ respectively.

In the first expectation, because $\bar{a}_t$ depends only on $s_t$, the scaling factor $f(s_t, \bar{a}_t)$ depends only on $s_t$, and therefore the expectation goes to $0$ (we can move $f(s_t, \bar{a}_t)$ out of the expectation, then note the expectation of the score function is $0$.)

In the last expectation, the same is also true. $\nabla_a f(s_t, a)\big|_{a = \bar{a}_t} \cdot \bar{a}_t$ depends only on $\bar{a}_t$, which only depends on $s_t$, and therefore can also move out of the expectation, and therefore that term goes to $0$ too.

That leaves the middle expectation. At this point, note the expectation over $\tau$ is the same as the expectation over $(s_t, a_t)$, since at this point no terms depend on other states or actions.

Points of confusion I ran into: we’re applying the same trick of $p_\theta \nabla_\theta \log p_\theta = \nabla_\theta p_\theta$. We’re multiplying a vector by a matrix to get a vector. (Recall we want the expectation to return an offset for the gradient with respect to $\theta$.)

We’ve ended with multiplying the gradient of $f$ w.r.t. $a$ at $\bar{a}_t$ with the gradient w.r.t. $\theta$ of the mean action of policy $\pi_\theta$. From here, we can make a number of natural choices.
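The key identity behind the middle term, that $\mathbb{E}\left[ a_t\, \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$ equals the gradient of the mean action, can be sanity-checked in 1-D. The setup is mine: a unit-variance Gaussian whose mean is the parameter itself, so the gradient of the mean action is exactly 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D Gaussian policy a ~ N(theta, 1). We check that
# E[a * d/dtheta log pi(a)] equals d/dtheta of the mean action, i.e. 1.
theta = 0.7
a = rng.normal(loc=theta, scale=1.0, size=1_000_000)
score = a - theta  # d/dtheta log N(a; theta, 1)

print(np.mean(a * score))  # close to 1
```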

  • Let $\pi_\theta(a_t|s_t) = \mathcal{N}(\mu_\theta(s_t), \Sigma)$, where $\mu_\theta$ is some learned actor (some neural net), and $\Sigma$ is either constant or also learned. There’s no specific reason $\pi_\theta$ needs to be Gaussian, as long as the mean action has an analytic closed form.
  • For the $t$th term, let $\bar{a}_t = \mu_\theta(s_t)$.
  • For $f$, let $f$ be an estimate $Q_w$ of the Q-function $Q^\pi$, where $w$ are the weights of the estimate. We update $w$ in between computing and applying each policy gradient.

This gives the final gradient.

(The paper interprets $R_t$ as the empirical Q-value. Under that interpretation, each gradient is scaled by the difference between the empirical and estimated Q-values.)

TODO: explain extension to advantage estimation.

Why go through all this effort to learn an action-dependent baseline? Intuitively, if our baseline depends on both state and action, we can do a better job of “centering” the gradient. We can also learn $Q_w$ with off-policy data saved in a replay buffer, with standard Q-Learning.

(A common choice of state-dependent baseline is a learned value function $V(s)$. Why can’t we learn $V(s)$ with off-policy data? The problem is that it doesn’t have the same self-consistency objective as the TD-error in Q-Learning; it requires knowing the action the current policy would take, which requires on-policy data collection.)

Important note: Q-Prop is an approach for estimating the gradient. It says nothing about how to apply it. You can directly add the gradient with SGD, or you can use natural policy gradient, or TRPO.

TODO: explain aggressive vs conservative Q-Prop.