Road to ML Engineer #43 - Policy Gradient Methods

Last Edited: 2/25/2025

This blog post introduces policy gradient methods in reinforcement learning.


So far, we have been discussing state-action value methods, where we update estimates of state-action values stored in a table or produced by a function to derive a greedy or ε-greedy policy. However, we discovered that the tabular approach has high memory requirements, and the function-approximation approach struggles to converge in real-life applications. In this article, we will discuss a new approach, the policy gradient method, which aims to address these problems.

Policy Gradient

Instead of updating estimates of the state-action values, we can directly learn the optimal policy \pi_{*}(a | s) that maximizes rewards by fitting a parameterized policy function \pi_{\theta}(a | s) with respect to its parameters \theta. The reward function J(\theta) that we try to maximize is the expected state value under the policy.

d_{\pi}(s) = \lim_{t \to \infty} P(s_t = s | s_0, \pi) \\
J(\theta) = \sum_{s \in S} d_{\pi_{\theta}}(s) V_{\pi_{\theta}}(s) = \sum_{s \in S} d_{\pi_{\theta}}(s) \sum_{a \in A} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a)

To compute this expected value, we use the stationary distribution of the Markov chain's states under the policy, denoted d_{\pi}(s). We can then compute the gradient of the reward function J(\theta) with respect to its parameters \theta and perform gradient ascent to learn the optimal set of parameters. This method tends to converge more smoothly than function approximation of state-action values and is more effective in high-dimensional or continuous action spaces, though it may converge to a local optimum and faces a bias-variance tradeoff.
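To make the stationary distribution concrete, the following is a minimal sketch (the three-state MDP, its transition tensor P, and the policy pi are invented purely for illustration) that estimates d_{\pi}(s) by repeatedly applying the state-transition matrix induced by the policy:

import numpy as np

# Hypothetical 3-state, 2-action MDP used only to illustrate d_pi(s)
P = np.array([  # P[a, s, s'] = P(s' | s, a)
    [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]],
    [[0.2, 0.8, 0.0], [0.0, 0.1, 0.9], [0.5, 0.5, 0.0]],
])
pi = np.array([[0.5, 0.5], [0.3, 0.7], [0.6, 0.4]])  # pi[s, a] = pi(a | s)

# State-transition matrix under the policy: P_pi[s, s'] = sum_a pi(a | s) P(s' | s, a)
P_pi = np.einsum("sa,asp->sp", pi, P)

d = np.ones(3) / 3          # start from a uniform state distribution
for _ in range(1000):       # d converges to the stationary distribution d_pi(s)
    d = d @ P_pi
print(d)                    # J(theta) would then be sum_s d[s] * V_pi(s)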

Policy Gradient Theorem

To compute the gradient of the reward function, we start by computing \nabla_{\theta} V_{\pi_{\theta}}(s), which is equivalent to \nabla_{\theta} \sum_{a \in A} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a). Using the product rule, we can derive the derivative as follows.

\frac{d}{d \theta} V_{\pi_{\theta}}(s) = \frac{d}{d \theta} \sum_{a \in A} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a) \\
= \sum_{a \in A} \left[ \frac{d}{d \theta} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a) + \pi_{\theta}(a | s) \frac{d}{d \theta} Q_{\pi_{\theta}}(s, a) \right] \\
= \sum_{a \in A} \left[ \frac{d}{d \theta} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a) + \pi_{\theta}(a | s) \frac{d}{d \theta} \sum_{s', r} P(s', r | s, a) (r + V_{\pi_{\theta}}(s')) \right] \\
= \sum_{a \in A} \left[ \frac{d}{d \theta} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a) + \pi_{\theta}(a | s) \sum_{s', r} P(s', r | s, a) \frac{d}{d \theta} V_{\pi_{\theta}}(s') \right] \\
= \sum_{a \in A} \left[ \frac{d}{d \theta} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a) + \pi_{\theta}(a | s) \sum_{s'} P(s' | s, a) \frac{d}{d \theta} V_{\pi_{\theta}}(s') \right]

We notice that there is a recursive structure in the above expression. To clarify this relationship, we define two new functions: \rho_{\pi}(s \to s', k), the probability of transitioning from state s to state s' in k steps under the policy \pi, and \phi(s) = \sum_{a \in A} \frac{d}{d \theta} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a). Using these functions, we can expand the recursive structure as follows.

\frac{d}{d \theta} V_{\pi_{\theta}}(s) = \phi(s) + \sum_{a \in A} \pi_{\theta}(a | s) \sum_{s'} P(s' | s, a) \frac{d}{d \theta} V_{\pi_{\theta}}(s') \\
= \phi(s) + \sum_{s'} \sum_{a} \pi_{\theta}(a | s) P(s' | s, a) \frac{d}{d \theta} V_{\pi_{\theta}}(s') \\
= \phi(s) + \sum_{s'} \rho_{\pi_{\theta}}(s \to s', 1) \frac{d}{d \theta} V_{\pi_{\theta}}(s') \\
= \phi(s) + \sum_{s'} \rho_{\pi_{\theta}}(s \to s', 1) \left[ \phi(s') + \sum_{s''} \rho_{\pi_{\theta}}(s' \to s'', 1) \frac{d}{d \theta} V_{\pi_{\theta}}(s'') \right] \\
= \phi(s) + \sum_{s'} \rho_{\pi_{\theta}}(s \to s', 1) \phi(s') + \sum_{s''} \rho_{\pi_{\theta}}(s \to s'', 2) \frac{d}{d \theta} V_{\pi_{\theta}}(s'') \\
= \sum_{x \in S} \sum_{k=0}^{\infty} \rho_{\pi_{\theta}}(s \to x, k) \phi(x)

This expansion is valid because \rho_{\pi}(s \to s, 0) = 1 and \rho_{\pi}(s \to x, k+1) = \sum_{s'} \rho_{\pi}(s \to s', k) \rho_{\pi}(s' \to x, 1). Here, we can see that the gradient of the reward function is proportional to the above expression, which can be derived further as follows.

\frac{d}{d \theta} J(\theta) \propto \frac{d}{d \theta} V_{\pi_{\theta}}(s) = \sum_{x \in S} \sum_{k=0}^{\infty} \rho_{\pi_{\theta}}(s \to x, k) \phi(x) \\
\propto \sum_{x \in S} \frac{\sum_{k=0}^{\infty} \rho_{\pi_{\theta}}(s \to x, k)}{\sum_{x' \in S} \sum_{k=0}^{\infty} \rho_{\pi_{\theta}}(s \to x', k)} \phi(x) \\
= \sum_{x \in S} d_{\pi_{\theta}}(x) \phi(x) \\
= \sum_{x \in S} d_{\pi_{\theta}}(x) \sum_{a \in A} \frac{d}{d \theta} \pi_{\theta}(a | x) Q_{\pi_{\theta}}(x, a) \\
= \sum_{x \in S} d_{\pi_{\theta}}(x) \sum_{a \in A} \pi_{\theta}(a | x) Q_{\pi_{\theta}}(x, a) \frac{\frac{d}{d \theta} \pi_{\theta}(a | x)}{\pi_{\theta}(a | x)} \\
= \text{E}_{s \sim d_{\pi_{\theta}}, a \sim \pi_{\theta}} \left[ Q_{\pi_{\theta}}(s, a) \frac{d}{d \theta} \ln \pi_{\theta}(a | s) \right]

By observing this expansion, we can see that the gradient of the reward function is proportional to the expected value of the product of the state-action value and the derivative of the natural log of the policy. This proportionality is known as the Policy Gradient Theorem, which allows us to simplify the computation of the gradient for performing gradient ascent. Intuitively, we can interpret this computation as adjusting the parameters in the direction that increases the likelihood of taking actions with high state-action values or advantages.

Here, we can also use the advantage A(s, a) instead of Q(s, a) to stabilize learning. To make use of deep learning frameworks that compute the gradient from a loss function and perform optimization with various optimizers, we can use the objective J(\theta) = \text{E}_{\pi_{\theta}}[Q_{\pi_{\theta}}(s, a) \ln \pi_{\theta}(a | s)] (treating Q as a constant and minimizing its negation as the loss), whose gradient is \text{E}_{\pi_{\theta}}[Q_{\pi_{\theta}}(s, a) \frac{\partial}{\partial \theta} \ln \pi_{\theta}(a | s)].
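As a concrete illustration (a minimal sketch with an invented toy policy head and tensor shapes, not tied to any particular environment), the negation of this objective can be written as a loss that an autograd framework differentiates for us:

import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Linear(4, 3)                              # toy policy head: 4 state features -> 3 actions
states = torch.randn(5, 4)                            # hypothetical batch of state features
actions = torch.tensor([0, 2, 1, 1, 0])               # actions that were taken
q_values = torch.tensor([1.0, 0.5, -0.2, 0.3, 0.9])   # Q or advantage estimates, treated as constants

log_probs = F.log_softmax(policy(states), dim=1)[torch.arange(5), actions]
loss = -(q_values * log_probs).mean()                 # minimizing this performs gradient ascent on J(theta)
loss.backward()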

REINFORCE

REINFORCE (Monte Carlo policy gradient) samples an episode and uses the return G_t as an approximation of Q_{\pi}(s, a) to compute the gradient: \nabla_{\theta} J(\theta) = \text{E}_{\pi}[G_t \nabla_{\theta} \ln \pi_{\theta}(a_t | s_t)]. The following is an example implementation of a parameterized non-linear policy function (a feedforward neural policy network) and the REINFORCE algorithm for training the policy network on the Frozen Lake environment from Gymnasium.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym

env = gym.make("FrozenLake-v1")  # 4x4 map: 16 states, 4 actions

class PolicyNetwork(nn.Module):
    def __init__(self):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(16, 8)
        self.fc2 = nn.Linear(8, 4)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=0)
 
def one_hot_encoding(index):
  encoding = np.array([1 if i == index else 0 for i in range(16)])
  return torch.tensor(encoding, dtype=torch.float32)
 
def policy_loss(action, sampled_action, result_sum, gamma, t):
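  # Negated policy-gradient term: -(gamma^t * G_t * log pi(a|s)); G_t may also be a Q value or an advantage estimate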
  log_prob = torch.log(action[sampled_action])
  loss = log_prob * gamma**t * result_sum
  loss = -torch.mean(loss)
  return loss
 
def REINFORCE(env, policy_network, optimizer, num_episodes, gamma):
  stats = 0.0
  for episode in range(num_episodes):
      state, _ = env.reset()
      done = False
      results_list = []
      result_sum = 0.0
 
      # Policy Evaluation: sample an episode without tracking gradients
      policy_network.eval()
      while not done:
          with torch.no_grad():
              action = policy_network(one_hot_encoding(state))
          sampled_action = torch.multinomial(action, 1)[0]
          new_state, reward, terminated, truncated, _ = env.step(sampled_action.item())
          done = terminated or truncated
          results_list.append((state, sampled_action.item()))
          result_sum += reward
          state = new_state
 
      # Policy Improvement
      if result_sum != 0:
        policy_network.train()
        for t, (state, sampled_action) in enumerate(results_list):
            # Recompute the action probabilities with the current parameters,
            # then backpropagate using the Policy Gradient Theorem
            action = policy_network(one_hot_encoding(state))
            loss = policy_loss(action, sampled_action, result_sum, gamma, t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
 
      stats += result_sum
      if episode % 10 == 0 and episode != 0:
          print(f"episode: {episode}, success: {stats/10}")
          stats = 0.0
 
  print(f"episode: {episode}, success: {stats/(episode % 10 or 10)}")
 
  env.close()
 
learning_rate = 0.001
gamma = 1
num_episodes = 1000
policy_network = PolicyNetwork()
optimizer = torch.optim.Adam(policy_network.parameters(), lr=learning_rate)
REINFORCE(env, policy_network, optimizer, num_episodes, gamma)

After sampling an episode, we replay it and update the parameters at each step using J(\theta) = \gamma^t G_t \ln \pi_{\theta}(a_t | s_t). Since the update is zero whenever the return is zero, and the return is mostly zero in this environment, it takes time for learning to start and even longer to converge to the optimal policy.
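Once training finishes, a quick sanity check (my own addition, assuming a fresh FrozenLake-v1 instance) is to roll out the learned policy greedily for a single episode:

eval_env = gym.make("FrozenLake-v1")
state, _ = eval_env.reset()
done = False
while not done:
    with torch.no_grad():
        probs = policy_network(one_hot_encoding(state))
    state, reward, terminated, truncated, _ = eval_env.step(torch.argmax(probs).item())
    done = terminated or truncated
print(f"reached the goal: {reward == 1.0}")
eval_env.close()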

Actor-Critic

Instead of using Monte Carlo methods to compute the policy gradient, we can utilize TD(0) by training a function approximator q_w along with \pi_{\theta} and updating the policy at every step using the output of q_w: \nabla_{\theta} J(\theta) = \text{E}_{\pi}[q_w(s_t, a_t) \nabla_{\theta} \ln \pi_{\theta}(a_t | s_t)]. This method is called actor-critic since we can perceive \pi_{\theta} as the actor and q_w as the critic. The following is an example implementation of actor-critic in the Frozen Lake environment.

class QNetwork(nn.Module):
    def __init__(self):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(20, 10)
        self.fc2 = nn.Linear(10, 1)
 
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
 
def q_one_hot_encoding(state, action):
  state_encoding = one_hot_encoding(state)
  action_encoding = np.array([1 if i == action else 0 for i in range(4)])
  action_encoding = torch.tensor(action_encoding, dtype=torch.float32)
  return torch.cat((state_encoding, action_encoding), dim=0)
 
def Actor_Critic(env, policy_network, q_network, policy_optimizer, q_optimizer, num_episodes, gamma):
  for episode in range(num_episodes):
      state, _ = env.reset()
      done = False
      with torch.no_grad():
          action = policy_network(one_hot_encoding(state))
          sampled_action = torch.multinomial(action, 1)[0]
 
      while not done:
          # Forward Pass (no gradients needed while interacting with the environment)
          policy_network.eval()
          q_network.eval()
          new_state, reward, terminated, truncated, _ = env.step(sampled_action.item())
          done = terminated or truncated
          with torch.no_grad():
              next_action = policy_network(one_hot_encoding(new_state))
              next_sampled_action = torch.multinomial(next_action, 1)[0]
              q_value_next = q_network(q_one_hot_encoding(new_state, next_sampled_action.item()))
              target = reward + (0.0 if done else gamma * q_value_next)
 
          # Policy Update: ascend q_w(s, a) * d/dtheta log pi(a | s)
          policy_network.train()
          q_value = q_network(q_one_hot_encoding(state, sampled_action.item()))
          action = policy_network(one_hot_encoding(state))
          loss = policy_loss(action, sampled_action, q_value.detach(), gamma, 0)
          policy_optimizer.zero_grad()
          loss.backward()
          policy_optimizer.step()
 
          # Q Update: minimize the squared TD(0) error against a fixed target
          q_network.train()
          td_loss = (target - q_value)**2
          q_optimizer.zero_grad()
          td_loss.backward()
          q_optimizer.step()
 
          state = new_state
          sampled_action = next_sampled_action
 
      if episode % 10 == 0 and episode != 0:
        print(f"episode: {episode}/{num_episodes}")
 
  print(f"episode: {episode}/{num_episodes}")
 
  env.close()
 
q_network = QNetwork()
policy_optimizer = torch.optim.Adam(policy_network.parameters(), lr=learning_rate)
q_optimizer = torch.optim.Adam(q_network.parameters(), lr=learning_rate)
Actor_Critic(env, policy_network, q_network, policy_optimizer, q_optimizer, num_episodes, gamma)

If the state is complex, the actor and the critic can share a common network that produces the latent state embedding. Although this naive actor-critic method iterates faster, it still struggles to learn the policy since the data is not i.i.d.

Eligibility Traces

As mentioned earlier, we can use advantage instead of Q value in the policy objective to stabilize learning and allow the critic to output only the state value. This change enables us to share the state processing part between the policy and critic networks, simplifying the model architecture. When using TD(0), the objective function for the actor and critic becomes the following:

J(\theta) \approx \text{E}_t[A(s_t, a_t) \ln \pi_{\theta}(a_t | s_t)] + \text{E}_t[(V(s_t) - V_{w}(s_t))^2] \\
= \text{E}_t[(Q(s_t, a_t) - V(s_t)) \ln \pi_{\theta}(a_t | s_t)] + \text{E}_t[(V(s_t) - V_{w}(s_t))^2] \\
\approx \text{E}_t[(r_t + \gamma V_{w}(s_{t+1}) - V_{w}(s_t)) \ln \pi_{\theta}(a_t | s_t)] + \text{E}_t[(r_t + \gamma V_{w}(s_{t+1}) - V_{w}(s_t))^2]

We can notice here that both the advantage and the critic's value error, when using the TD(0) target, reduce to the TD(0) error \delta_{t}. TD(0) keeps the loss computation simple, but it has high bias and low variance since it only looks one step ahead. In many cases, we want to balance bias and variance for the environment at hand by looking further ahead with longer eligibility traces, as follows.

J(\theta) \approx \text{E}_t[(r_t + \gamma r_{t+1} + \dots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V_{w}(s_{T}) - V_{w}(s_t)) \ln \pi_{\theta}(a_t | s_t)] \\
+ \text{E}_t[(r_t + \gamma r_{t+1} + \dots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V_{w}(s_{T}) - V_{w}(s_t))^2]

The following code implements the actor-critic method using the advantage instead of the Q value, together with an eligibility trace:

class ActorCriticNetwork(nn.Module):
    def __init__(self):
        super(ActorCriticNetwork, self).__init__()
        self.state_processing = nn.Linear(16, 8)
        self.actor = nn.Linear(8, 4)
        self.critic = nn.Linear(8, 1)
 
    def forward(self, x):
        state_embed = F.relu(self.state_processing(x))
        action_probs = F.softmax(self.actor(state_embed), dim=0)
        value = self.critic(state_embed)
        return action_probs, value
 
def Actor_Critic(env, network, optimizer, num_episodes, gamma, len_trace):
  stats = 0.0
  for episode in range(num_episodes):
      state, _ = env.reset()
      done = False
      results_list = []
      result_sum = 0.0
 
      with torch.no_grad():
          action, _ = network(one_hot_encoding(state))
          sampled_action = torch.multinomial(action, 1)[0]
 
      # Policy Evaluation
      network.eval()
      while not done:
        new_state, reward, terminated, truncated, _ = env.step(sampled_action.item())
        done = terminated or truncated
 
        with torch.no_grad():
          next_action, next_value = network(one_hot_encoding(new_state))
          next_sampled_action = torch.multinomial(next_action, 1)[0]
 
        results_list.append((state, sampled_action.item(), reward))
        result_sum += reward
        state = new_state
        sampled_action = next_sampled_action
 
        # Policy Improvement over the current trace
        if (len(results_list) == len_trace) or done:
          network.train()
          target_value = 0.0 if done else next_value
          for (s, a, r) in reversed(results_list):
              # n-step target computed by recursively summing up the rewards
              target_value = r + gamma * target_value
              action_probs, value = network(one_hot_encoding(s))
              advantage = target_value - value # v_target - v
              actor_loss = policy_loss(action_probs, a, advantage.detach(), gamma, 0)
              critic_loss = torch.mean(advantage**2)
              loss = actor_loss + critic_loss
              optimizer.zero_grad()
              loss.backward()
              optimizer.step()
          results_list.pop(0) # slide the trace window forward
 
      stats += result_sum
      if episode % 10 == 0 and episode != 0:
          print(f"episode: {episode}, success: {stats/10}")
          stats = 0.0
 
  print(f"episode: {episode}, success: {stats/(episode % 10 or 10)}")
 
  env.close()
 
network = ActorCriticNetwork()
optimizer = torch.optim.Adam(network.parameters(), lr=learning_rate)
Actor_Critic(env, network, optimizer, num_episodes, gamma, len_trace=5)

The target value is computed by recursively summing up the rewards. Due to the use of advantage, we can unify the actor and critic networks, which can be trained using a combined loss. In this model, it's essential to choose the right length of eligibility trace for balancing bias and variance.

GAE

Alternatively, we can compute the advantage by summing up the discounted TD(0) errors of future states (the advantages of future states), scaled by the hyperparameter \lambda, as follows.

A_t = \sum_{l=0}^{\infty} \lambda^l \gamma^l (r_{t+l} + \gamma V_w(s_{t+l+1}) - V_w(s_{t+l}))

When \lambda is set to 0, only the initial TD(0) error (where l = 0) is non-zero, which makes the advantage equal to the TD(0) error. On the other hand, when \lambda is set to 1, we sum up all the discounted future errors, which is equivalent to the Monte Carlo advantage estimate. Hence, by varying the scalar \lambda between 0 and 1, we can control the bias-variance tradeoff. This way of generalizing advantage computation on a spectrum between TD(0) and Monte Carlo methods is called Generalized Advantage Estimation (GAE), and it is used across various actor-critic methods. (In practice, we set an upper bound T and sum the TD(0) errors only up to it.)
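The following is a minimal sketch of this truncated computation (the function gae_advantages and its list-based inputs are assumptions for illustration, not part of the implementations above); it sums the discounted TD(0) errors backward over a finite rollout:

import numpy as np

def gae_advantages(rewards, values, last_value, gamma, lam):
    # rewards[t] and values[t] = V_w(s_t) for t = 0..T-1; last_value = V_w(s_T), 0 if s_T is terminal
    advantages = np.zeros(len(rewards))
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD(0) error delta_t
        gae = delta + gamma * lam * gae                      # A_t = delta_t + gamma * lambda * A_{t+1}
        advantages[t] = gae
        next_value = values[t]
    return advantages

# lam = 0 reduces to the TD(0) error; lam = 1 recovers the Monte Carlo advantage estimate
print(gae_advantages([0.0, 0.0, 1.0], [0.2, 0.4, 0.6], 0.0, gamma=0.99, lam=0.95))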

Conclusion

In this article, we introduced policy gradient methods, the policy gradient theorem behind them, and REINFORCE and actor-critic, which are on-policy policy gradient methods. We also demonstrated how advantages and eligibility traces can be implemented in the context of actor-critic methods, and how GAE generalizes the advantage computation. As demonstrated, these on-policy methods struggle to learn due to non-i.i.d. data and difficulty with exploration. In the next article, we will introduce some of the more advanced algorithms that aim to overcome these issues.

Resources