This blog post introduces policy gradient methods in reinforcement learning.

So far, we have been discussing state-action value methods, where we update estimates of state-action values stored in a table or produced by a function to derive a greedy or ε-greedy policy. However, we found that the tabular approach has high memory requirements, and the function approximation approach can struggle to converge in real-world applications. In this article, we will discuss a new approach, the policy gradient method, which aims to address these problems.
Policy Gradient
Instead of updating estimates of the state-action values, we can directly learn a policy that maximizes rewards by fitting a parameterized policy function $\pi_{\theta}(a|s)$ with respect to its parameters $\theta$. The reward function that we try to maximize is the expected value of the state value under the policy.

To compute this expected value, we can use the stationary distribution of the Markov chain's states under the policy, denoted as $d^{\pi}(s)$:

$$
J(\theta) = \sum_{s} d^{\pi}(s) V^{\pi}(s) = \sum_{s} d^{\pi}(s) \sum_{a} \pi_{\theta}(a|s) Q^{\pi}(s,a)
$$

We can then compute the gradient of the reward function with respect to the parameters, $\nabla_{\theta} J(\theta)$, which is used for gradient ascent to learn the optimal set of parameters. This method tends to converge more smoothly than function approximation of state-action value functions and is more effective in high-dimensional or continuous action spaces, though it may converge to a local optimum and faces a bias-variance tradeoff.
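Concretely, gradient ascent repeatedly nudges the parameters in the direction of this gradient (written here in standard notation, with an assumed learning rate $\alpha$):

$$
\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)
$$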
Policy Gradient Theorem
To compute the gradient of the reward function, we start by computing $\nabla_{\theta} V^{\pi}(s)$, which is equivalent to $\nabla_{\theta} \left( \sum_{a} \pi_{\theta}(a|s) Q^{\pi}(s,a) \right)$. Using the product rule, we can derive the derivative as follows.

$$
\begin{aligned}
\nabla_{\theta} V^{\pi}(s) &= \nabla_{\theta} \left( \sum_{a} \pi_{\theta}(a|s) Q^{\pi}(s,a) \right) \\
&= \sum_{a} \left( \nabla_{\theta} \pi_{\theta}(a|s) Q^{\pi}(s,a) + \pi_{\theta}(a|s) \nabla_{\theta} Q^{\pi}(s,a) \right) \\
&= \sum_{a} \left( \nabla_{\theta} \pi_{\theta}(a|s) Q^{\pi}(s,a) + \pi_{\theta}(a|s) \sum_{s'} P(s'|s,a) \nabla_{\theta} V^{\pi}(s') \right)
\end{aligned}
$$

We notice that there is a recursive structure in the above expression. To clarify this relationship, we set up two new functions: $\rho^{\pi}(s \to x, k)$, which represents the probability of transitioning from state $s$ to state $x$ under the policy $\pi_{\theta}$ in $k$ steps, and $\phi(s) = \sum_{a} \nabla_{\theta} \pi_{\theta}(a|s) Q^{\pi}(s,a)$. Using these functions, we can expand the recursive structure as follows.

$$
\begin{aligned}
\nabla_{\theta} V^{\pi}(s) &= \phi(s) + \sum_{a} \pi_{\theta}(a|s) \sum_{s'} P(s'|s,a) \nabla_{\theta} V^{\pi}(s') \\
&= \phi(s) + \sum_{s'} \rho^{\pi}(s \to s', 1) \nabla_{\theta} V^{\pi}(s') \\
&= \phi(s) + \sum_{s'} \rho^{\pi}(s \to s', 1) \left[ \phi(s') + \sum_{s''} \rho^{\pi}(s' \to s'', 1) \nabla_{\theta} V^{\pi}(s'') \right] \\
&= \sum_{x} \sum_{k=0}^{\infty} \rho^{\pi}(s \to x, k) \phi(x)
\end{aligned}
$$

This expansion is valid because $\rho^{\pi}(s \to s, 0) = 1$ and $\rho^{\pi}(s \to s', 1) = \sum_{a} \pi_{\theta}(a|s) P(s'|s,a)$. Here, we can see that the gradient of the reward function is proportional to the above expression, which can be derived further as follows.

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &\propto \sum_{s} d^{\pi}(s) \sum_{a} \nabla_{\theta} \pi_{\theta}(a|s) Q^{\pi}(s,a) \\
&= \sum_{s} d^{\pi}(s) \sum_{a} \pi_{\theta}(a|s) Q^{\pi}(s,a) \frac{\nabla_{\theta} \pi_{\theta}(a|s)}{\pi_{\theta}(a|s)} \\
&= \mathbb{E}_{\pi} \left[ Q^{\pi}(s,a) \nabla_{\theta} \ln \pi_{\theta}(a|s) \right]
\end{aligned}
$$
By observing this expansion, we can see that the gradient of the reward function is proportional to the expected value of the product of the state-action value and the gradient of the natural log of the policy. This proportionality is known as the Policy Gradient Theorem, which allows us to simplify the computation of the gradient for performing gradient ascent. Intuitively, we can interpret this computation as adjusting the parameters in the direction that makes the actions with the highest state-action values or advantages more likely.
Here, we can also use the advantage $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$ instead of $Q^{\pi}(s,a)$ to stabilize learning. To make use of deep learning frameworks that compute the gradient from a loss function and perform optimization with various optimizers, we can use the objective $J(\theta) = \mathbb{E}_{\pi}\left[ Q^{\pi}(s,a) \ln \pi_{\theta}(a|s) \right]$, whose gradient is $\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\left[ Q^{\pi}(s,a) \nabla_{\theta} \ln \pi_{\theta}(a|s) \right]$ (in practice, we minimize the negative of this objective as the loss).
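To make this concrete, here is a minimal PyTorch-style sketch of such a surrogate loss (the tensor names `log_probs` and `advantages` are hypothetical placeholders for the log-probabilities of the sampled actions and their estimated advantages or returns):

import torch

def policy_gradient_loss(log_probs, advantages):
    # log_probs: log pi_theta(a_t|s_t) of the sampled actions (computed with gradients)
    # advantages: estimates of Q(s_t, a_t) or A(s_t, a_t), treated as constants
    # Detaching the advantages makes the gradient flow only through the policy, and
    # minimizing the negative mean performs gradient ascent on the objective J(theta).
    return -(advantages.detach() * log_probs).mean()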
REINFORCE
REINFORCE (Monte Carlo policy gradient) samples an episode and uses the return $G_{t}$ as an approximation of $Q^{\pi}(s_{t}, a_{t})$ to compute the gradient ($\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\left[ G_{t} \nabla_{\theta} \ln \pi_{\theta}(a_{t}|s_{t}) \right]$). The following is an example implementation of a parameterized non-linear policy function (a feedforward neural policy network) and the REINFORCE algorithm for training the policy network on the Frozen Lake environment from Gymnasium.
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    def __init__(self):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(16, 8)  # 16 one-hot encoded states -> hidden layer
        self.fc2 = nn.Linear(8, 4)   # hidden layer -> 4 action probabilities

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=0)

def one_hot_encoding(index):
    # Convert a discrete state index into a one-hot vector for the network
    encoding = np.array([1 if i == index else 0 for i in range(16)])
    return torch.tensor(encoding, dtype=torch.float32)

def policy_loss(action, sampled_action, result_sum, gamma, t):
    # Negative log-likelihood of the sampled action weighted by the (discounted) return,
    # following the Policy Gradient Theorem
    log_prob = torch.log(action[sampled_action])
    loss = log_prob * gamma**t * result_sum
    loss = -torch.mean(loss)
    return loss
def REINFORCE(env, policy_network, optimizer, num_episodes, gamma):
    stats = 0
    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False
        results_list = []
        result_sum = 0.0
        # Policy Evaluation: sample one episode under the current policy
        policy_network.eval()
        while not done:
            with torch.no_grad():
                action = policy_network(one_hot_encoding(state))
            sampled_action = torch.multinomial(action, 1)[0]
            new_state, reward, done, _, _ = env.step(sampled_action.item())
            results_list.append((state, sampled_action.item()))
            result_sum += reward
            state = new_state
        # Policy Improvement: skip episodes with zero return (zero gradient)
        if result_sum != 0:
            policy_network.train()
            for t, (state, sampled_action) in enumerate(results_list):
                # Recompute the action probabilities so each update uses a fresh
                # computation graph, then backpropagate using the Policy Gradient Theorem
                action = policy_network(one_hot_encoding(state))
                loss = policy_loss(action, sampled_action, result_sum, gamma, t)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        stats += result_sum
        if episode % 10 == 0 and episode != 0:
            print(f"episode: {episode}, success: {stats/10}")
            stats = 0
    print(f"episode: {episode}, success: {stats/episode}")
    env.close()
env = gym.make("FrozenLake-v1")  # 4x4 Frozen Lake with 16 states and 4 actions
learning_rate = 0.001
gamma = 1
num_episodes = 1000
policy_network = PolicyNetwork()
optimizer = torch.optim.Adam(policy_network.parameters(), lr=learning_rate)
REINFORCE(env, policy_network, optimizer, num_episodes, gamma)
After sampling an episode, we go back over the episode and update the parameters by $\theta \leftarrow \theta + \alpha \gamma^{t} G_{t} \nabla_{\theta} \ln \pi_{\theta}(a_{t}|s_{t})$. Since the added term is zero when the return is zero, and the return is mostly zero in this environment, it takes time for learning to start and even longer to converge to the optimal policy.
Actor-Critic
Instead of using Monte Carlo methods to compute the policy gradient, we can utilize TD(0) by training a state-action value function approximator $Q_{w}(s,a)$ along with the policy $\pi_{\theta}(a|s)$ and updating the policy at every step using the output of the critic ($\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\left[ Q_{w}(s,a) \nabla_{\theta} \ln \pi_{\theta}(a|s) \right]$). This method is called actor-critic since we can view $\pi_{\theta}$ as an actor and $Q_{w}$ as a critic. The following is an example code implementation of actor-critic in the Frozen Lake environment.
class QNetwork(nn.Module):
    def __init__(self):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(20, 10)  # 16-dim state encoding + 4-dim action encoding
        self.fc2 = nn.Linear(10, 1)   # scalar Q value

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

def q_one_hot_encoding(state, action):
    # Concatenate one-hot encodings of the state and the action as the critic input
    state_encoding = one_hot_encoding(state)
    action_encoding = np.array([1 if i == action else 0 for i in range(4)])
    action_encoding = torch.tensor(action_encoding, dtype=torch.float32)
    return torch.cat((state_encoding, action_encoding), dim=0)
def Actor_Critic(env, policy_network, q_network, policy_optimizer, q_optimizer, num_episodes, gamma):
    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False
        with torch.no_grad():
            action = policy_network(one_hot_encoding(state))
        sampled_action = torch.multinomial(action, 1)[0]
        while not done:
            # Forward Pass: take a step and sample the next action
            new_state, reward, done, _, _ = env.step(sampled_action.item())
            with torch.no_grad():
                next_action = policy_network(one_hot_encoding(new_state))
            next_sampled_action = torch.multinomial(next_action, 1)[0]
            # Policy Update: recompute pi(a|s) and Q(s,a) with fresh graphs and detach
            # the critic output so the actor loss only updates the actor
            policy_network.train()
            action = policy_network(one_hot_encoding(state))
            q_value = q_network(q_one_hot_encoding(state, sampled_action.item()))
            loss = policy_loss(action, sampled_action.item(), q_value.detach(), gamma, 0)
            policy_optimizer.zero_grad()
            loss.backward()
            policy_optimizer.step()
            # Q Update: TD(0) target bootstraps from a detached Q(s', a')
            # and is cut off at terminal states
            q_network.train()
            with torch.no_grad():
                q_value_next = q_network(q_one_hot_encoding(new_state, next_sampled_action.item()))
            td_target = reward + gamma * q_value_next * (0.0 if done else 1.0)
            td_loss = torch.mean((td_target - q_value)**2)
            q_optimizer.zero_grad()
            td_loss.backward()
            q_optimizer.step()
            state = new_state
            sampled_action = next_sampled_action
        if episode % 10 == 0 and episode != 0:
            print(f"episode: {episode}/{num_episodes}")
    print(f"episode: {episode}/{num_episodes}")
    env.close()
env = gym.make("FrozenLake-v1")
policy_network = PolicyNetwork()
q_network = QNetwork()
policy_optimizer = torch.optim.Adam(policy_network.parameters(), lr=learning_rate)
q_optimizer = torch.optim.Adam(q_network.parameters(), lr=learning_rate)
Actor_Critic(env, policy_network, q_network, policy_optimizer, q_optimizer, num_episodes, gamma)
If the state is complex, the actor and critic can share a common network that produces the latent state embedding. Although this naive actor-critic method tends to iterate faster, it still struggles to learn the policy since the data is not i.i.d.
Eligibility Traces
As mentioned earlier, we can use the advantage instead of the Q value in the policy objective to stabilize learning and allow the critic to output only the state value. This change enables us to share the state-processing part between the policy and critic networks, simplifying the model architecture. When using TD(0), the objective functions for the actor and critic become the following:

$$
J(\theta) = \mathbb{E}_{\pi}\left[ A(s,a) \ln \pi_{\theta}(a|s) \right] \quad \text{with} \quad A(s,a) \approx r + \gamma V_{w}(s') - V_{w}(s), \qquad
L(w) = \left( r + \gamma V_{w}(s') - V_{w}(s) \right)^{2}
$$

We can notice here that both the advantage and the value difference against the TD(0) target reduce to the TD(0) error $\delta = r + \gamma V_{w}(s') - V_{w}(s)$. We can use TD(0) for simple loss computation, but it has high bias and low variance since it only looks one step ahead. In many cases, we want to balance bias and variance depending on the environment by looking further ahead with a longer eligibility trace (an $n$-step return) as follows.

$$
G_{t}^{(n)} = r_{t} + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V_{w}(s_{t+n}), \qquad
A_{t}^{(n)} = G_{t}^{(n)} - V_{w}(s_{t})
$$
The following code implements the actor-critic method that uses the advantage instead of the Q value, together with an eligibility trace:
class ActorCriticNetwork(nn.Module):
    def __init__(self):
        super(ActorCriticNetwork, self).__init__()
        self.state_processing = nn.Linear(16, 8)  # shared state embedding
        self.actor = nn.Linear(8, 4)              # action probabilities head
        self.critic = nn.Linear(8, 1)             # state value head

    def forward(self, x):
        state_embed = F.relu(self.state_processing(x))
        action_probs = F.softmax(self.actor(state_embed), dim=0)
        value = self.critic(state_embed)
        return action_probs, value
def Actor_Critic(env, network, optimizer, num_episodes, gamma, len_trace):
    stats = 0.0
    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False
        results_list = []
        result_sum = 0.0
        with torch.no_grad():
            action, _ = network(one_hot_encoding(state))
        sampled_action = torch.multinomial(action, 1)[0]
        # Policy Evaluation
        network.eval()
        while not done:
            new_state, reward, done, _, _ = env.step(sampled_action.item())
            with torch.no_grad():
                next_action, next_value = network(one_hot_encoding(new_state))
            next_sampled_action = torch.multinomial(next_action, 1)[0]
            # Store the state in which the action was taken, not the resulting state
            results_list.append((state, sampled_action.item(), reward))
            result_sum += reward
            state = new_state
            action = next_action
            sampled_action = next_sampled_action
            # Policy Improvement: update once the trace is full or the episode ends
            if (len(results_list) == len_trace) or done:
                target_value = 0 if done else next_value
                for state_t, sampled_action_t, reward_t in reversed(results_list):
                    # n-step target computed by recursively discounting future rewards
                    target_value = reward_t + gamma * target_value
                    network.train()
                    action_t, value = network(one_hot_encoding(state_t))
                    # Advantage (v_target - v); detached in the actor loss so it does
                    # not backpropagate into the critic head
                    td_error = target_value - value
                    actor_loss = policy_loss(action_t, sampled_action_t, td_error.detach(), gamma, 0)
                    critic_loss = torch.mean(td_error**2)
                    loss = actor_loss + critic_loss
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                # Slide the trace window forward by one transition
                results_list.pop(0)
        stats += result_sum
        if episode % 10 == 0 and episode != 0:
            print(f"episode: {episode}, success: {stats/10}")
            stats = 0.0
    print(f"episode: {episode}, success: {stats/episode}")
    env.close()
env = gym.make("FrozenLake-v1")
network = ActorCriticNetwork()
optimizer = torch.optim.Adam(network.parameters(), lr=learning_rate)
Actor_Critic(env, network, optimizer, num_episodes, gamma, len_trace=5)
The target value is computed by recursively summing the discounted rewards and bootstrapping from the value of the last state. Thanks to the use of the advantage, we can unify the actor and critic into a single network trained with a combined loss. In this model, it is essential to choose the right eligibility trace length to balance bias and variance.
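As a purely hypothetical numeric illustration (the numbers below are not taken from the environment above), suppose $\gamma = 0.9$, the buffered rewards are $(0, 0, 1)$, and the bootstrapped value of the state after the last transition is $0.5$. Unrolling the recursion from the newest transition backwards gives the targets

$$
\begin{aligned}
G_{3} &= 1 + 0.9 \times 0.5 = 1.45 \\
G_{2} &= 0 + 0.9 \times 1.45 = 1.305 \\
G_{1} &= 0 + 0.9 \times 1.305 = 1.1745
\end{aligned}
$$

so earlier transitions in the trace receive progressively more discounted targets.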
GAE
Alternatively, we can compute the advantage by summing up the discounted TD(0) errors of future steps (the advantages of future states) scaled by a hyperparameter $\lambda$ as follows.

$$
A_{t}^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^{l} \delta_{t+l}, \qquad
\delta_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t})
$$

When $\lambda$ is set to 0, only the initial TD error (when $l = 0$) is non-zero, which makes the advantage the same as the TD(0) error. On the other hand, when $\lambda$ is set to 1, we are summing up all the discounted future TD errors, which is equivalent to the Monte Carlo return minus the value baseline. Hence, by changing the value of the scaler $\lambda$ between 0 and 1, we can control the bias and variance. This method of generalizing advantage computation on a spectrum between TD(0) and Monte Carlo methods is called Generalized Advantage Estimation (GAE), and it is used across various actor-critic methods. (In practice, we set an upper bound on how many future TD(0) errors we sum.)
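As a minimal sketch of how GAE can be computed in a single backward pass (the function name `compute_gae` and its arguments are hypothetical; it assumes per-step rewards, value estimates, and done flags as Python lists, plus the value of the state after the last step):

import torch

def compute_gae(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    # rewards, values, dones: per-step lists; next_value: V(s) after the last step.
    # Accumulate discounted TD(0) errors backwards: A_t = delta_t + gamma*lam*A_{t+1}
    advantages = []
    gae = 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0          # cut the trace at terminal states
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages.insert(0, gae)
        next_value = values[t]                   # becomes V(s_{t+1}) for the previous step
    return torch.tensor(advantages)

The resulting advantages can then replace the TD(0) error in the actor loss shown earlier.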
Conclusion
In this article, we introduced policy gradient methods, the policy gradient theorem behind them, and REINFORCE and actor-critic, which are on-policy policy gradient methods. We also demonstrated how the advantage and eligibility traces can be used in the actor-critic method, and introduced GAE. As demonstrated, these on-policy methods struggle to learn due to non-i.i.d. data and difficulty with exploration. In the next article, we will introduce some of the more advanced algorithms that aim to overcome these issues.
Resources
- cwkx. 2021. Reinforcement Learning 8: Policy gradient methods. YouTube.
- Schulman, J. et al. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347.
- Weng, L. 2018. Policy Gradient Algorithms. Lil'Log.