Road to ML Engineer #44 - Policy Gradient Methods Cont'd

Last Edited: 3/1/2025

This blog post continues the discussion of policy gradient methods in reinforcement learning.


In the previous article, we introduced the policy gradient theorem and two on-policy policy gradient methods, REINFORCE and actor-critic. Although they have better convergence guarantees in theory than state-action value methods, they still struggle to converge due to non-i.i.d. data and difficulties with exploration. In this article, we try to address these problems with more advanced policy gradient methods.

A3C

When the actor-critic method takes a long time to converge, a straightforward way to speed up convergence is to use concurrency. Specifically, we keep a global actor (holding the target policy) and a global critic, and update them using gradients computed from experiences collected by multiple workers, each running its own actor with a behavior policy in a separate environment on its own CPU core. Once a worker reaches termination or the maximum number of time steps, it computes gradients from its experiences, updates the parameters of the global actor and critic, and copies those parameters back into its local actor and critic so it can continue exploring with the updated policy.

This process continues until we reach a certain total number of time steps. When running multiple workers, one worker might terminate faster than the others. In that case, we do not wait for the other workers to finish; we simply update the parameters of the global actor and critic while the other workers are still exploring, which makes the parameter updates asynchronous and allows full CPU utilization. Also, the critic here can use the advantage for its loss, allowing it to output V(s, w) instead of Q(s, a, w). This method of training a global advantage actor-critic by asynchronously applying gradients from multiple workers running local actors and critics is called Asynchronous Advantage Actor-Critic (A3C).

We can interpret A3C as performing mini-batch gradient updates in which multiple workers sample the batches in parallel, or as using a very small replay buffer. You can find an in-depth explanation of the algorithm and a PyTorch implementation of A3C in the video linked here by Machine Learning with Phil (2022). (The only core differences from the actor-critic method in the previous article are the use of torch.multiprocessing and the accumulation of gradients.) As demonstrated in the video, A3C assumes that the target and behavior policies are the same and computes on-policy policy gradients, which might not be a good assumption. It is also sensitive to hyperparameters, and there is observable run-to-run variation in learning outcomes.
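
To make the asynchronous update concrete, here is a rough PyTorch sketch of a single worker loop (my own simplified sketch, not the implementation from the video); the ActorCritic interface, the shared optimizer, the factory functions, and the Gymnasium-style environment API are all assumptions for illustration.

import torch
import torch.multiprocessing as mp

def worker(global_model, optimizer, make_model, make_env, t_max=20, updates=500):
    # Hypothetical ActorCritic interface: get_action(state) -> (action, log_prob, value)
    # and loss(transitions) -> actor loss + advantage critic loss.
    # `optimizer` is assumed to be a shared optimizer built over global_model.parameters().
    env = make_env()                                   # Gymnasium-style environment (assumed)
    local_model = make_model()                         # local behavior policy and critic
    local_model.load_state_dict(global_model.state_dict())
    state, _ = env.reset()
    for _ in range(updates):
        transitions = []
        for _ in range(t_max):                         # roll out up to t_max steps
            action, log_prob, value = local_model.get_action(
                torch.as_tensor(state, dtype=torch.float32))
            state, reward, terminated, truncated, _ = env.step(action)
            transitions.append((log_prob, value, reward))
            if terminated or truncated:
                state, _ = env.reset()
                break
        local_model.zero_grad()
        local_model.loss(transitions).backward()       # gradients on the local copy
        # Copy local gradients into the global model and update it immediately,
        # without waiting for the other workers (asynchronous update).
        for lp, gp in zip(local_model.parameters(), global_model.parameters()):
            gp._grad = lp.grad
        optimizer.step()
        optimizer.zero_grad()
        local_model.load_state_dict(global_model.state_dict())  # pull the latest parameters

# Roughly: call global_model.share_memory(), then start one
# mp.Process(target=worker, args=(global_model, optimizer, make_model, make_env)) per CPU core.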

A2C

Researchers at OpenAI analyzed A3C and found no evidence that the asynchronous updates, introduced for CPU utilization, improve the performance of the model. They then came up with A2C, which removes the asynchronicity from A3C by introducing a coordinator between the global actor-critic and the workers that synchronizes the weight updates. By updating the parameters at the same time, we ensure that all the workers' actor-critic models use the same latest set of parameters as the global actor-critic, making A2C truly on-policy and potentially achieving faster convergence.

We can interpret A2C as combining the workers' mini-batches into one batch for each parameter update, which utilizes GPUs more efficiently. In fact, OpenAI researchers report observing better performance and faster convergence when using larger policies. When implementing A2C, you may consider using semaphores or other synchronization mechanisms that let the coordinator wait for all the workers to finish their exploration.
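
As a minimal sketch of this synchronous variant, the coordinator below waits for every worker's rollout before performing one batched update; the worker.collect interface and the coefficient value are hypothetical.

import torch

def a2c_update(global_model, optimizer, workers, t_max=5):
    # Hypothetical sketch: worker.collect() runs the *current* global policy for
    # t_max steps and returns (log_probs, values, returns) tensors.
    rollouts = [w.collect(global_model, t_max) for w in workers]   # wait for all workers
    log_probs = torch.cat([r[0] for r in rollouts])
    values = torch.cat([r[1] for r in rollouts])
    returns = torch.cat([r[2] for r in rollouts])

    advantages = returns - values.detach()              # advantage estimates
    actor_loss = -(log_probs * advantages).mean()       # policy gradient term
    critic_loss = (returns - values).pow(2).mean()      # value regression term

    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()                                     # one synchronized update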

Off-Policy Policy Gradient

Although A3C does not account for the difference between the behavior and target policies, introducing a behavior policy \beta(a|s) motivates us to rewrite the objective function J(\theta) and the policy gradient \nabla_{\theta} J(\theta) as follows.

J(\theta) = \sum_{s \in S} d_{\beta}(s) V_{\pi_{\theta}}(s) = \sum_{s \in S} d_{\beta}(s) \sum_{a \in A} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a) \\
\frac{d}{d \theta} J(\theta) = \sum_{s \in S} d_{\beta}(s) \sum_{a \in A} \left[ \frac{d}{d \theta} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a) + \pi_{\theta}(a | s) \frac{d}{d \theta} Q_{\pi_{\theta}}(s, a) \right]

The gradient above is expanded using the product rule. We can rely on the proof by Degris et al. (2013) that convergence to a local optimum is still guaranteed even when we use an approximated gradient that ignores the \nabla_\theta Q_{\pi_\theta}(s, a) term, which simplifies the gradient computation as follows.

\frac{d}{d \theta} J(\theta) = \sum_{s \in S} d_{\beta}(s) \sum_{a \in A} \left[ \frac{d}{d \theta} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a) + \pi_{\theta}(a | s) \frac{d}{d \theta} Q_{\pi_{\theta}}(s, a) \right] \\
\approx \sum_{s \in S} d_{\beta}(s) \sum_{a \in A} \frac{d}{d \theta} \pi_{\theta}(a | s) Q_{\pi_{\theta}}(s, a) \\
= \sum_{s \in S} d_{\beta}(s) \sum_{a \in A} \beta(a | s) \frac{\pi_{\theta}(a | s)}{\beta(a | s)} Q_{\pi_{\theta}}(s, a) \frac{\frac{d}{d \theta} \pi_{\theta}(a | s)}{\pi_{\theta}(a | s)} \\
= \text{E}_{s \sim d_{\beta}, a \sim \beta} \left[ \frac{\pi_{\theta}(a | s)}{\beta(a | s)} Q_{\pi_{\theta}}(s, a) \frac{d}{d \theta} \ln(\pi_{\theta}(a | s)) \right]

We can see that the only adjustment needed for the off-policy policy gradient is to weight each sample by the importance sampling ratio \frac{\pi_\theta(a | s)}{\beta(a | s)}.
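
To make this concrete, here is a minimal PyTorch sketch of the resulting loss term; the tensor names and the assumption that a critic supplies the Q estimates are mine, for illustration only.

import torch

def off_policy_pg_loss(log_pi, log_beta, q_values):
    # log_pi: log pi_theta(a|s) from the target policy (requires grad),
    # log_beta: log beta(a|s) from the behavior policy,
    # q_values: critic estimates of Q_{pi_theta}(s, a), treated as constants.
    ratio = torch.exp(log_pi - log_beta).detach()   # importance sampling ratio pi/beta
    # Negated because optimizers minimize; the gradient of this loss matches
    # -E[(pi/beta) * Q * d/dtheta ln pi(a|s)], i.e., ascent on J(theta).
    return -(ratio * q_values.detach() * log_pi).mean()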

TRPO

In the A3C setting, the behavior policy in a worker is similar to, but technically different from, the target policy, and this difference should be taken into account when computing the loss and its gradient for better parameter updates. Denoting the behavior policy in the worker by \pi_{\theta_{\text{old}}}, we can write the gradient of the objective function \nabla_\theta J(\theta) using the result above as follows.

\frac{d}{d \theta} J(\theta) = \text{E}_{\theta_{\text{old}}} \left[ \frac{\pi_{\theta}(a | s)}{\pi_{\theta_{\text{old}}}(a | s)} Q_{\pi_{\theta}}(s, a) \frac{d}{d \theta} \ln(\pi_{\theta}(a | s)) \right]

Since the behavior policy in the worker computes the objective using the Q-values it has acquired, and we can assume the behavior policy to be reasonably similar to the target policy, we can approximate the objective using Q_{\pi_{\theta_{\text{old}}}}(s, a) instead.

\frac{d}{d \theta} J(\theta) \approx \text{E}_{\theta_{\text{old}}} \left[ \frac{\pi_{\theta}(a | s)}{\pi_{\theta_{\text{old}}}(a | s)} Q_{\pi_{\theta_{\text{old}}}}(s, a) \frac{d}{d \theta} \ln(\pi_{\theta}(a | s)) \right]

The above expression can be simplified further by expanding the derivative of the natural log as follows.

\frac{d}{d \theta} J(\theta) \approx \text{E}_{\theta_{\text{old}}} \left[ \frac{\pi_{\theta}(a | s)}{\pi_{\theta_{\text{old}}}(a | s)} Q_{\pi_{\theta_{\text{old}}}}(s, a) \frac{\frac{d}{d \theta} \pi_{\theta}(a | s)}{\pi_{\theta}(a | s)} \right] = \text{E}_{\theta_{\text{old}}} \left[ \frac{\frac{d}{d \theta} \pi_{\theta}(a | s)}{\pi_{\theta_{\text{old}}}(a | s)} Q_{\pi_{\theta_{\text{old}}}}(s, a) \right] \\
J(\theta) \approx \text{E}_{\theta_{\text{old}}} \left[ \frac{\pi_{\theta}(a | s)}{\pi_{\theta_{\text{old}}}(a | s)} Q_{\pi_{\theta_{\text{old}}}}(s, a) \right]

We can use the advantage instead of the Q-value to stabilize learning, simplify the expression using r(\theta) = \frac{\pi_\theta(a | s)}{\pi_{\theta_{\text{old}}}(a | s)}, and reach the final surrogate objective.

J(\theta) \approx \text{E}_{\theta_{\text{old}}} [r(\theta) A_{\pi_{\theta_{\text{old}}}}(s, a)]

The surrogate objective assumes \theta \approx \theta_{\text{old}}, which can be maintained by using a synchronous approach or by ensuring that each parameter update takes a small step size (learning rate). Trust Region Policy Optimization (TRPO) keeps the step size small by imposing a constraint on the KL divergence as follows.

\text{E}_{s \sim d_{\pi_{\theta_{\text{old}}}}} [D_{\text{KL}}(\pi_{\theta_{\text{old}}}(.|s) || \pi_{\theta}(.|s))] \leq \delta

Here, \pi(.|s) denotes the action distribution and \delta is a hyperparameter. The constrained update can be solved efficiently using the conjugate gradient method, and a step size that satisfies the constraint can be determined by backtracking line search (both of which are unfortunately outside the scope of this article). These careful parameter updates over policy iterations offer more stable learning with a monotonic policy improvement guarantee (proven in the paper), at the cost of more computation than naive A3C.
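
Leaving the conjugate gradient and line search machinery aside, a minimal PyTorch sketch of the two quantities TRPO works with (the surrogate objective and the KL term) might look as follows; the function and tensor names are hypothetical and assume a discrete action space.

import torch
from torch.distributions import Categorical, kl_divergence

def trpo_surrogate_and_kl(new_logits, old_logits, actions, advantages):
    # old_logits come from the frozen behavior policy pi_theta_old,
    # new_logits from the current policy pi_theta.
    pi_new = Categorical(logits=new_logits)
    pi_old = Categorical(logits=old_logits.detach())
    ratio = torch.exp(pi_new.log_prob(actions) - pi_old.log_prob(actions))   # r(theta)
    surrogate = (ratio * advantages).mean()            # E[r(theta) A_old]
    kl = kl_divergence(pi_old, pi_new).mean()          # E[D_KL(pi_old || pi_new)]
    return surrogate, kl

# TRPO maximizes `surrogate` subject to `kl <= delta`, computing the constrained
# step with the conjugate gradient method and backtracking line search.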

PPO

Instead of imposing the constraint, we can subtract (since we are performing gradient ascent) a KL divergence penalty scaled by a hyperparameter \beta, which allows us to use conventional optimization algorithms and makes the method easy to implement.

J(\theta) \approx \text{E}_{\theta_{\text{old}}} [r(\theta) A_{\pi_{\theta_{\text{old}}}}(s, a)] - \beta \, \text{E}_{s \sim d_{\pi_{\theta_{\text{old}}}}} [D_{\text{KL}}(\pi_{\theta_{\text{old}}}(.|s) || \pi_{\theta}(.|s))]

However, a fixed \beta may not maintain an appropriate trust region throughout training. Hence, we can adjust \beta depending on whether the KL divergence is smaller or larger than a target KL divergence. Specifically, we can halve \beta when the KL divergence is smaller than two-thirds of the target, allowing larger updates, and double \beta when the KL divergence is larger than 1.5 times the target, penalizing the update. So far, we have been using the KL divergence as a measure of how different \pi_{\theta} and \pi_{\theta_{\text{old}}} are, but we can also use r(\theta): the closer r(\theta) is to 1, the more similar the two policies are. Hence, we can clip r(\theta) to a certain range to avoid large step sizes and achieve a similar effect to trust regions.

J(\theta) \approx \text{E}_{\theta_{\text{old}}} [\text{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon) A_{\pi_{\theta_{\text{old}}}}(s, a)]

Here, \epsilon is the hyperparameter that sets the threshold for the update. When r(\theta) falls outside these boundaries, the clipped term provides no incentive for further policy updates. However, the above approach is not ideal, since we also end up clipping the ratio at 1 + \epsilon when the advantage is negative (where r(\theta) A < (1 + \epsilon)A) or at 1 - \epsilon when the advantage is positive (where r(\theta)A < (1 - \epsilon)A), which makes the objective less conservative than the unclipped one. To clip the ratio only when it results in a more conservative surrogate objective, we can modify the above as follows.

J(\theta) \approx \text{E}_{\theta_{\text{old}}} [\min(r(\theta) A_{\pi_{\theta_{\text{old}}}}(s, a), \text{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon) A_{\pi_{\theta_{\text{old}}}}(s, a))]
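
A minimal PyTorch sketch of this clipped surrogate (the tensor names and the value of \epsilon are illustrative):

import torch

def clipped_surrogate(log_pi, log_pi_old, advantages, eps=0.2):
    # r(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed from log-probabilities;
    # advantages are assumed precomputed from the old policy and treated as constants.
    ratio = torch.exp(log_pi - log_pi_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the minimum so clipping only ever makes the objective more conservative.
    return torch.min(unclipped, clipped).mean()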

By using this clipped expression as the surrogate objective, we can achieve an effect similar to the KL divergence constraint or penalty with simpler computation. Combining it with the squared error loss for the advantage critic and an entropy bonus for the policy H_{\pi_{\theta}}(s) that encourages exploration, we arrive at the objective function J^{\text{PPO}}(\theta) of Proximal Policy Optimization (PPO), which matches the results of TRPO with much simpler computation.

H_{\pi_{\theta}}(s) = - \sum_{a} \pi_{\theta}(a | s) \log(\pi_{\theta}(a | s)) \\
J^{\text{CLIP}}(\theta) = \text{E}_{\theta_{\text{old}}} [\min(r(\theta) A_{\pi_{\theta_{\text{old}}}}(s, a), \text{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon) A_{\pi_{\theta_{\text{old}}}}(s, a))] \\
J^{\text{PPO}}(\theta) = \text{E}[J^{\text{CLIP}}(\theta) - c_1 (V_{*}(s) - V_{\theta}(s))^2 + c_2 H_{\pi_{\theta}}(s)]

Here, c_1 and c_2 are hyperparameters for adjusting the impact of the critic loss and the entropy bonus on the overall objective function. The entropy bonus H_{\pi_{\theta}}(s) is, in other words, the expected surprise, which becomes larger when the policy is close to a uniform distribution and therefore encourages more exploration. It is essential to reiterate that we perform gradient ascent on this objective function, which is why we subtract the squared error. (An example PyTorch implementation is available in the video here by Machine Learning with Phil (2021).)
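
Putting the pieces together, a minimal PyTorch sketch of the combined PPO objective might look like the following (my own simplified sketch, not the implementation from the video; the tensor names and the values of \epsilon, c_1, and c_2 are illustrative).

import torch

def ppo_loss(log_pi, log_pi_old, advantages, values, value_targets, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    # Clipped surrogate J^CLIP (same construction as the sketch above).
    ratio = torch.exp(log_pi - log_pi_old.detach())
    j_clip = torch.min(ratio * advantages,
                       torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    critic_loss = (value_targets - values).pow(2).mean()   # squared error for the critic
    entropy_bonus = entropy.mean()                          # H_{pi_theta}(s), encourages exploration
    # J^PPO is maximized, so we return its negation for a minimizing optimizer.
    return -(j_clip - c1 * critic_loss + c2 * entropy_bonus)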

Due to its simplicity and strong performance, PPO is used in various complex reinforcement learning problems (playing Dota 2, controlling robots, etc.) and has demonstrated remarkable success. It is also used to fine-tune LLMs like ChatGPT to generate answers that humans prefer, using human feedback as rewards (called reinforcement learning from human feedback, or RLHF), after supervised learning on next-token prediction using the Common Crawl. Even DeepSeek-R1, which recently took the world by storm, uses a PPO variant called GRPO for its reinforcement learning, showing PPO's incredible capability and the exciting opportunities it opens up.

Conclusion

In this article, we introduced some of the advanced actor-critic methods: A3C, A2C, TRPO, and PPO. While A2C is the only one that is truly on-policy, all of these methods are commonly considered on-policy policy gradient methods. For fully off-policy methods, we can use the importance sampling ratio to correct the objective, although some, like DPG, which learns a deterministic policy, can avoid the need for it. There are many other policy gradient methods, such as DPG, SAC, and TD3, and I recommend checking them out if you are interested.

Resources