This blog post introduces some extended methods in reinforcement learning: R2D2 and Dreamer.

So far, we have discussed various reinforcement learning methods, including value-based methods, policy gradient methods, and model-based methods. However, we can combine them with other techniques from machine learning, in ways we have not seen yet, to address a wide variety of reinforcement learning problems that naive methods cannot handle. In this article, we will cover two such extended methods: R2D2 and Dreamer.
R2D2
For a process to be a Markov Decision Process (MDP), all state transitions must satisfy the Markov property: every state transition probability must depend only on the current state and action. However, some reinforcement learning problems are unfortunately not MDPs but partially observable MDPs (POMDPs). For example, when we use snapshots of a game with moving parts as state observations, which is a common setup for training an RL agent on a game, a single snapshot cannot capture the velocity of an object, even though velocity has a huge impact on the state transitions.
To solve this problem, we can use RNN-based agents, which naturally handle sequential inputs and can thus treat a sequence of snapshots as the state to satisfy the Markov property. Using RNNs in reinforcement learning that must learn from sequential data is intuitive, but some difficulties prevented us from applying RNNs directly. One of them was the incompatibility with prioritized experience replay, which stores individual state transitions and is a critical component for distributing experience collection across multiple workers and achieving high performance with DQN-style approaches. To make RNNs compatible with prioritized experience replay, we can either store entire episodes in the buffer, which is infeasible in some environments, or store fixed-length segments (e.g., 80 transitions) cut from the sequence, as sketched below.
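To make this concrete, here is a minimal sketch of how a worker might cut its trajectory into fixed-length, overlapping segments and keep the recurrent state at each segment's first step. The `Segment` and `SegmentWriter` names are hypothetical; the segment length of 80 with an overlap of 40 follows the values reported for R2D2.

```python
import collections

import numpy as np

# A hypothetical replay entry: a fixed-length segment of transitions plus the
# recurrent state the worker had at the segment's first step.
Segment = collections.namedtuple(
    "Segment",
    ["observations", "actions", "rewards", "initial_hidden_state", "priority"],
)


class SegmentWriter:
    """Cuts a worker's trajectory into fixed-length, overlapping segments.

    This is a sketch, not R2D2's implementation.
    """

    def __init__(self, segment_length=80, overlap=40):
        self.segment_length = segment_length
        self.overlap = overlap
        self.steps = []  # rolling buffer of (obs, action, reward, hidden_state)

    def append(self, obs, action, reward, hidden_state):
        """Add one transition; return a finished Segment once enough steps exist."""
        self.steps.append((obs, action, reward, hidden_state))
        if len(self.steps) < self.segment_length:
            return None
        obs_seq, act_seq, rew_seq, hidden_seq = zip(*self.steps)
        segment = Segment(
            observations=np.stack(obs_seq),
            actions=np.array(act_seq),
            rewards=np.array(rew_seq),
            initial_hidden_state=hidden_seq[0],  # only the first hidden state is stored
            priority=1.0,  # replaced by a TD-error-based priority in practice
        )
        # Keep the trailing half so the next segment overlaps this one.
        self.steps = self.steps[self.segment_length - self.overlap:]
        return segment
```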
However, when storing fixed-length segments, another difficulty arises: we need to restore the hidden state at the start of a segment in order to compute the gradients. The naive approach is to unroll the RNN from a zero initial hidden state up to the segment's first step, which is computationally inefficient. Instead, we can store the worker's hidden state in the buffer alongside the segment and use it as the initial hidden state. However, those hidden states were computed by the workers' network parameters, not by the learner's current ones, which causes a discrepancy. To get around this, the learner can unroll for some number of time steps (e.g., 20) from the stored hidden state, a process called burn-in, and start learning from the hidden state obtained after the burn-in.
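The burn-in step itself is simple in code. Below is a minimal sketch assuming a recurrent Q-network that takes a batch of observations and a hidden state and returns Q-values and the new hidden state; the `q_network` / `target_network` interface and shapes are assumptions, not R2D2's actual API.

```python
import torch


def burn_in_update(q_network, target_network, segment, burn_in=20):
    """Sketch of a burn-in step for a recurrent Q-network (hypothetical API)."""
    obs = torch.as_tensor(segment.observations, dtype=torch.float32)  # [T, ...]
    hidden = segment.initial_hidden_state

    # Burn-in: unroll over the first steps only to refresh the stale stored
    # hidden state; no gradients flow through this part.
    with torch.no_grad():
        _, hidden = q_network(obs[:burn_in], hidden)
        _, target_hidden = target_network(obs[:burn_in], segment.initial_hidden_state)

    # Learn only on the remainder of the segment, starting from the refreshed state.
    q_values, _ = q_network(obs[burn_in:], hidden)
    with torch.no_grad():
        target_q_values, _ = target_network(obs[burn_in:], target_hidden)

    # ... from here, compute n-step TD targets from target_q_values and the
    # segment's rewards, and derive the loss and priorities as usual for DQN.
    return q_values, target_q_values
```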
Recurrent Replay Distributed DQN (R2D2) uses this burn-in process to produce the initial hidden state from the stored one and learns from fixed-length segments, collected by multiple workers, prioritized by a mixture of the maximum and the mean of the absolute $n$-step TD errors over the segment, $p = \eta \max_i |\delta_i| + (1 - \eta)\bar{\delta}$ with $\eta = 0.9$. It also introduces other tricks, such as using an LSTM, rescaling values with the invertible function $h(x) = \operatorname{sign}(x)(\sqrt{|x| + 1} - 1) + \varepsilon x$ (making large values smaller without clipping, which stabilizes learning while retaining variance), using a high discount rate ($\gamma = 0.997$), and using $\epsilon$-greedy policies with a fixed $\epsilon_i$ per worker (computed by $\epsilon_i = \epsilon^{1 + \frac{i}{N-1}\alpha}$) for varying degrees of exploration. With those tricks, R2D2 achieved incredible state-of-the-art performance on Atari games in 2018. As an extension, Never Give Up (NGU) adds an intrinsic reward to the reward function for better exploration, similar in spirit to Dyna-Q+.
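As a quick illustration, here are small NumPy sketches of the three formulas above. The constants ($\eta = 0.9$, $\varepsilon = 10^{-3}$, $\epsilon = 0.4$, $\alpha = 7$) follow the Ape-X/R2D2 papers to the best of my knowledge and should be treated as assumptions.

```python
import numpy as np


def mixed_priority(td_errors, eta=0.9):
    """Segment priority: a mix of the max and the mean absolute n-step TD error,
    p = eta * max|delta| + (1 - eta) * mean|delta|."""
    abs_errors = np.abs(np.asarray(td_errors))
    return eta * abs_errors.max() + (1.0 - eta) * abs_errors.mean()


def rescale(x, eps=1e-3):
    """Invertible rescaling h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x,
    which shrinks large values without clipping them."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + eps * x


def worker_epsilon(i, num_workers, base_eps=0.4, alpha=7.0):
    """Per-worker exploration rate eps_i = base_eps ** (1 + alpha * i / (N - 1)):
    worker 0 explores the most, the last worker acts almost greedily."""
    return base_eps ** (1.0 + alpha * i / (num_workers - 1))
```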
Dreamers
To overcome the sample inefficiency of model-free methods, we came up with the idea of training a model to simulate the environment and using it for planning. However, an environment model that predicts the next state and reward only from the current state and action might not work well for POMDPs. Hence, we can again make use of RNNs and have the model predict the next state and reward based on its hidden state from the previous step (its memory), the current state, and the action.
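A minimal sketch of such a recurrent world model might look like the following. The architecture and names are hypothetical; real world models such as Dreamer's recurrent state-space model are considerably more elaborate.

```python
import torch
import torch.nn as nn


class RecurrentWorldModel(nn.Module):
    """Sketch of a recurrent dynamics model: the GRU hidden state carries memory
    across steps, and the next state and reward are predicted from
    (previous hidden state, current state, action)."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.cell = nn.GRUCell(state_dim + action_dim, hidden_dim)
        self.next_state_head = nn.Linear(hidden_dim, state_dim)
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, hidden, state, action):
        # Update the memory with the current state and action, then predict.
        hidden = self.cell(torch.cat([state, action], dim=-1), hidden)
        next_state = self.next_state_head(hidden)
        reward = self.reward_head(hidden)
        return hidden, next_state, reward
```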

Most real-world reinforcement learning problems have large state representations (like screenshots of a computer game), and making each RNN cell work directly with these large representations is computationally expensive. Instead, we can train a variational autoencoder (VAE) to produce smaller latent state representations that preserve the important features, and make the RNN cells and the actor-critic work with this latent representation. The above diagram shows how the world model is trained and then used to train an actor-critic method via planning with this approach. This is analogous to a human learning to dream about an environment in their head and learning only from those dreams, hence the name Dreamer.
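The planning ("dreaming") part can be sketched as a short rollout carried out entirely in latent space. The `world_model` and `actor` interfaces below are hypothetical (the former reuses the sketch above), and the 15-step horizon is, as far as I recall, the imagination horizon used by Dreamer.

```python
import torch


def imagine_rollout(world_model, actor, start_latent, start_hidden, horizon=15):
    """Roll the learned model forward in latent space with the current policy,
    never touching the real environment. The actor-critic is then trained on
    the imagined latents and predicted rewards."""
    latents, rewards = [], []
    latent, hidden = start_latent, start_hidden
    for _ in range(horizon):
        action = actor(latent)
        hidden, latent, reward = world_model(hidden, latent, action)
        latents.append(latent)
        rewards.append(reward)
    return torch.stack(latents), torch.stack(rewards)
```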
DreamerV2 introduced discrete latent state representations and KL balancing, which weights the two directions of the KL divergence loss differently so that the prior predicted from the hidden state is pulled toward the encoder's latent states more strongly than the reverse, encouraging the two to stay close. With these changes, it outperformed Rainbow (a DQN combining the various techniques introduced in the article on function approximation) as a model-based method. DreamerV3 scaled up the model, introduced reward rescaling similar in spirit to R2D2's, and used a discrete regression approach based on two-hot encoded $\lambda$-returns for critic learning. This led to higher performance and sample efficiency than its predecessors across various tasks without heavy hyperparameter tuning.
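Both ideas can be sketched compactly. Below, `balanced_kl` shows the gradient-mixing trick for a single categorical latent (DreamerV2 actually uses a vector of categorical latents, and I believe its balancing weight is 0.8), and `two_hot` shows the encoding used for discrete regression of the $\lambda$-returns (DreamerV3 additionally transforms the targets before encoding); treat the exact constants and shapes as assumptions.

```python
import torch
import torch.distributions as td


def balanced_kl(posterior_logits, prior_logits, alpha=0.8):
    """KL balancing sketch: the prior (from the hidden state) is pulled toward
    the encoder's posterior more strongly than the reverse, by mixing the two
    stop-gradient directions of the same KL term."""
    post = td.Categorical(logits=posterior_logits)
    prior = td.Categorical(logits=prior_logits)
    post_sg = td.Categorical(logits=posterior_logits.detach())
    prior_sg = td.Categorical(logits=prior_logits.detach())
    return (alpha * td.kl_divergence(post_sg, prior)
            + (1.0 - alpha) * td.kl_divergence(post, prior_sg))


def two_hot(values, bins):
    """Two-hot encoding for discrete regression: each scalar target is spread
    over its two nearest bins in proportion to its distance to them.

    values: 1-D tensor of scalars; bins: 1-D sorted tensor of bin locations.
    """
    values = values.clamp(bins[0].item(), bins[-1].item())
    below = (torch.searchsorted(bins, values, right=True) - 1).clamp(0, len(bins) - 2)
    above = below + 1
    weight_above = (values - bins[below]) / (bins[above] - bins[below])
    encoding = torch.zeros(values.shape[0], len(bins))
    encoding.scatter_(-1, below.unsqueeze(-1), (1.0 - weight_above).unsqueeze(-1))
    encoding.scatter_(-1, above.unsqueeze(-1), weight_above.unsqueeze(-1))
    return encoding
```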
MuZero, the successor of AlphaZero, took a similar approach by performing Monte Carlo Tree Search (MCTS) over learned latent state representations instead of the real environment, which led to better sample efficiency and higher performance than its predecessors in various domains. It even achieved better results than R2D2 on Atari games. Unlike the Dreamer approach, however, MuZero does not use a VAE: it does not aim to produce distributional latent state representations or reconstruct the state with a decoder, which might matter in certain scenarios (when the environment is stochastic or explainability is crucial). Additionally, MuZero stacks multiple Atari frames as its input state instead of relying on an RNN that processes each frame, resulting in a simpler architecture.
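For contrast with Dreamer, here is a minimal sketch of the three functions MuZero learns and searches over. Single linear layers stand in for the real residual networks, and all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn


class MuZeroNetworks(nn.Module):
    """Sketch of MuZero's three learned functions: h maps observations to a
    latent state, g advances the latent state for a chosen action and predicts
    the reward, and f predicts the policy and value. MCTS runs entirely on
    these latent states."""

    def __init__(self, obs_dim, num_actions, latent_dim=128):
        super().__init__()
        self.num_actions = num_actions
        self.representation = nn.Linear(obs_dim, latent_dim)                 # h(o) -> s
        self.dynamics = nn.Linear(latent_dim + num_actions, latent_dim + 1)  # g(s, a) -> (s', r)
        self.prediction = nn.Linear(latent_dim, num_actions + 1)             # f(s) -> (policy, v)

    def initial_inference(self, obs):
        latent = self.representation(obs)
        policy_value = self.prediction(latent)
        return latent, policy_value[..., :-1], policy_value[..., -1]

    def recurrent_inference(self, latent, action):
        action_onehot = torch.nn.functional.one_hot(action, self.num_actions).float()
        out = self.dynamics(torch.cat([latent, action_onehot], dim=-1))
        next_latent, reward = out[..., :-1], out[..., -1]
        policy_value = self.prediction(next_latent)
        return next_latent, reward, policy_value[..., :-1], policy_value[..., -1]
```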
Conclusion
In this article, we discussed the main ideas behind R2D2 and the Dreamer models, which use RNNs to turn POMDPs into MDPs in value-based and model-based methods, respectively. We also briefly touched on MuZero, which uses latent state representations to achieve higher sample efficiency with MCTS. For implementation details, I recommend checking out the papers and resources cited below. This article wraps up the reinforcement learning series (at least temporarily), covering the fundamental concepts you need to understand most recent papers in the field. I hope you gained something from this series.
Resources
- cwkx. 2021. Reinforcement Learning 10: Extended methods. YouTube.
- Hafner, D. et al. 2021. Mastering Atari With Discrete World Models. ICLR 2021.
- Hafner, D. et al. 2023. Mastering Diverse Domains Through World Models. arXiv.
- Kapturowski, S. et al. 2018. Recurrent Experience Replay in Distributed Reinforcement Learning. OpenReview.
- Mayer, E. 2023. Model Based RL Finally Works!. YouTube.
- Schrittwieser, J. et al. 2020. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature.
- メンダコ. 2021. Distributed Reinforcement Learning Implemented with Ray, Part 4: R2D2. Hatena Blog.
- メンダコ. 2022. World-Model-Based Reinforcement Learning, Part 1: Implementing DreamerV2. Hatena Blog.