This blog post introduces model-based methods in reinforcement learning.

So far, we have worked extensively with model-free algorithms, since obtaining a perfect model of the environment, one that outputs the full distributions of next states and rewards required to improve the value and policy with dynamic programming, is extremely hard. However, model-free approaches suffer from a persistent sample-efficiency problem: they require taking many exploratory actions in the real environment before learning the optimal value and policy, which can be costly depending on the environment we are working with.
Therefore, we can instead turn to a more relaxed model-based approach, where we model the environment to sample next states and rewards, and use those samples to learn the value and policy or to perform planning. Although capturing the full distributions of next states and rewards is difficult, learning to produce samples of next states and rewards is relatively easy with simple supervised learning on neural networks.
Dyna Architecture
Model-free approaches take samples from the real environment to learn the value and policy, but we can also use those samples to train a model that simulates the environment, which can then be used to plan or to further learn the value and policy from samples generated by the model. The model-based architecture that integrates direct learning and planning in this way is called the Dyna architecture, and the following is an example PyTorch implementation of it.
import gymnasium as gym  # assumed environment API (reset/step with the 5-value step return)
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def one_hot_encoding(index, size=16):
    # One-hot encode an index (helper; defaults to 16 states).
    vec = torch.zeros(size)
    vec[index] = 1.0
    return vec

def env_one_hot_encoding(state, action, num_states=16, num_actions=4):
    # Concatenated one-hot encoding of (state, action): a 20-dimensional model input.
    return torch.cat([one_hot_encoding(state, num_states), one_hot_encoding(action, num_actions)])

class EnvNetwork(nn.Module):
    # Environment model: predicts the next state (as logits over states) and the reward
    # from a one-hot encoded (state, action) pair.
    def __init__(self):
        super(EnvNetwork, self).__init__()
        self.embed_fn = nn.Linear(20, 16)
        self.next_state_fn = nn.Linear(16, 16)
        self.reward_fn = nn.Linear(16, 1)

    def forward(self, x):
        embed = F.relu(self.embed_fn(x))
        next_state_logits = self.next_state_fn(embed)  # raw logits, as expected by F.cross_entropy
        reward = self.reward_fn(embed)
        return next_state_logits, reward

def DynaQ(env, env_network, optimizer, num_episodes, alpha, gamma, epsilon, n=5):
    Q = np.zeros([16, 4])
    stats = 0
    for episode in range(num_episodes):
        state_action_pairs = []
        state, _ = env.reset()
        done = False
        result_sum = 0
        while not done:
            # Behavior policy: epsilon-greedy
            if np.random.rand() > epsilon:
                action = int(np.argmax(Q[state, :]))
            else:
                action = env.action_space.sample()
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            state_action_pairs.append((state, action))

            # Direct learning: one-step tabular Q-learning update on the real transition
            target = reward if done else reward + gamma * np.max(Q[next_state, :])
            Q[state, action] += alpha * (target - Q[state, action])

            # Model learning: supervised training of EnvNetwork on the observed transition
            next_state_logits, reward_pred = env_network(env_one_hot_encoding(state, action))
            state_loss = F.cross_entropy(next_state_logits.unsqueeze(0), torch.tensor([next_state]))
            reward_loss = F.mse_loss(reward_pred, torch.tensor([reward], dtype=torch.float32))
            loss = state_loss + reward_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            state = next_state
            result_sum += reward

            # Planning: n extra Q-learning updates on transitions sampled from the model
            for _ in range(n):
                rand_idx = np.random.choice(len(state_action_pairs))
                s, a = state_action_pairs[rand_idx]
                with torch.no_grad():
                    next_state_logits, reward_pred = env_network(env_one_hot_encoding(s, a))
                s_next = torch.argmax(next_state_logits).item()
                r = reward_pred.item()
                Q[s, a] += alpha * (r + gamma * np.max(Q[s_next, :]) - Q[s, a])

        stats += result_sum
        # Report the average success over each block of 10 episodes
        if (episode + 1) % 10 == 0:
            print(f"episode: {episode}, success rate: {stats / 10}")
            stats = 0
    env.close()
    return Q

# Assumed setup: FrozenLake-v1 (16 states, 4 actions) matches the dimensions hard-coded above.
env = gym.make("FrozenLake-v1")
env_network = EnvNetwork()
optimizer = torch.optim.Adam(env_network.parameters(), lr=0.001)
Q = DynaQ(env, env_network, optimizer, num_episodes=1000, alpha=0.1, gamma=1, epsilon=0.4, n=5)
The above is an implementation of Dyna-Q, which uses the tabular Q-learning method for both direct learning and planning. You can run the code to confirm that Dyna-Q with a larger n converges faster and is more sample efficient thanks to planning. We can also substitute the direct learning part of the Dyna architecture with other methods, including function approximation, as sketched below.
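For instance, the tabular update could be replaced by a semi-gradient Q-learning update on a small Q-network. The sketch below is only an illustration of that substitution; QNetwork and q_network_update are hypothetical names (not part of the original implementation), and it reuses the one_hot_encoding helper from the code above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    # Hypothetical Q-network: maps a one-hot state to one value per action.
    def __init__(self, num_states=16, num_actions=4):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(num_states, 32)
        self.fc2 = nn.Linear(32, num_actions)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

def q_network_update(q_network, q_optimizer, state, action, reward, next_state, done, gamma):
    # Semi-gradient Q-learning step that could replace the tabular update
    # in both the direct-learning and planning parts of Dyna.
    with torch.no_grad():
        next_q = q_network(one_hot_encoding(next_state))
        target = reward + (0.0 if done else gamma * next_q.max().item())
    pred = q_network(one_hot_encoding(state))[action]
    loss = F.mse_loss(pred, torch.tensor(target, dtype=torch.float32))
    q_optimizer.zero_grad()
    loss.backward()
    q_optimizer.step()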
The algorithm carries an inherent risk: by introducing an approximation of the environment, it limits the quality of the value estimates to the quality of the model. Another potential issue is that the environment model cannot capture changes in the environment and hinders exploration after a large number of iterations. This can be partially addressed by adding an intrinsic reward (curiosity) term $\kappa \sqrt{\tau}$ to the reward, where $\tau$ is the number of time steps since the state-action pair was last tried and $\kappa$ is a small hyperparameter. The Dyna-Q method with this intrinsic reward is called Dyna-Q+, which is more explorative and more robust to changes in the environment.
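A minimal sketch of how this bonus could be wired into the Dyna-Q code above (kappa and tau_table are illustrative names, not part of the original implementation):

import numpy as np

kappa = 0.001  # weight of the curiosity bonus

def update_tau(tau_table, state, action):
    # Call once per real time step: every pair ages by one step, the pair just taken resets to zero.
    tau_table += 1
    tau_table[state, action] = 0

def bonus_reward(tau_table, reward, s, a):
    # Intrinsic reward kappa * sqrt(tau) added to the model's predicted reward during planning.
    return reward + kappa * np.sqrt(tau_table[s, a])

# Usage inside DynaQ: create tau_table = np.zeros([16, 4]), call update_tau after each real step,
# and use bonus_reward(tau_table, r, s, a) in place of r in the planning update.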
Decision-Time Planning
The model-based methods we have covered so far, dynamic programming and Dyna, use planning to improve the values and policy for many states before the current state is ever encountered in the real environment, which is called background planning. With a model of the environment at hand, we can also consider an alternative approach, decision-time planning, where we plan by looking further ahead from the current state after encountering it in the real environment, in order to determine the best action at that state.
One of the most intuitive decision-time planning algorithms is the rollout algorithm, which uses a rollout policy to interact with the model (or environment) and updates the policy based on Monte Carlo value estimates. We can keep improving the rollout policy until the value changes become insignificant or time runs out. We can start from a random policy or a more sophisticated one (such as a neural network), depending on the time and computational resources available. (To make the updates faster, we can run the Monte Carlo trials on multiple processors and use GPUs to run the environment models and policies in parallel.) The rollout policy and the results of the Monte Carlo trials are typically reset for the next current state.
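As a minimal sketch of the idea (not the post's implementation), assuming a sample_model(state, action) function that returns (next_state, reward, done), much like the learned environment model above, decision-time action selection by rollouts could look like this:

import numpy as np

def rollout_value(sample_model, rollout_policy, state, action, gamma, depth=50):
    # Estimate the return of taking `action` in `state` and then following the rollout policy,
    # using one simulated trajectory from the model.
    s, r, done = sample_model(state, action)
    total, discount = r, gamma
    for _ in range(depth):
        if done:
            break
        a = rollout_policy(s)
        s, r, done = sample_model(s, a)
        total += discount * r
        discount *= gamma
    return total

def rollout_action(sample_model, rollout_policy, state, actions, gamma, num_trials=20):
    # Monte Carlo rollout: average the sampled returns per action and act greedily on them.
    values = [np.mean([rollout_value(sample_model, rollout_policy, state, a, gamma)
                       for _ in range(num_trials)]) for a in actions]
    return actions[int(np.argmax(values))]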
Monte Carlo Tree Search
Rollout algorithms use the same policy both for running the Monte Carlo trials and for deciding which trajectories to explore. However, as the rollout policy improves from the trials, it becomes harder to run trials efficiently while balancing exploration and exploitation. Hence, we can instead use two different policies: an informed tree policy for choosing which trajectories to explore, balancing exploration and exploitation, and a simple rollout policy for running trials efficiently. The improved algorithm is called Monte Carlo Tree Search (MCTS), and it has proven to be surprisingly effective.

The tree policy could be something like the $\epsilon$-greedy policy or the UCB selection rule, which adds an uncertainty term to the state-action value and chooses the action with the highest sum, $a_t = \arg\max_a \left[ Q(s, a) + c \sqrt{\ln t / N_t(a)} \right]$. (Here, $c$ is a hyperparameter controlling the degree of exploration, $t$ is the time step, and $N_t(a)$ is the number of times action $a$ has been chosen up to time $t$. The more an action is chosen, the smaller its uncertainty term becomes, encouraging other actions to be selected.) After using the tree policy to decide which direction to explore, we choose an unexplored action to reach the state from which we start the trial, and use the result of the trial under the rollout policy to update the state-action values along the trajectory.
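For instance, a UCB-style selection over the actions of a node could look like the following sketch (q_values, visit_counts, and c are illustrative names, not code from the post):

import numpy as np

def ucb_select(q_values, visit_counts, t, c=1.0):
    # UCB tree policy: Q(s, a) + c * sqrt(ln(t) / N(s, a)), with untried actions
    # (N = 0) selected first. q_values and visit_counts are per-action arrays for one node.
    scores = np.where(
        visit_counts == 0,
        np.inf,
        q_values + c * np.sqrt(np.log(t) / np.maximum(visit_counts, 1)),
    )
    return int(np.argmax(scores))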
MCTS organizes the explored trajectories into a tree and repeats a 4-step process. In the first step, called selection, the tree policy is used to choose a trajectory to follow until it reaches a leaf node (a node with unexplored actions), and we choose one of the unexplored actions to run a trial from in the second step, called expansion. Then, we run the trial with the rollout policy in the third step, called simulation, and use its result to update the values along the trajectory in the fourth step, called backup. We repeat the cycle until we run out of time or computational resources, and then choose the next action with the largest value or visit count. For the next current state, we typically retain the values of its successor states and discard the rest of the tree before rerunning MCTS. (If you want to implement it, I would recommend writing Node and MCTS classes that store the relevant attributes and methods, as sketched below.)
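Following that suggestion, here is a bare-bones sketch of the four steps. It is only an illustration: sample_model, tree_policy, rollout_policy, and the Node layout are assumed interfaces rather than code from the original post, and the tree policy could be the UCB rule sketched above applied to a node's statistics.

class Node:
    # One node of the search tree: stores the state, per-node statistics, and children by action.
    def __init__(self, state, actions):
        self.state = state
        self.untried_actions = list(actions)
        self.children = {}   # action -> child Node
        self.visits = 0
        self.value = 0.0     # running mean of backed-up returns

def rollout(sample_model, rollout_policy, state, gamma, depth=50):
    # Simulation: play one trajectory with the rollout policy and return its discounted return.
    total, discount, done = 0.0, 1.0, False
    for _ in range(depth):
        if done:
            break
        action = rollout_policy(state)
        state, reward, done = sample_model(state, action)
        total += discount * reward
        discount *= gamma
    return total

def mcts(root, sample_model, tree_policy, rollout_policy, actions, gamma, num_iterations=200):
    for _ in range(num_iterations):
        node, path = root, [root]
        # 1. Selection: follow the tree policy down to a node that still has unexplored actions.
        while not node.untried_actions and node.children:
            node = node.children[tree_policy(node)]
            path.append(node)
        # 2. Expansion: take one unexplored action and add the resulting child node.
        #    (Intermediate rewards are ignored here for brevity, as in games scored at the end.)
        if node.untried_actions:
            action = node.untried_actions.pop()
            next_state, _, _ = sample_model(node.state, action)
            child = Node(next_state, actions)
            node.children[action] = child
            node = child
            path.append(child)
        # 3. Simulation: estimate the value of the new node with a rollout.
        result = rollout(sample_model, rollout_policy, node.state, gamma)
        # 4. Backup: update visit counts and running mean values along the selected path.
        for n in path:
            n.visits += 1
            n.value += (result - n.value) / n.visits
    # Choose the root action with the largest visit count (or value).
    return max(root.children, key=lambda a: root.children[a].visits)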
AlphaGo, which defeated an 18-time world champion Go player in 2016, uses the MCTS algorithm with a complex neural network as the tree policy and a simple, fast policy for rollouts, pretrained on 30 million human positions and on self-play positions, and it uses the same rollout policy to play the opponent's moves during the simulation phase. It also trains a value function through self-play and uses both the predicted value and the outcome of the rollouts to update the state-action values. The successors of AlphaGo have continued to use MCTS with various adjustments, demonstrating the effectiveness of MCTS.
Conclusion
In this article, we introduced the Dyna architecture, rollout algorithms, and MCTS. As Dyna-Q, Dyna-Q+, and AlphaGo show, model-based methods can be combined with supervised learning and model-free methods in multiple ways to achieve the best results depending on the environment. This article concludes our coverage of the three main approaches: value-based methods, policy gradient methods, and model-based methods, which can be implemented and combined in various ways. In the next article, we will discuss methods that combine various techniques from machine learning in ways we have not seen yet, to address a wider variety of reinforcement learning problems.
Resources
- cwkx. 2021. Reinforcement Learning 8: Policy gradient methods. YouTube.
- Silver, D. et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.
- Sutton, R. S. & Barto, A. G. 2015. Reinforcement Learning: An Introduction. Carnegie Mellon University.