This blog post introduces the dynamic programming approach in reinforcement learning.

In the last two articles, we introduced Markov Decision Processes (MDP) and Gymnasium for reinforcement learning. Now that we understand the problem, we can start discussing solutions to it. In this article, we will cover the dynamic programming (DP) approach to reinforcement learning, which provides the essential foundations for understanding other methods we will discuss in the future.
(If you are unsure about dynamic programming, I recommend checking out Road to C++ Programmer #19 - Dynamic Programming.)
Dynamic Programming
In the last article, we covered the Bellman equation, which uncovered the recursive structure of the value function. DP can take advantage of this recursive relationship to solve MDPs exactly, provided the MDP perfectly models the environment, although doing so involves immense computational cost. In practice, the DP approach is rarely usable precisely because of this assumption of a perfect model and the high computational cost. However, it is essential theoretically, because all approaches to reinforcement learning can be interpreted as attempts to achieve the same effect as DP with less computation.
The essence of the DP approach in this context is to keep improving the policy based on approximated values until they converge to the values given by the Bellman optimality equation and no longer improve. The overall process is called policy iteration, as we iterate between policy evaluation, which approximates the values of the current policy, and policy improvement, which improves the policy based on these approximated values.
Policy Evaluation
First, we need to approximate the value function of the current policy in order to improve the policy based on it. The algorithm here, iterative policy evaluation, iteratively applies the Bellman equation as an update rule to obtain an approximation of the values of all the states. The following is the update rule used to improve the approximation from the $k$-th step to the $(k+1)$-th step.

$$V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma V_k(s') \right]$$
Initially, we can set $V_0(s) = 0$ for all states $s$, and then apply the above update rule to gradually compute the expected values.
(This process is similar to how the Bellman-Ford algorithm gradually updates its estimates to find the path with minimum cost.)
As $k$ approaches $\infty$, we can guarantee convergence to the actual values $V_{\pi}$, as the effects of each state propagate to all other states.
In practice, convergence (within a small threshold $\theta$) often occurs much earlier, at which point we can terminate and move on to policy improvement.
The following is the code for the policy evaluation algorithm for the FrozenLakeEnvironment that we set up in the previous article.
import copy
import numpy as np

## Coordinate to ID Conversions
def num_to_coord(s):
    return [s // 4, s % 4]

def coord_to_num(coord):
    return coord[0] * 4 + coord[1]

## State-Action Transition (Env only provides `step`, which takes only an action,
## but we need to be able to take an action from any state for the calculation.)
def transition_model(env, state, action):
    # If the agent has already terminated (goal or hole), return the current state and 0 reward
    if state == 15:
        return [[1, state, 0, True]]
    if state in [5, 7, 11, 12]:
        return [[1, state, 0, True]]
    # Else, move and return the result
    state = num_to_coord(state)
    env._agent_location = state
    direction = env._action_to_direction[action]
    env._agent_location = np.clip(
        env._agent_location + direction, 0, 3
    )
    success = np.array_equal(env._agent_location, env._target_location)
    fail = any([np.array_equal(env._agent_location, location) for location in env._hole_locations])
    terminated = any([success, fail])
    reward = 1 if success else 0
    observation = env._agent_location
    observation = coord_to_num(observation)
    # If the environment were non-deterministic, we would return
    # [prob, observation, reward, terminated] for each possible observation.
    return [[1, observation, reward, terminated]]
## Policy Evaluation
def policy_evaluation(env, policy, gamma=1, theta=1e-8, draw=False):
    V = np.zeros(16)
    while True:
        delta = 0  # largest value change in this sweep
        for s in range(16):
            Vs = 0  # state value
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in transition_model(env, s, a):
                    Vs += action_prob * prob * (reward + gamma * V[next_state])  # update rule
            delta = max(delta, np.abs(V[s] - Vs))  # track the largest value change
            V[s] = Vs  # value update
        if draw: plot(V, policy, draw_vals=True)  # plot() is a visualization helper not defined in this article
        if delta < theta:  # finish when the value change is smaller than theta
            break
    return V
env = FrozenLakeEnvironment()
policy = np.ones([16, 4]) / 4  # start with the equiprobable random policy
V = policy_evaluation(env, policy, draw=True)
# Resulting V of each cell from 0 to 15
# array([0.01393977, 0.01163091, 0.02095297, 0.01047648, 0.01624865,
# 0. , 0.04075153, 0. , 0.03480619, 0.08816993,
# 0.14205316, 0. , 0. , 0.17582037, 0.43929118,
# 0. ])
As we need to apply the update rule to all states, we must first set up a transition_model, which allows us to perform an action from any given state. After applying policy evaluation, we can use V, the approximation of the values of all the states under the current policy, to improve the policy.
Policy Improvement
The way in which we can improve the policy based on the value estimates, while ensuring that the new policy cannot be worse than the previous one, is to choose in every state the action that maximizes the value of the next state, that is, to construct a greedy policy. We can mathematically express this using the state-action value function as follows.

$$\pi'(s) = \arg\max_{a} q_{\pi}(s, a) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma V_{\pi}(s') \right]$$
We can be sure that the new greedy policy cannot be worse than the previous one; this is guaranteed by the policy improvement theorem, as long as the value function under the current policy is accurately estimated. We can accomplish policy improvement in code as shown below.
## Create Q from V (row-wise)
def q_from_v(env, V, s, gamma=1):
    q = np.zeros(4)  # row of q corresponding to state s
    for a in range(4):
        for prob, next_state, reward, done in transition_model(env, s, a):
            q[a] += prob * (reward + gamma * V[next_state])  # Bellman equation
    return q

## Policy Improvement
def policy_improvement(env, V, gamma=1):
    policy = np.zeros([16, 4])  # new policy
    for s in range(16):
        q = q_from_v(env, V, s, gamma)
        # deterministic policy: always chooses a single action (does not capture ties)
        # policy[s][np.argmax(q)] = 1
        # stochastic policy that puts equal probability on all maximizing actions
        best_a = np.argwhere(q == np.max(q)).flatten()
        policy[s] = np.sum([np.eye(4)[i] for i in best_a], axis=0) / len(best_a)
    return policy
new_policy = policy_improvement(env, V)
# Resulting Policy
# array([[0. , 0. , 1. , 0. ],
# [0. , 0. , 0. , 1. ],
# [0. , 0. , 1. , 0. ],
# [0. , 1. , 0. , 0. ],
# [0. , 0. , 1. , 0. ],
# :
# [0.25, 0.25, 0.25, 0.25],
# [0.25, 0.25, 0.25, 0.25],
# [0. , 0. , 0. , 1. ],
# [0. , 0. , 0. , 1. ],
# [0.25, 0.25, 0.25, 0.25]])
Here, we established how to approximate the value of a policy, and how to improve the policy based on it. The goal is to reach the optimal policy corresponding to the Bellman optimality equation. To do so, we can simply run this cycle of policy evaluation and improvement iteratively until the values converge. The following is the code implementation of policy iteration.
def policy_iteration(env, gamma=1, theta=1e-8):
    policy = np.ones([16, 4]) / 4
    while True:
        # evaluate the policy (get the value function)
        V = policy_evaluation(env, policy, gamma, theta)
        # greedily choose the best action
        new_policy = policy_improvement(env, V, gamma)
        # stop when the values of the old and new policies are (almost) identical
        if np.max(abs(policy_evaluation(env, policy, gamma, theta) - policy_evaluation(env, new_policy, gamma, theta))) < theta * 1e2:
            break
        policy = copy.copy(new_policy)
    return policy, V
policy, V = policy_iteration(env)
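As a quick sanity check, we can display the greedy action of the resulting policy in each cell. The sketch below reuses env._action_to_direction and the hole and goal indices from the transition model above, and assumes the grid is printed with state 0 in the top-left corner (as implied by num_to_coord); the arrow symbols are purely for illustration.
## Display the greedy action per cell (holes and the goal are shown as '·')
arrows = {(0, 1): '→', (0, -1): '←', (1, 0): '↓', (-1, 0): '↑'}
cells = []
for s in range(16):
    if s in [5, 7, 11, 12, 15]:
        cells.append('·')  # terminal cells have no meaningful action
    else:
        best_a = np.argmax(policy[s])  # pick one of the maximizing actions
        direction = tuple(int(d) for d in env._action_to_direction[best_a])
        cells.append(arrows.get(direction, '?'))
for row in range(4):
    print(' '.join(cells[row * 4: row * 4 + 4]))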
Under the assumption of a perfect model and unlimited computational resources, we can use policy iteration to solve an MDP.
Value Iteration
The problem with the policy iteration approach is that there is no bound on the number of sweeps needed for policy evaluation to converge. In a simple case, we can reach convergence quite fast with this approach, but it turns out we can perform just a single sweep of policy evaluation per iteration and still guarantee convergence, using a process called value iteration. Value iteration can also be thought of as using the Bellman optimality equation instead of the Bellman equation as the update rule, effectively combining policy evaluation and policy improvement.
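Written in the same notation as the policy evaluation update above, one sweep of value iteration amounts to applying the Bellman optimality backup:

$$V_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma V_k(s') \right]$$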
It intuitively makes sense to update each value using only the highest-valued action, since the optimal policy will choose the action that moves to the state with the highest value anyway. The following shows the code implementation of value iteration.
def value_iteration(env, gamma=1, theta=1e-8):
    V = np.zeros(16)
    # Combines policy evaluation & improvement
    # using the Bellman optimality equation
    while True:
        delta = 0
        for s in range(16):
            v_s = V[s]
            q_s = q_from_v(env, V, s, gamma)
            V[s] = max(q_s)  # only choosing the action that leads to max state-action value
            delta = max(delta, abs(V[s] - v_s))
        if delta < theta: break
    # Get the optimal policy at last
    policy = policy_improvement(env, V, gamma)
    return policy, V
policy, V = value_iteration(env)
Because each sweep of value iteration makes smaller changes to the value estimates, the total number of iterations may increase, and including a few extra sweeps of policy evaluation between improvements can lead to faster convergence, as sketched below. Regardless, all of these approaches eventually converge to an optimal policy.
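This middle ground is often called truncated (or modified) policy iteration. The following is a minimal sketch of the idea, reusing the helpers defined above; the eval_sweeps parameter is a hypothetical knob introduced here for illustration, capping the number of evaluation sweeps per improvement step.
## Truncated Policy Iteration (a sketch; eval_sweeps is an illustrative parameter)
def truncated_policy_iteration(env, gamma=1, theta=1e-8, eval_sweeps=5):
    V = np.zeros(16)
    policy = np.ones([16, 4]) / 4
    while True:
        delta = 0
        # a few sweeps of policy evaluation instead of running it to convergence
        for _ in range(eval_sweeps):
            for s in range(16):
                Vs = 0
                for a, action_prob in enumerate(policy[s]):
                    for prob, next_state, reward, done in transition_model(env, s, a):
                        Vs += action_prob * prob * (reward + gamma * V[next_state])
                delta = max(delta, np.abs(V[s] - Vs))
                V[s] = Vs
        # greedy improvement based on the partially evaluated values
        policy = policy_improvement(env, V, gamma)
        # stop once the evaluation sweeps barely change the values anymore
        if delta < theta:
            break
    return policy, V
policy, V = truncated_policy_iteration(env)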
Conclusion
In this article, we covered the dynamic programming approach to solving MDPs (Markov Decision Processes): policy iteration and value iteration. The approach is characterized by the use of memoized value estimates to update the estimate of a state from those of its successor states, a technique specifically called bootstrapping. Many approaches use bootstrapping techniques, even those that do not assume a perfect model. In the next article, we will cover one that neither uses bootstrapping nor assumes a perfect model, and in the article after that, we will cover one that uses bootstrapping but does not assume a perfect model.
Resources
- cwkx. 2022. Reinforcement Learning 4: Dynamic programming. YouTube.
- Sutton, R. S. & Barto, A. G. 2015. Reinforcement Learning: An Introduction. Stanford.