This blog post introduces the dynamic programming approach in reinforcement learning.

In the last two articles, we introduced Markov Decision Processes (MDP) and Gymnasium for reinforcement learning. Now that we understand the problem, we can start discussing solutions to it. In this article, we will cover the dynamic programming (DP) approach to reinforcement learning, which provides the essential foundations for understanding other methods we will discuss in the future.
(If you are unsure about dynamic programming, I recommend checking out Road to C++ Programmer #19 - Dynamic Programming.)
Dynamic Programming
In the last article, we covered the Bellman equation, which uncovered the recursive structure of the value function. DP can take advantage of this recursive relationship to solve MDPs exactly, provided the MDP perfectly models the environment, although doing so involves immense computational cost. In practice, the DP approach is rarely usable precisely because of this assumption of a perfect model and the high computational cost. However, it is essential theoretically, because all approaches to reinforcement learning can be interpreted as attempts to achieve the same effect as DP with less computation.
The essence of the DP approach in this context is to keep improving the policy based on approximated values until they converge to the values given by the Bellman optimality equation and no longer improve. The overall process is called policy iteration, as we iterate between policy evaluation, which approximates the values of the current policy, and policy improvement, which improves the policy based on these approximated values.
Policy Evaluation
First, we need to approximate the value function of the current policy in order to improve the policy based on it. The algorithm here, iterative policy evaluation, iteratively applies the Bellman equation as an update rule to obtain an approximation of the values of all the states. The following is the update rule used to improve the approximation from the $k$-th step to the $(k+1)$-th step.

$$V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma V_k(s') \right]$$
Initially, we can set $V_0(s) = 0$ for all states $s$, and then apply the above update rule to gradually compute the expected values.
(This process is similar to how the Bellman-Ford algorithm gradually updates its estimates to find the path with minimum cost.)
As $k$ approaches $\infty$, we can guarantee convergence to the actual values $V_{\pi}$, as the effects of each state propagate to all other states.
In practice, convergence (within a small threshold $\theta$) often occurs much earlier, at which point we can terminate and move on to policy improvement.
The following is the code for the policy evaluation algorithm for the FrozenLakeEnvironment that we set up in the previous article.
import copy
import numpy as np

## Coordinate to ID Conversions
def num_to_coord(s):
    return [s // 4, s % 4]

def coord_to_num(coord):
    return coord[0] * 4 + coord[1]

## State-Action Transition (Env only provides `step`, which takes only an action,
## but we need to be able to take an action from any state for the calculation.)
def transition_model(env, state, action):
    # If the agent has already terminated (goal or hole), return the current state and 0 reward
    if state == 15:
        return [[1, state, 0, True]]
    if state in [5, 7, 11, 12]:
        return [[1, state, 0, True]]
    # Else, move and return the result
    state = num_to_coord(state)
    env._agent_location = state
    direction = env._action_to_direction[action]
    env._agent_location = np.clip(
        env._agent_location + direction, 0, 3
    )
    success = np.array_equal(env._agent_location, env._target_location)
    fail = any([np.array_equal(env._agent_location, location) for location in env._hole_locations])
    terminated = any([success, fail])
    reward = 1 if success else 0
    observation = env._agent_location
    observation = coord_to_num(observation)
    # If the environment were non-deterministic, we would return
    # [prob, observation, reward, terminated] for each possible observation.
    return [[1, observation, reward, terminated]]
## Policy Evaluation
def policy_evaluation(env, policy, gamma=1, theta=1e-8, draw=False):
    V = np.zeros(16)
    while True:
        delta = 0  # largest value change in this sweep
        for s in range(16):
            Vs = 0  # state value
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in transition_model(env, s, a):
                    Vs += action_prob * prob * (reward + gamma * V[next_state])  # update rule
            delta = max(delta, np.abs(V[s] - Vs))  # track the largest value change
            V[s] = Vs  # value update
        if draw: plot(V, policy, draw_vals=True)  # plot() is a visualization helper not defined in this article
        if delta < theta:  # finish when the value change is smaller than theta
            break
    return V
env = FrozenLakeEnvironment()
policy = np.ones([16, 4]) / 4  # start with the equiprobable random policy
V = policy_evaluation(env, policy, draw=True)
# Resulting V of each cell from 0 to 15
# array([0.01393977, 0.01163091, 0.02095297, 0.01047648, 0.01624865,
# 0. , 0.04075153, 0. , 0.03480619, 0.08816993,
# 0.14205316, 0. , 0. , 0.17582037, 0.43929118,
# 0. ])
As we need to apply the update rule to all states, we must first set up a transition_model, which allows us to perform an action from any given state. After applying policy evaluation, we can use V, the approximation of the values of all the states under the current policy, to improve the policy.
Policy Improvement
The way in which we can improve the policy based on the value estimates, while ensuring that the new policy cannot be worse than the previous one, is to choose in every state the action that maximizes the value of the next state, that is, to construct a greedy policy. We can mathematically express this using the state-action value function as follows.

$$\pi'(s) = \arg\max_{a} q_{\pi}(s, a) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma V_{\pi}(s') \right]$$
We can be sure that the new greedy policy cannot be worse than the previous one; this is guaranteed by the policy improvement theorem, as long as the value function under the current policy is accurately estimated. We can accomplish policy improvement in code as shown below.
## Create Q from V (row-wise)
def q_from_v(env, V, s, gamma=1):
    q = np.zeros(4)  # row of q corresponding to state s
    for a in range(4):
        for prob, next_state, reward, done in transition_model(env, s, a):
            q[a] += prob * (reward + gamma * V[next_state])  # Bellman equation
    return q

## Policy Improvement
def policy_improvement(env, V, gamma=1):
    policy = np.zeros([16, 4])  # new policy
    for s in range(16):
        q = q_from_v(env, V, s, gamma)
        # deterministic policy: always chooses a single action (does not capture ties)
        # policy[s][np.argmax(q)] = 1
        # stochastic policy that puts equal probability on all maximizing actions
        best_a = np.argwhere(q == np.max(q)).flatten()
        policy[s] = np.sum([np.eye(4)[i] for i in best_a], axis=0) / len(best_a)
    return policy
new_policy = policy_improvement(env, V)
# Resulting Policy
# array([[0. , 0. , 1. , 0. ],
# [0. , 0. , 0. , 1. ],
# [0. , 0. , 1. , 0. ],
# [0. , 1. , 0. , 0. ],
# [0. , 0. , 1. , 0. ],
# :
# [0.25, 0.25, 0.25, 0.25],
# [0.25, 0.25, 0.25, 0.25],
# [0. , 0. , 0. , 1. ],
# [0. , 0. , 0. , 1. ],
# [0.25, 0.25, 0.25, 0.25]])
Here, we established how to approximate the value of a policy, and how to improve the policy based on it. The goal is to reach the optimal policy corresponding to the Bellman optimality equation. To do so, we can simply run this cycle of policy evaluation and improvement iteratively until the values converge. The following is the code implementation of policy iteration.
def policy_iteration(env, gamma=1, theta=1e-8):
    policy = np.ones([16, 4]) / 4
    while True:
        # evaluate the policy (get the value function)
        V = policy_evaluation(env, policy, gamma, theta)
        # greedily choose the best action
        new_policy = policy_improvement(env, V, gamma)
        # stop when the values of the old and new policies are (almost) identical
        if np.max(abs(policy_evaluation(env, policy, gamma, theta) - policy_evaluation(env, new_policy, gamma, theta))) < theta * 1e2:
            break
        policy = copy.copy(new_policy)
    return policy, V
policy, V = policy_iteration(env)
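As a quick sanity check, we can display the greedy action of the resulting policy in each cell. The sketch below reuses env._action_to_direction and the hole and goal indices from the transition model above, and assumes the grid is printed with state 0 in the top-left corner (as implied by num_to_coord); the arrow symbols are purely for illustration.
## Display the greedy action per cell (holes and the goal are shown as '·')
arrows = {(0, 1): '→', (0, -1): '←', (1, 0): '↓', (-1, 0): '↑'}
cells = []
for s in range(16):
    if s in [5, 7, 11, 12, 15]:
        cells.append('·')  # terminal cells have no meaningful action
    else:
        best_a = np.argmax(policy[s])  # pick one of the maximizing actions
        direction = tuple(int(d) for d in env._action_to_direction[best_a])
        cells.append(arrows.get(direction, '?'))
for row in range(4):
    print(' '.join(cells[row * 4: row * 4 + 4]))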
Under the assumption of a perfect model and unlimited computational resources, we can use policy iteration to solve an MDP.
Value Iteration
The problem with the policy iteration approach is that there is no bound on the number of sweeps needed for policy evaluation to converge. In a simple case, we can reach convergence quite fast with this approach, but it turns out we can perform just a single sweep of policy evaluation per iteration and still guarantee convergence, using a process called value iteration. Value iteration can also be thought of as using the Bellman optimality equation instead of the Bellman equation as the update rule, effectively combining policy evaluation and policy improvement.
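Written in the same notation as the policy evaluation update above, one sweep of value iteration amounts to applying the Bellman optimality backup:

$$V_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma V_k(s') \right]$$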
It intuitively makes sense to update each value using only the highest-valued action, since the optimal policy will choose the action that moves to the state with the highest value anyway. The following shows the code implementation of value iteration.
def value_iteration(env, gamma=1, theta=1e-8):
    V = np.zeros(16)
    # Combines policy evaluation & improvement
    # using the Bellman optimality equation
    while True:
        delta = 0
        for s in range(16):
            v_s = V[s]
            q_s = q_from_v(env, V, s, gamma)
            V[s] = max(q_s)  # only choosing the action that leads to max state-action value
            delta = max(delta, abs(V[s] - v_s))
        if delta < theta: break
    # Get the optimal policy at last
    policy = policy_improvement(env, V, gamma)
    return policy, V
policy, V = value_iteration(env)
Because each sweep of value iteration makes smaller changes to the value estimates, the total number of iterations may increase, and including a few extra sweeps of policy evaluation between improvements can lead to faster convergence, as sketched below. Regardless, all of these approaches eventually converge to an optimal policy.
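This middle ground is often called truncated (or modified) policy iteration. The following is a minimal sketch of the idea, reusing the helpers defined above; the eval_sweeps parameter is a hypothetical knob introduced here for illustration, capping the number of evaluation sweeps per improvement step.
## Truncated Policy Iteration (a sketch; eval_sweeps is an illustrative parameter)
def truncated_policy_iteration(env, gamma=1, theta=1e-8, eval_sweeps=5):
    V = np.zeros(16)
    policy = np.ones([16, 4]) / 4
    while True:
        delta = 0
        # a few sweeps of policy evaluation instead of running it to convergence
        for _ in range(eval_sweeps):
            for s in range(16):
                Vs = 0
                for a, action_prob in enumerate(policy[s]):
                    for prob, next_state, reward, done in transition_model(env, s, a):
                        Vs += action_prob * prob * (reward + gamma * V[next_state])
                delta = max(delta, np.abs(V[s] - Vs))
                V[s] = Vs
        # greedy improvement based on the partially evaluated values
        policy = policy_improvement(env, V, gamma)
        # stop once the evaluation sweeps barely change the values anymore
        if delta < theta:
            break
    return policy, V
policy, V = truncated_policy_iteration(env)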
Conclusion
In this article, we covered the dynamic programming approach to solving MDPs (Markov Decision Processes): policy iteration and value iteration. The approach is characterized by the use of memoized value estimates to update the estimate of a state from those of its successor states, a technique specifically called bootstrapping. Many approaches use bootstrapping techniques, even those that do not assume a perfect model. In the next article, we will cover one that neither uses bootstrapping nor assumes a perfect model, and in the article after that, we will cover one that uses bootstrapping but does not assume a perfect model.
Resources
- cwkx. 2022. Reinforcement Learning 4: Dynamic programming. YouTube.
- Sutton, R. S. & Barto, A. G. 2015. Reinforcement Learning: An Introduction. Stanford.