This blog post introduces function approximation in reinforcement learning.

So far in the reinforcement learning series, we have been working with methods that update state-action values in a table (or Q-table) with their own update rules. However, realistic reinforcement learning problems are complex enough that the table becomes extremely large and no longer fits in memory. To tackle this problem, we can move away from relying on tables by approximating the entries of the table with a function.
Function Approximations
Here, the task is to fit an arbitrary function to the table, mapping a state-action pair $(s, a)$ to the corresponding state-action value $Q^*(s, a)$ for the optimal policy $\pi^*$. (Alternatively, we can map a state to the Q-values of each action, i.e., a row of the Q-table, which makes the computation easier.) The function can be linear in a set of basis functions (linear, polynomial, or trigonometric), or non-linear, like a deep neural network. Either way, we can fit the function $\hat{Q}(s, a; w)$ with various optimizers, including stochastic gradient descent (SGD) on the least-squares error, as follows.

$$L(w) = \big(Q^*(s, a) - \hat{Q}(s, a; w)\big)^2, \qquad w \leftarrow w + \alpha \big(Q^*(s, a) - \hat{Q}(s, a; w)\big) \nabla_w \hat{Q}(s, a; w)$$
The estimate of $Q^*(s, a)$ can be obtained using the various methods we have discussed, such as the Monte Carlo method, which uses the return $G_t$, and TD(0), which uses the bootstrapped target $r + \gamma \hat{Q}(s', a'; w)$. This function approximation approach seems promising, especially with deep neural networks, i.e., deep Q-learning (DQN), since deep neural networks can theoretically approximate virtually any non-linear function given enough weights. However, it is not so simple in practice.
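As a concrete illustration, here is a minimal PyTorch sketch of mapping a state to the Q-values of each action with a small fully connected network and taking one least-squares SGD step toward an estimated target. The network sizes, hyperparameters, and the `update` helper are illustrative assumptions, not part of any particular implementation.

```python
import torch
import torch.nn as nn

# A small fully connected network mapping a state to the Q-values of each action
# (one row of the Q-table). Sizes are illustrative.
class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=4, num_actions=2)
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def update(state, action, target):
    """One least-squares SGD step toward an estimated target
    (e.g., the Monte Carlo return or the TD(0) target)."""
    q_value = q_net(state)[action]
    loss = (target - q_value).pow(2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example usage with an illustrative state and target value.
state = torch.tensor([0.1, 0.0, -0.2, 0.05])
update(state, action=1, target=0.5)
```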
Function Approximation Challenges
If you implement such a deep neural network for predicting Q-values from state-action pairs, you will see that the DNN struggles to converge to the optimal policy. This is due to conflicts between the assumptions of DNNs and the nature of MDPs. First, the estimates of the Q-values under the optimal policy are non-stationary: the values change dynamically as new rewards are discovered, which contradicts the DNNs' assumption of stationary data. Second, the data generated by an MDP are highly correlated and not independent and identically distributed (i.i.d.), whereas DNNs assume i.i.d. data.
Thus, in practice, DNNs typically 'chatter' around a near-optimal value function or occasionally diverge suddenly due to exploration. Since we are no longer guaranteed to improve the policy at every iteration, we lose the theoretical convergence guarantees with non-linear function approximation, unlike with the previously discussed methods.
Prioritized Experience Replay
Instead of an incremental approach like TD, which causes the data to be highly correlated, we can store many experiences in a buffer (called a replay buffer) and sample batches of experiences (state, action, reward, next state, and termination) from the buffer for training. This technique, called experience replay, allows us to technically say that we are sampling from the same distribution, making learning more stable. However, we observe that learning still struggles when the rewards are mostly zero, as in the FrozenLake environment. To sample more of the state-action pairs that provide more learning opportunities, we can assign priorities to the pairs based on their TD errors and sample the pairs according to those priorities as follows.

$$p_i = |\delta_i| + \epsilon, \qquad P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$$
Here, $\epsilon$ is a small value used to ensure that experiences with small TD errors $\delta_i$ still have a chance to be sampled, $P(i)$ is the probability of choosing the $i$th experience, and $\alpha$ is a hyperparameter that decides the sharpness of the distribution ($\alpha = 0$ results in uniform sampling). This technique of using priorities for sampling is called prioritized experience replay (PER). However, choosing only experiences with high TD errors can make the model biased and prone to overfitting. Hence, we need to adjust for the learning bias by multiplying the gradient with importance sampling weights as follows.

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}$$

Here, $N$ is the replay buffer size, and $\beta$ is a hyperparameter that decides the degree of bias correction.
We tend to correct for the bias less at first to prioritize learning from experiences with high TD errors, and gradually increase the level of bias correction once learning stabilizes (by annealing $\beta$ from 0.4 to 1).
The link provides an example implementation of PER, where the priorities of the earliest pushed experiences are initially set to a maximum value and updated when computing the TD loss. In this implementation, the priorities are computed from the weighted TD loss instead of the TD error, and the $\epsilon$ used for the priority is a fixed small constant, not the decayed epsilon, which refers to the $\epsilon$ in the $\epsilon$-greedy behavior policy for deep Q-learning.
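To make the sampling scheme concrete, below is a minimal sketch of a proportional prioritized replay buffer. It assumes simple list-based storage with O(N) sampling (efficient implementations use a sum-tree instead), and the class name and constants are illustrative.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional PER sketch: list storage, O(N) sampling."""
    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities, self.pos = [], [], 0

    def push(self, experience):
        # New experiences get the current maximum priority so they are sampled at least once.
        max_prio = max(self.priorities, default=1.0)
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
            self.priorities.append(max_prio)
        else:
            self.buffer[self.pos] = experience
            self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        prios = np.array(self.priorities)
        probs = prios ** self.alpha
        probs /= probs.sum()                       # P(i) = p_i^alpha / sum_k p_k^alpha
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        n = len(self.buffer)
        weights = (n * probs[indices]) ** (-beta)  # importance sampling weights
        weights /= weights.max()                   # normalize for stability
        samples = [self.buffer[i] for i in indices]
        return samples, indices, weights

    def update_priorities(self, indices, td_errors):
        # Called after computing the TD loss for the sampled batch.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + self.eps
```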
Double Q-Learning
PER greatly improves the stability of training, but the code implementation linked above also uses another strategy, called double Q-learning, to stabilize learning further. Standard deep Q-learning uses the same network (the same weights) for the $\epsilon$-greedy behavior policy and the greedy target policy, i.e., for both selecting an action and evaluating it. This can bias the network toward overestimating the value of the current state-action pairs: the behavior policy chooses the best action based on the same criteria the target policy uses for evaluation, so the target policy tends to agree with the behavior policy and assign a high value to the chosen action. Double Q-learning addresses this by having two separate networks for action selection and evaluation, offering fairer evaluation and more stable learning.
In practice, we use two networks of the same shape, where the target network is frozen for a number of iterations and its weights are periodically updated to match those of the behavior network. The implementation does this by using load_state_dict from PyTorch within update_target and running that function every 1000 iterations. It makes sense to take this approach, since we assume the existence of a single true Q-table for an MDP; using different networks only reduces the overestimation of Q-values and stabilizes learning, i.e., it addresses an engineering issue.
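Here is a minimal sketch of this periodic synchronization, assuming a simple fully connected network and an illustrative training loop; only the load_state_dict call and the 1000-iteration sync interval mirror the linked implementation.

```python
import copy
import torch.nn as nn

# The behavior network is trained every step; the target network stays frozen
# and only copies the behavior network's weights periodically.
behavior_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
target_net = copy.deepcopy(behavior_net)

def update_target():
    # Copy the behavior network's weights into the frozen target network.
    target_net.load_state_dict(behavior_net.state_dict())

SYNC_EVERY = 1000  # sync interval, matching the linked implementation
for step in range(10_000):
    # ... sample a batch, compute the TD loss (behavior_net selects the next
    # action, target_net evaluates it), and take a gradient step on behavior_net ...
    if step % SYNC_EVERY == 0:
        update_target()
```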
Dueling Q-Learning
Dueling Q-learning addresses scenarios where the choice of action in some states does not matter much for the reward, for example, when all actions lead to roughly the same state with no additional reward. Even in these scenarios, standard Q-learning approximates the Q-value of every action at every state, which can be redundant. Dueling Q-learning addresses this by breaking the Q-value down into a state value and an advantage as follows.

$$Q(s, a) = V(s) + A(s, a)$$
The advantage function $A(s, a)$ measures how much better action $a$ is than the average action in state $s$. We can have separate networks (or network heads) approximating the value and the advantage, which allows us to learn the relatively stable state values more efficiently and generally makes learning more stable and faster. To further stabilize learning, we can subtract the mean advantage, since the advantage function can otherwise take arbitrary values.

$$Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')\right)$$

The implementation of dueling Q-learning with standard experience replay can be accessed via the link here.
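Below is a minimal PyTorch sketch of a dueling architecture with a shared feature layer and separate value and advantage heads, including the mean subtraction; the layer sizes and names are illustrative, not those of the linked implementation.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                 # V(s)
        self.advantage = nn.Linear(hidden, num_actions)   # A(s, a)

    def forward(self, state):
        x = self.feature(state)
        value = self.value(x)
        advantage = self.advantage(x)
        # Subtract the mean advantage so the decomposition is well-behaved.
        return value + advantage - advantage.mean(dim=-1, keepdim=True)

# Example: Q-values for a batch of two 4-dimensional states and 3 actions.
q_net = DuelingQNetwork(state_dim=4, num_actions=3)
q_values = q_net(torch.randn(2, 4))  # shape: (2, 3)
```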
Conclusion
In this article, we discussed the idea of function approximation in reinforcement learning, the challenges of the approach, and some tricks for tackling those challenges. These techniques can be combined to make learning as stable and efficient as possible in various scenarios, and there are many other engineering techniques not covered in this article that can further ease the learning process; I recommend checking them out.
Resources
- cwkx. 2022. Reinforcement Learning 7: Function approximation. YouTube.
- higgsfield. 2018. DQN Adventure: from Zero to State of the Art. GitHub.
- Sutton, R. S. & Barto, A. G. 2015. Reinforcement Learning: An Introduction. Carnegie Mellon University.