Road to ML Engineer #38 - Gymnasium

Last Edited: 1/30/2025

This blog post introduces Gymnasium, a Python package for building and interacting with reinforcement learning environments.


In the last article, we introduced the Markov decision process, its mathematical representations, and the quantities that can be derived from it. In this article, we will discuss how to create an environment for reinforcement learning tasks in Python.

Gymnasium

Gymnasium, previously known as OpenAI Gym, is a Python library that offers a standardized API for creating, inspecting, and interacting with environments. Gymnasium revolves around the Env class, which exposes its action and state spaces through the action_space and observation_space attributes, both instances of the Space class. The following demonstrates the attributes of an example environment, Frozen Lake, where the agent moves in four directions to navigate a 4×4 grid world.

import gymnasium as gym
 
## Obtain example Env object
name = 'FrozenLake-v1'
env = gym.make(name, is_slippery=False) # is_slippery=False makes movements deterministic (state-transition probability is always 1)
 
## Space objects as attributes of Env object
env.action_space # => Discrete(4) corresponding to LEFT, DOWN, RIGHT, UP
env.observation_space # => Discrete(16) corresponding to ids of boxes in 4 by 4 grid the agent is in

The Discrete class is a subclass of Space that represents a finite set of integers. There are many other Space subclasses, such as Box, MultiBinary, and MultiDiscrete, which will be covered later. An environment needs implementations of the reset and step methods, which reset the environment to its initial configuration and perform one step (i.e., a state transition) given the current state and an action. State-transition probabilities and rewards are often computed algorithmically within the step method. The following example demonstrates the reset and step methods using the Frozen Lake environment.

# Observation Space
#  0   1   2   3
#  4   5   6   7
#  8   9  10  11
# 12  13  14  15
# The indices 5, 7, 11, and 12 are holes. If the agent reaches one of them, 
# it terminates with 0 reward.
# The goal is to reach 15 from 0. Once the agent reaches 15, it terminates
# with reward 1.
 
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
 
observation, info = env.reset() # observation (agent state) = 0
observation, reward, terminated, truncated, info = env.step(DOWN) # moves to 4
print(observation, reward, terminated, truncated) # => 4, 0, False, False
print(reward) # => 0
print(info) # => {'prob': 1.0} <- transition probability
 
observation, reward, terminated, truncated, info = env.step(RIGHT) # moves to 5
print(observation, reward, terminated, truncated) # => 5, 0, True (terminated), False
print(reward) # => 0
print(info) # => {'prob': 1.0} <- transition probability

The reset method places the agent at the 0th index and resets all parameters to their default values. The step method returns observation, reward, terminated, truncated, and info. The truncated output is used to prevent the agent from running indefinitely. We can build an algorithm that interacts with the data provided by the step method and the action space of the environment.
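
To see truncation in action, we can cap the episode length. The snippet below is a minimal sketch that relies on the max_episode_steps argument of gym.make, which wraps the environment in Gymnasium's TimeLimit wrapper; once the step limit is reached, step reports truncated=True even though no terminal state was entered.

import gymnasium as gym

# Cap episodes at 5 steps so that truncation is easy to observe
env = gym.make('FrozenLake-v1', is_slippery=False, max_episode_steps=5)

observation, info = env.reset()
terminated = truncated = False
steps = 0
while not terminated and not truncated:
  # keep bumping into the left wall so the state never changes
  observation, reward, terminated, truncated, info = env.step(0) # LEFT
  steps += 1
print(steps, terminated, truncated) # => 5 False True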

Agent

There are many ways to build an algorithm that interacts with the environment. One approach is to create an Agent class, as follows:

import numpy as np

class RandomAgent():
  def __init__(self, env):
    self.env = env
    # uniform random policy: equal probability for every action in every state
    self.policy = np.ones([env.observation_space.n, env.action_space.n]) / env.action_space.n
 
  def reset(self):
    observation, info = self.env.reset()
    return observation, info
 
  def act(self, observation):
    # equivalently: action = self.env.action_space.sample()
    action = np.random.choice(np.arange(self.env.action_space.n), p=self.policy[observation])
    observation, reward, terminated, truncated, info = self.env.step(action)
    return action, observation, reward, terminated, truncated, info

The RandomAgent stores the environment as an attribute and builds a policy table that holds the probability of taking each action in each state. The policy here is random: a uniform distribution over actions at every state. The reset method wraps env.reset, and the act method selects an action based on the policy and the current observation, then executes it with env.step. The agent can experience multiple episodes as follows:

episodes = 10
expected_reward = 0
 
agent = RandomAgent(env)
 
for episode in range(episodes):
  print(f'Episode {episode}')
 
  observation, info = agent.reset()
  terminated = False
  truncated = False
  actions = []
  while not terminated and not truncated:
    action, observation, reward, terminated, truncated, info = agent.act(observation)
    actions.append(action)
  print(f'  Actions: {actions}')
  print(f'  Terminated: {terminated}')
  print(f'  Truncated: {truncated}')
  print(f'  Final Observation: {observation}')
  print(f'  Final Reward : {reward}')
  expected_reward += reward
 
expected_reward /= episodes
print(f'Expected reward: {expected_reward}')

When running the above code, we can see that the agent almost always falls into holes and never reaches the goal. The goal of reinforcement learning is to implement an update or train method in the agent that can learn to adjust its policy to achieve higher rewards.
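
As a rough sketch of that interface (the LearningAgent class and its update method below are hypothetical placeholders, not part of Gymnasium), a learning agent would consume the transition returned by step and adjust its policy table accordingly:

class LearningAgent(RandomAgent):
  def update(self, observation, action, reward, next_observation, terminated):
    # Placeholder: a learning algorithm would use this transition to make
    # high-reward actions more likely by modifying self.policy[observation].
    pass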

Spaces

There are multiple subclasses of the Space class that can be used in various scenarios. For example, the Discrete space is the simplest one, defining a space consisting of n integers. This class can be used for virtually any discrete space, though some spaces might be better represented using other classes. For example, we can use MultiBinary to represent the on/off states of switches, possibly across different machines.

from gymnasium.spaces import Discrete, MultiBinary

# Discrete
discrete = Discrete(2, start=-2, seed=42) # values in {-2, -1}
discrete.sample() # => np.int64(-1)
 
# MultiBinary
multibinary = MultiBinary(3, seed=42)
multibinary.sample() # => np.array([1, 0, 1])
 
multibinary = MultiBinary([3, 2], seed=42)
multibinary.sample() # => np.array([[1, 0], [0, 0], [0, 1]])

The above code demonstrates the Discrete and MultiBinary spaces. All Space subclasses share the sample method, which generates a random sample from within the space. The MultiBinary class accepts either an integer or a list to define the dimensions of the binary array. The MultiDiscrete class is useful for expressing a space that combines multiple discrete spaces, such as a configuration involving arrow keys and number keys.

from gymnasium.spaces import MultiDiscrete

# MultiDiscrete
multidiscrete = MultiDiscrete([5, 11], seed=42) # 4 arrow keys and 10 number keys (0 for no key pressed)
multidiscrete.sample() # => np.array([3, 8]) <- UP arrow key and 8 pressed. 

The Box class is possibly the most flexible of the Space classes we have covered so far. It can represent arrays of any shape, with arbitrary bounds and any numeric dtype. The following example demonstrates how to use the Box class.

import numpy as np
from gymnasium.spaces import Box

# Box (same bounds for all values)
box = Box(low=-1.0, high=2.0, shape=(3, 4), dtype=np.float32)
box.sample()
# array([[ 0.94306654,  0.3153641 ,  0.23421921, -0.9436297 ],
#        [-0.7621383 ,  1.14396   , -0.48780602,  0.9031065 ],
#        [ 0.8333457 , -0.67683184, -0.3102487 ,  0.04513068]],
#       dtype=float32)
 
# Box (different bounds for different values)
box = Box(low=np.array([-1.0, -2.0]), high=np.array([2.0, 4.0]), shape=(2,), dtype=np.int32)
box.sample()
# array([-1,  3], dtype=int32)

Beyond these fundamental spaces, there are composite spaces that combine multiple fundamental spaces, such as Dict, Tuple, Sequence, and Graph. If you are interested, I recommend checking the official Gymnasium documentation, which is cited at the bottom of this article.
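
As a brief illustration (a minimal sketch; the keys and bounds below are made up for this example), the Dict and Tuple spaces bundle several fundamental spaces into a single observation:

import numpy as np
from gymnasium.spaces import Box, Dict, Discrete, Tuple

# Dict: named sub-spaces, sampled as a dictionary
dict_space = Dict({
    'position': Box(low=0.0, high=1.0, shape=(2,), dtype=np.float32),
    'gear': Discrete(5),
}, seed=42)
dict_space.sample() # => {'position': array([...], dtype=float32), 'gear': np.int64(...)}

# Tuple: ordered sub-spaces, sampled as a tuple
tuple_space = Tuple((Discrete(3), Box(low=-1.0, high=1.0, shape=(1,))), seed=42)
tuple_space.sample() # => (np.int64(...), array([...], dtype=float32))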

Custom Environment

We can create our own environment using the Env and Space classes we have discussed. For example, we can use the Box space instead of the Discrete space to represent the state space when replicating the Frozen Lake environment, as shown below.

from typing import Optional

import numpy as np
from gymnasium import Env
from gymnasium.spaces import Box, Discrete

class FrozenLakeEnvironment(Env):
    def __init__(self):
        # State Space: the agent's (row, column) position on the 4x4 grid
        self.observation_space = Box(0, 3, shape=(2,), dtype=int)
        # Action Space
        self.action_space = Discrete(4)
        self._action_to_direction = {
            0: np.array([0, -1]), # LEFT  (column - 1)
            1: np.array([1, 0]),  # DOWN  (row + 1)
            2: np.array([0, 1]),  # RIGHT (column + 1)
            3: np.array([-1, 0]), # UP    (row - 1)
        }
 
        # Agent's & Target's Locations
        self._agent_location = np.array([0, 0])
        self._target_location = np.array([3, 3])
 
        # Holes' Locations
        self._hole_locations = np.array([[1, 1], [1, 3], [2, 3], [3, 0]])
 
        # info (always prob=1 for deterministic environment)
        self._info = {'prob': 1.0 }
 
    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
        super().reset(seed=seed)
        self._agent_location = np.array([0, 0])
        observation = self._agent_location
        info = self._info
        return observation, info
 
    def step(self, action):
        direction = self._action_to_direction[action]
        # Make sure agent doesn't leave the grid by clipping
        self._agent_location = np.clip(
            self._agent_location + direction, 0, 3
        )
        success = np.array_equal(self._agent_location, self._target_location)
        fail = any([np.array_equal(self._agent_location, location) for location in self._hole_locations])
        terminated = any([success, fail])
        truncated = False
        reward = 1 if success else 0
        observation = self._agent_location
        info = self._info
        return observation, reward, terminated, truncated, info

Since we are dealing with 2D coordinates, we use the _action_to_direction mapping to translate discrete actions into coordinate directions and np.array_equal to compare the agent's location against the target's and the holes' locations. In addition to the reset and step methods, we can add a render method to display the current state, as shown below.

def render(self):
    grid = np.array([["-" for _ in range(4)] for _ in range(4)])
    grid[self._agent_location[0], self._agent_location[1]] = 'A'
    grid[self._target_location[0], self._target_location[1]] = 'G'
    grid[self._hole_locations[:, 0], self._hole_locations[:, 1]] = 'H'

    # If the agent has reached the goal, set the value to T for termination
    if np.array_equal(self._agent_location, self._target_location):
        grid[self._agent_location[0], self._agent_location[1]] = 'T'

    # If the agent falls in a hole, set the value to F for failure
    for hole_location in self._hole_locations:
        if np.array_equal(self._agent_location, hole_location):
            grid[self._agent_location[0], self._agent_location[1]] = 'F'
    print(grid)

The above code creates a 4×4 grid with appropriate letters placed at the agent, target, and hole locations. This approach allows for better visualization of the agent’s observations, as demonstrated below.

env = FrozenLakeEnvironment()
 
episodes = 10
expected_reward = 0
 
for episode in range(episodes):
  print(f'Episode {episode}')
 
  observation, info = env.reset()
  terminated = False
  truncated = False
  actions = []
  while not terminated and not truncated:
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    actions.append(action)
  print(f'  Final Reward : {reward}')
  env.render()
  expected_reward += reward
 
expected_reward /= episodes
print(f'Expected reward: {expected_reward}')
 
## Example output of env.render()
# [['-' '-' '-' '-']
#  ['-' 'F' '-' 'H']
#  ['-' '-' '-' 'H']
#  ['H' '-' '-' 'G']]

Instead of printing out the index of a discrete space, which is difficult to interpret, we can render the Box space using a NumPy array for clearer visualization. Typically, predefined environments come with a corresponding render method that can visualize the state in various formats, which can be selected by passing the render_mode argument.
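
For example, the predefined Frozen Lake environment supports text and image render modes; the snippet below is a small sketch using the 'ansi' mode, which returns the rendering as a string instead of opening a window:

import gymnasium as gym

env = gym.make('FrozenLake-v1', is_slippery=False, render_mode='ansi')
observation, info = env.reset()
observation, reward, terminated, truncated, info = env.step(1) # DOWN
print(env.render()) # text view of the 4x4 grid with the agent's position highlighted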

Conclusion

In this article, we introduced Gymnasium, which provides standardized methods for implementing environments and spaces for reinforcement learning tasks. We explored a predefined environment, built an agent around it, and set up a custom environment with improved rendering. Even if we are not using the Gymnasium library and instead building an environment from scratch, following Gymnasium's standardized structure and methods, such as reset and step, is essential for readability, flexibility, and scalability. Now that we have a clear understanding of how reinforcement learning tasks can be represented both mathematically and programmatically, we will move on to algorithms that improve rewards beyond random action selection.

Resources