This blog post introduces the Python package Gymnasium, which is used for reinforcement learning.

In the last article, we introduced the Markov decision process, its mathematical representations, and the quantities that can be derived from it. In this article, we will discuss how to create an environment for reinforcement learning tasks in Python.
Gymnasium
Gymnasium, previously known as OpenAI Gym, is a Python library that offers a standardized API for creating, inspecting, and interacting with environments. Gymnasium revolves around the Env class, which defines the action and state spaces as the action_space and observation_space attributes, both instances of the Space class. The following demonstrates the attributes of an example environment, Frozen Lake, where the agent moves in four directions to navigate a 4×4 grid world.
import gymnasium as gym
## Obtain an example Env object
name = 'FrozenLake-v1'
env = gym.make(name, is_slippery=False)  # is_slippery=False makes movements deterministic, i.e., the state-action transition probability is always 1
## Space objects as attributes of the Env object
env.action_space       # => Discrete(4), corresponding to LEFT, DOWN, RIGHT, UP
env.observation_space  # => Discrete(16), corresponding to the ids of the cells of the 4-by-4 grid the agent can be in
The Discrete class is a subclass of the Space class that represents a finite set of integers. There are many other subclasses of the Space class, such as Box, MultiBinary, and MultiDiscrete, which will be covered later. An environment needs implementations of the reset and step methods, which reset the environment to its initial configuration and perform one step (i.e., a state transition) given the current state and an action. State-transition probabilities and rewards are often computed algorithmically within the step method. The following example demonstrates the reset and step methods using the Frozen Lake environment.
# Observation Space
# 0 1 2 3
# 4 5 6 7
# 8 9 10 11
# 12 13 14 15
# The indices 5, 7, 11, and 12 are holes. If the agent reaches one of them,
# it terminates with 0 reward.
# The goal is to reach 15 from 0. Once the agent reaches 15, it terminates
# with reward 1.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
env.reset() # observation (agent state) = 0
observation, reward, terminated, truncated, info = env.step(DOWN) # moves to 4
print(observation, reward, terminated, truncated) # => 4, 0, False, False
print(reward) # => 0
print(info) # => {'prob': 1.0} <- transition probability
observation, reward, terminated, truncated, info = env.step(RIGHT) # moves to 5
print(observation, reward, terminated, truncated) # => 5, 0, True (terminated), False
print(reward) # => 0
print(info) # => {'prob': 1.0} <- transition probability
The reset method places the agent at the 0th index and resets all parameters to their default values. The step method returns observation, reward, terminated, truncated, and info. The truncated output is used to prevent the agent from running indefinitely.
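For example, episodes can be truncated after a fixed number of steps by passing the max_episode_steps argument to gym.make. The following is a minimal sketch of this; the step limit of 5 is an arbitrary illustrative value.
env_limited = gym.make('FrozenLake-v1', is_slippery=False, max_episode_steps=5)
observation, info = env_limited.reset()
terminated, truncated = False, False
while not terminated and not truncated:
    # take random actions until the episode terminates or hits the step limit
    observation, reward, terminated, truncated, info = env_limited.step(env_limited.action_space.sample())
print(terminated, truncated)  # if 5 steps pass without reaching a hole or the goal, truncated is True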
We can build an algorithm that interacts with the data provided by the step method and the action space of the environment.
Agent
There are many ways to build an algorithm that interacts with the environment.
One approach is to create an Agent class, as follows:
import numpy as np

class RandomAgent():
    def __init__(self, env):
        self.env = env
        # uniform random policy: equal probability for every action in every state
        self.policy = np.ones([env.observation_space.n, env.action_space.n]) / env.action_space.n

    def reset(self):
        observation, info = self.env.reset()
        return observation, info

    def act(self, observation):
        # we could do `action = self.env.action_space.sample()` to achieve the same thing
        action = np.random.choice(np.arange(4), p=self.policy[observation])
        observation, reward, terminated, truncated, info = self.env.step(action)
        return action, observation, reward, terminated, truncated, info
The RandomAgent takes an environment as an attribute and generates a policy that stores the probability of taking each action in each state. The policy here is random, with a uniform distribution over actions at every state. The reset method wraps the env.reset method, and the act method selects an action based on the policy and the current observation, then executes it using the env.step method. The agent can experience multiple episodes as follows:
episodes = 10
expected_reward = 0
agent = RandomAgent(env)
for episode in range(episodes):
    print(f'Episode {episode}')
    observation, info = agent.reset()
    terminated = False
    truncated = False
    actions = []
    while not terminated and not truncated:
        action, observation, reward, terminated, truncated, info = agent.act(observation)
        actions.append(action)
    print(f' Actions: {actions}')
    print(f' Terminated: {terminated}')
    print(f' Truncated: {truncated}')
    print(f' Final Observation: {observation}')
    print(f' Final Reward : {reward}')
    expected_reward += reward
expected_reward /= episodes
print(f'Expected reward: {expected_reward}')
When running the above code, we can see that the agent almost always falls into a hole and rarely, if ever, reaches the goal. The goal of reinforcement learning is to implement an update or train method in the agent that learns to adjust its policy to achieve higher rewards.
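As a rough preview of what such an agent might look like, the following sketch extends RandomAgent with a tabular, Q-learning-style update; the class name, hyperparameters, and update rule are illustrative choices, and the underlying algorithms will be covered properly in upcoming articles.
class LearningAgent(RandomAgent):
    def __init__(self, env, alpha=0.1, gamma=0.99, epsilon=0.1):
        super().__init__(env)
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate
        # action-value estimates for every (state, action) pair
        self.q = np.zeros([env.observation_space.n, env.action_space.n])

    def act(self, observation):
        # epsilon-greedy action selection instead of the fixed uniform policy
        if np.random.rand() < self.epsilon:
            action = self.env.action_space.sample()
        else:
            action = int(np.argmax(self.q[observation]))
        next_observation, reward, terminated, truncated, info = self.env.step(action)
        return action, next_observation, reward, terminated, truncated, info

    def update(self, observation, action, reward, next_observation, terminated):
        # one-step temporal-difference (Q-learning-style) update of the action values
        target = reward if terminated else reward + self.gamma * np.max(self.q[next_observation])
        self.q[observation, action] += self.alpha * (target - self.q[observation, action])
A training loop would then call update after each act call, passing in the observation the action was taken from and the transition it produced.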
Spaces
There are multiple subclasses of the Space class that can be used in various scenarios. The Discrete space is the simplest one, defining a space consisting of integers. This class can be used for virtually any discrete space, though some spaces might be better represented using other classes. For example, we can use MultiBinary to represent the on/off states of switches, possibly across different machines.
from gymnasium.spaces import Discrete, MultiBinary, MultiDiscrete, Box

# Discrete
discrete = Discrete(2, start=-2, seed=42)
discrete.sample()     # => np.int64(-1)
# MultiBinary
multibinary = MultiBinary(3, seed=42)
multibinary.sample()  # => np.array([1, 0, 1])
multibinary = MultiBinary([3, 2], seed=42)
multibinary.sample()  # => np.array([[1, 0], [0, 0], [0, 1]])
The above code demonstrates the Discrete and MultiBinary spaces. All subclasses of the Space class share the sample method, which generates a sample state within the space. The MultiBinary class can accept either an integer or a list to define the dimensions of the binary array. The MultiDiscrete class is useful when expressing a space that combines multiple discrete spaces, such as a configuration involving arrow keys and number keys.
# MultiDiscrete
multidiscrete = MultiDiscrete([5, 11], seed=42)  # 4 arrow keys and 10 number keys (0 means no key pressed in each case)
multidiscrete.sample()  # => np.array([3, 8]) <- UP arrow key and 8 pressed
The Box class is possibly the most flexible of all the Space classes we have covered so far. It can represent arrays of any size, with any bounds and any numeric data type. The following example demonstrates how to use the Box class.
# Box (same bounds for all values)
box = Box(low=-1.0, high=2.0, shape=(3, 4), dtype=np.float32)
box.sample()
# array([[ 0.94306654, 0.3153641 , 0.23421921, -0.9436297 ],
# [-0.7621383 , 1.14396 , -0.48780602, 0.9031065 ],
# [ 0.8333457 , -0.67683184, -0.3102487 , 0.04513068]],
# dtype=float32)
# Box (different bounds for different values)
box = Box(low=np.array([-1.0, -2.0]), high=np.array([2.0, 4.0]), shape=(2,), dtype=np.int32)
box.sample()
# array([-1, 3], dtype=int32)
Beyond these fundamental spaces, there are composite spaces that combine multiple fundamental spaces, such as Dict, Tuple, Sequence, and Graph. If you are interested, I recommend checking the official Gymnasium documentation, which is cited at the bottom of this article.
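As a brief illustration (the sub-space names and bounds here are arbitrary), a Dict space labels its sub-spaces with keys, while a Tuple space simply orders them:
from gymnasium.spaces import Dict, Tuple

# Dict: named sub-spaces, convenient for structured observations
dict_space = Dict({'position': Box(0, 3, shape=(2,), dtype=int), 'has_key': MultiBinary(1)}, seed=42)
dict_space.sample()   # => a dict containing one sample from each sub-space
# Tuple: an ordered collection of sub-spaces
tuple_space = Tuple((Discrete(4), Box(low=-1.0, high=1.0, shape=(2,))), seed=42)
tuple_space.sample()  # => a tuple containing one sample from each sub-space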
Custom Environment
We can create our own environment using the Env and Space classes we have discussed. For example, we can use the Box space instead of the Discrete space to represent the state space when replicating the Frozen Lake environment, as shown below.
from typing import Optional
from gymnasium import Env

class FrozenLakeEnvironment(Env):
    def __init__(self):
        # State Space: the agent's (row, column) position on the 4x4 grid
        self.observation_space = Box(0, 3, shape=(2,), dtype=int)
        # Action Space
        self.action_space = Discrete(4)
        # directions expressed as (row delta, column delta)
        self._action_to_direction = {
            0: np.array([0, -1]),  # LEFT
            1: np.array([1, 0]),   # DOWN
            2: np.array([0, 1]),   # RIGHT
            3: np.array([-1, 0]),  # UP
        }
        # Agent's & Target's Locations
        self._agent_location = np.array([0, 0])
        self._target_location = np.array([3, 3])
        # Holes' Locations
        self._hole_locations = np.array([[1, 1], [1, 3], [2, 3], [3, 0]])
        # info (always prob=1 for a deterministic environment)
        self._info = {'prob': 1.0}

    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
        super().reset(seed=seed)
        self._agent_location = np.array([0, 0])
        observation = self._agent_location
        info = self._info
        return observation, info

    def step(self, action):
        direction = self._action_to_direction[action]
        # make sure the agent doesn't leave the grid by clipping
        self._agent_location = np.clip(self._agent_location + direction, 0, 3)
        success = np.array_equal(self._agent_location, self._target_location)
        fail = any([np.array_equal(self._agent_location, location) for location in self._hole_locations])
        terminated = any([success, fail])
        truncated = False
        reward = 1 if success else 0
        observation = self._agent_location
        info = self._info
        return observation, reward, terminated, truncated, info
Since we are dealing with 2D coordinates, we need to use _action_to_direction to translate discrete actions into coordinate directions and np.array_equal to compare the agent's, target's, and holes' locations. In addition to the reset and step methods, we can add a render method to display the current state, as shown below.
def render(self):
    map = np.array([["-" for _ in range(4)] for _ in range(4)])
    map[self._agent_location[0], self._agent_location[1]] = 'A'
    map[self._target_location[0], self._target_location[1]] = 'G'
    map[self._hole_locations[:, 0], self._hole_locations[:, 1]] = 'H'
    # If the agent has reached the goal, set the value to T for termination
    if np.array_equal(self._agent_location, self._target_location):
        map[self._agent_location[0], self._agent_location[1]] = 'T'
    # If the agent falls into a hole, set the value to F for failure
    for hole_location in self._hole_locations:
        if np.array_equal(self._agent_location, hole_location):
            map[self._agent_location[0], self._agent_location[1]] = 'F'
    print(map)
The above code creates a 4×4 grid with appropriate letters placed at the agent, target, and hole locations. This approach allows for better visualization of the agent’s observations, as demonstrated below.
env = FrozenLakeEnvironment()
episodes = 10
expected_reward = 0
for episode in range(episodes):
    print(f'Episode {episode}')
    observation, info = env.reset()
    terminated = False
    truncated = False
    actions = []
    while not terminated and not truncated:
        action = env.action_space.sample()
        observation, reward, terminated, truncated, info = env.step(action)
        actions.append(action)
    print(f' Final Reward : {reward}')
    env.render()
    expected_reward += reward
expected_reward /= episodes
print(f'Expected reward: {expected_reward}')
## Example output of env.render()
# [['-' '-' '-' '-']
# ['-' 'F' '-' 'H']
# ['-' '-' '-' 'H']
# ['H' '-' '-' 'G']]
Instead of printing out the index of a discrete space, which is difficult to interpret, we can render the Box space using a NumPy array for clearer visualization. Typically, predefined environments come with a corresponding render method that can visualize the state in various formats, which can be selected by passing the render_mode argument.
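For instance, the built-in Frozen Lake environment supports text-based rendering via render_mode='ansi' (it also offers 'human' and 'rgb_array' modes; the available modes depend on the environment). A quick sketch, reusing the DOWN constant defined earlier:
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode='ansi')
env.reset()
env.step(DOWN)       # move one step down
print(env.render())  # in 'ansi' mode, render() returns the grid as a text string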
Conclusion
In this article, we introduced Gymnasium, which provides standardized methods for implementing environments and spaces for reinforcement learning tasks. We explored a predefined environment, built an agent around it, and set up a custom environment with improved rendering. Even if we are not using the Gymnasium library and are instead building an environment from scratch, following Gymnasium's standardized structure and methods, such as reset and step, is essential for readability, flexibility, and scalability. Now that we have a clear understanding of how reinforcement learning tasks can be represented both mathematically and programmatically, we will move on to discussing algorithms that improve rewards beyond random action selection.
Resources
- Gymnasium Documentation. An API standard for reinforcement learning with a diverse collection of reference environments. https://gymnasium.farama.org/