9.2. Creating a new RL Example in OmniIsaacGymEnvs
A collection of reinforcement learning environments is provided at OmniIsaacGymEnvs, which can be used as examples and baselines to build new RL environments with Isaac Sim. In this tutorial, we will dive into the details of the tasking framework for running an RL environment.
9.2.1. Learning Objectives
We will walk through a simple example in OmniIsaacGymEnvs (OIGE) and learn how to set up a new example using the OIGE framework. We will
Introduce the tasking framework
Implement a Cartpole task
Register the task to the OIGE framework
Understand config files for the task
Run the Cartpole task
20-30 Minute Tutorial
9.2.2. Getting Started
Refer to Default Python Environment to learn about Isaac Sim’s python environment and locate the python executable in Isaac Sim.
Refer to Workstation Installation for setting up the Nucleus server for access to pre-trained checkpoints.
Refer to Overview & Getting Started for setting up OmniIsaacGymEnvs.
9.2.3. Introduction to the Tasking Framework
The main entry script for OmniIsaacGymEnvs is rlgames_train.py, which can be found here. This script parses input parameters using Hydra, a framework for handling configuration parameters. All tasks in OIGE define their config files as .yaml files, which are then parsed by Hydra.
The script is also responsible for initializing our environment object. In Isaac Sim, we define a VecEnvBase class in the omni.isaac.gym extension as an interface for connecting RL libraries with the simulation in Isaac Sim and our task. In OIGE, we extend the VecEnvBase class for the rlgames RL library, a vectorized GPU-based RL library we use for training in OIGE. This new environment object, VecEnvRLGames, is responsible for passing actions from the RL library to our task, stepping simulation, and retrieving the observation, reward, and reset buffers for the RL library.
When launching a task in OIGE, we use the following command:
PYTHON_PATH scripts/rlgames_train.py task=Cartpole
The rlgames_train.py script parses an important parameter from the command, task=Cartpole. This in turn initializes an instance of the Cartpole task using a utility defined in task_util.py. The task implementation is explained in the next section.
In addition to the task implementation, each task in OIGE is also paired with two config files: one for the task parameters and another for the training parameters. We will walk through the config files in detail in the later sections.
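Because these configs are handled by Hydra, parameters from them can also be overridden directly on the command line. For example, the following launch command uses the num_envs option (one of the command-line options listed in the OIGE documentation) to override the default number of environments:

PYTHON_PATH scripts/rlgames_train.py task=Cartpole num_envs=2048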
9.2.4. Task Implementation Walk-through - Cartpole
In this section, we will look at one of the simple examples in OmniIsaacGymEnvs, the Cartpole environment. We will look at each section of the Cartpole task and understand the various APIs used for defining the task. Code for the Cartpole environment is available here. The goal of this task is to balance the pole of the robot to stay in an upright position by gently sliding the cart along the rail.
The first piece of creating a new RL environment is to define a new class for the task. This task class will be responsible for performing resets, applying actions, collecting observations and computing rewards. In this example, we will define a class called CartpoleTask to implement our Cartpole environment.
The class takes in 3 required arguments and an optional offset argument, as shown in the __init__ method below.
class CartpoleTask(RLTask):
    def __init__(self, name, sim_config, env, offset=None) -> None:
The first argument is the name of the task, which will be parsed from the task config file. The second argument is a SimConfig object, which contains task and physics parameters parsed from the task config file. The SimConfig object will generally contain task information, such as the number of environments used for the task, and physics parameters, including simulation dt, GPU buffer dimensions, and physics properties for individual assets used in the task. The third argument is the environment object, which in our case will be a VecEnvRLGames object defined by the rlgames_train.py script. All of these parameters will be parsed and processed automatically by the task_util.py script, which is called when a training run is launched from the command line.
In our constructor method, we initialize a few key variables. self._num_observations and self._num_actions are required variables that must be set by the task implementation. For our Cartpole task, we use 4 observations to represent the joint positions and velocities of the cart and pole joints, and 1 action to apply as the force to the cart.
class CartpoleTask(RLTask):
    def __init__(self, name, sim_config, env, offset=None) -> None:
        self.update_config(sim_config)

        self._max_episode_length = 500

        # these must be defined in the task class
        self._num_observations = 4
        self._num_actions = 1

        # call the parent class constructor to initialize key RL variables
        RLTask.__init__(self, name, env)
In addition, we call the base RLTask class constructor to initialize key variables that are common to all task classes. This includes setting up domain randomization, defining default action and observation spaces if they are not specified by the task class, initializing the Cloner for environment setup, and allocating the reward, observation, and reset buffers for our task.
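For reference, these buffers are sized by the number of environments and the observation dimension defined above. The standalone sketch below is illustrative only (not the actual RLTask code) and shows buffers of the kind RLTask maintains for a vectorized task:

import torch

num_envs, num_observations = 512, 4
device = "cpu"  # OIGE typically places these buffers on a CUDA device, e.g. "cuda:0"

# per-environment buffers analogous to those RLTask allocates
obs_buf = torch.zeros((num_envs, num_observations), device=device)     # observations
rew_buf = torch.zeros(num_envs, device=device)                         # rewards
reset_buf = torch.ones(num_envs, device=device, dtype=torch.long)      # 1 = env needs reset
progress_buf = torch.zeros(num_envs, device=device, dtype=torch.long)  # steps since last reset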
Notice that in the first line of the constructor we call self.update_config(sim_config), a utility method for parsing the SimConfig object, as defined below.
def update_config(self, sim_config):
    # extract task config from main config dictionary
    self._sim_config = sim_config
    self._cfg = sim_config.config
    self._task_cfg = sim_config.task_config

    # parse task config parameters
    self._num_envs = self._task_cfg["env"]["numEnvs"]
    self._env_spacing = self._task_cfg["env"]["envSpacing"]
    self._cartpole_positions = torch.tensor([0.0, 0.0, 2.0])

    # reset and actions related variables
    self._reset_dist = self._task_cfg["env"]["resetDist"]
    self._max_push_effort = self._task_cfg["env"]["maxEffort"]
In this method, we extract the task config dictionary from the main config dictionary, allowing easier access to task-related parameters such as the number of environments, the spacing between environments, the distance that should trigger a reset, and the effort scaling applied to our actions.
Next, we will define our methods for setting up our simulation world. This is where we will define our robot asset, apply physics parameters to the robot, and clone it multiple times to build our vectorized environment.
def set_up_scene(self, scene) -> None:
    # first create a single environment
    self.get_cartpole()

    # call the parent class to clone the single environment
    super().set_up_scene(scene)

    # construct an ArticulationView object to hold our collection of environments
    self._cartpoles = ArticulationView(
        prim_paths_expr="/World/envs/.*/Cartpole", name="cartpole_view", reset_xform_properties=False
    )

    # register the ArticulationView object to the world, so that it can be initialized
    scene.add(self._cartpoles)

def get_cartpole(self):
    # add a single robot to the stage
    cartpole = Cartpole(
        prim_path=self.default_zero_env_path + "/Cartpole", name="Cartpole", translation=self._cartpole_positions
    )
    # applies articulation settings from the task configuration yaml file
    self._sim_config.apply_articulation_settings(
        "Cartpole", get_prim_at_path(cartpole.prim_path), self._sim_config.parse_actor_config("Cartpole")
    )
Note that each task class should define a set_up_scene method, as this will be triggered automatically by the Isaac Sim framework when initializing the world. In this method, we first construct a single environment (a single Cartpole robot) by calling get_cartpole(). This method creates an instance of the Cartpole robot class, which simply adds a Cartpole asset into the scene. Then, we use the SimConfig utilities to apply physics settings to the robot.
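For context, a robot class such as Cartpole is typically a thin wrapper that references a USD asset onto the stage. The sketch below is only an approximation of such a class; the exact asset path and internals of the OIGE Cartpole class may differ:

from omni.isaac.core.robots.robot import Robot
from omni.isaac.core.utils.nucleus import get_assets_root_path
from omni.isaac.core.utils.stage import add_reference_to_stage


class Cartpole(Robot):
    def __init__(self, prim_path, name="Cartpole", translation=None, orientation=None):
        # resolve the asset from the Nucleus server (the path shown here is an assumption)
        usd_path = get_assets_root_path() + "/Isaac/Robots/Cartpole/cartpole.usd"
        # reference the USD asset onto the stage at the requested prim path
        add_reference_to_stage(usd_path, prim_path)
        super().__init__(prim_path=prim_path, name=name, translation=translation, orientation=orientation)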
Once the single environment has been constructed, we call the RLTask super().set_up_scene(scene) method to further construct our stage. RLTask implements common utilities that are used by all tasks defined in OIGE. In set_up_scene, RLTask applies the Cloner to duplicate the single environment based on the number of environments defined in the task. It also processes settings from the config file to add a ground plane and additional light prims to the stage. In addition, collision filtering is applied to all clones of the environments to avoid interference across environments.
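As a concrete illustration of the cloning pattern (the paths shown follow the default OIGE layout, where the source environment lives under /World/envs/env_0):

# illustrative only: after cloning, each environment gets its own numbered prim path
num_envs = 4
paths = [f"/World/envs/env_{i}/Cartpole" for i in range(num_envs)]
# ['/World/envs/env_0/Cartpole', ..., '/World/envs/env_3/Cartpole']
# which is why the ArticulationView above matches them with "/World/envs/.*/Cartpole"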
After the environment setup is complete, we can then define our post_reset method, which will also be automatically triggered by the Isaac Sim framework at the very beginning of simulation.
def post_reset(self):
    # retrieve cart and pole joint indices
    self._cart_dof_idx = self._cartpoles.get_dof_index("cartJoint")
    self._pole_dof_idx = self._cartpoles.get_dof_index("poleJoint")

    # randomize all envs
    indices = torch.arange(self._cartpoles.count, dtype=torch.int64, device=self._device)
    self.reset_idx(indices)
In this method, we perform initializations with APIs that require simulation to be running, such as the vectorized tensor APIs available in ArticulationView. Here, we retrieve the joint indices of the cart and pole joints to be used when computing the observations and rewards. Then, we perform an initial reset for all of the environments to initialize them into a starting state for a new episode of RL training.
Below, we show the reset method for the Cartpole task. This method takes in one argument, env_ids, a tensor holding the indices of the environments that should be reset. In this task, we reset the joint positions and velocities of the Cartpole robot to a randomized state to ensure that the RL policy can learn to perform the task from arbitrary starting states.
def reset_idx(self, env_ids):
    num_resets = len(env_ids)

    # randomize DOF positions
    dof_pos = torch.zeros((num_resets, self._cartpoles.num_dof), device=self._device)
    dof_pos[:, self._cart_dof_idx] = 1.0 * (1.0 - 2.0 * torch.rand(num_resets, device=self._device))
    dof_pos[:, self._pole_dof_idx] = 0.125 * math.pi * (1.0 - 2.0 * torch.rand(num_resets, device=self._device))

    # randomize DOF velocities
    dof_vel = torch.zeros((num_resets, self._cartpoles.num_dof), device=self._device)
    dof_vel[:, self._cart_dof_idx] = 0.5 * (1.0 - 2.0 * torch.rand(num_resets, device=self._device))
    dof_vel[:, self._pole_dof_idx] = 0.25 * math.pi * (1.0 - 2.0 * torch.rand(num_resets, device=self._device))

    # apply randomized joint positions and velocities to environments
    indices = env_ids.to(dtype=torch.int32)
    self._cartpoles.set_joint_positions(dof_pos, indices=indices)
    self._cartpoles.set_joint_velocities(dof_vel, indices=indices)

    # reset the reset buffer and progress buffer after applying reset
    self.reset_buf[env_ids] = 0
    self.progress_buf[env_ids] = 0
Note that both the dof_pos and dof_vel tensors are initialized with dimensions covering all joints in the robot. This is required by the ArticulationView APIs set_joint_positions and set_joint_velocities, which are used to apply the reset states for the robots. However, we can provide an additional indices argument to the APIs to specify the environment indices that need to be reset. Values for indices that do not need to be reset can be left as zeros. Upon resetting the states of the environments, we also reset the reset buffer and progress buffer to zeros for the corresponding environments. The reset buffer reset_buf is a boolean tensor used to track which environments should be reset at each step, and the progress buffer progress_buf is an integer tensor that tracks the number of steps each environment has simulated through since it was last reset.
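The indexing pattern can be illustrated with a small standalone snippet (plain PyTorch, independent of Isaac Sim):

import torch

num_dof = 2                          # cart joint + pole joint
env_ids = torch.tensor([2, 5, 7])    # environments selected for reset
num_resets = len(env_ids)

# one row per environment being reset, one column per joint in the robot
dof_pos = torch.zeros((num_resets, num_dof))
dof_pos[:, 0] = 1.0 - 2.0 * torch.rand(num_resets)   # randomize only the cart column

# this full-width tensor would then be passed to set_joint_positions together with
# indices=env_ids.to(torch.int32), so only the selected environments are written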
At each step, before stepping the simulator, we should apply the actions from the RL policy to the robots so that they take effect during the next physics step. Similarly, we should also reset any environments that are marked for reset by the reset_buf, so that these environments will be in a fresh state after we step simulation. This logic is implemented in the pre_physics_step call, which takes in the actions from the RL policy as a parameter and is called prior to stepping simulation each step.
def pre_physics_step(self, actions) -> None:
    # make sure simulation has not been stopped from the UI
    if not self._env._world.is_playing():
        return

    # extract environment indices that need reset and reset them
    reset_env_ids = self.reset_buf.nonzero(as_tuple=False).squeeze(-1)
    if len(reset_env_ids) > 0:
        self.reset_idx(reset_env_ids)

    # make sure actions buffer is on the same device as the simulation
    actions = actions.to(self._device)

    # compute forces from the actions
    forces = torch.zeros((self._cartpoles.count, self._cartpoles.num_dof), dtype=torch.float32, device=self._device)
    forces[:, self._cart_dof_idx] = self._max_push_effort * actions[:, 0]

    # apply actions to all of the environments
    indices = torch.arange(self._cartpoles.count, dtype=torch.int32, device=self._device)
    self._cartpoles.set_joint_efforts(forces, indices=indices)
In pre_physics_step, we first check whether any environments are marked for reset. If so, we reset these environments by calling the reset_idx method explained above. Then, we process the actions received from the RL policy and translate the buffer into forces by applying a scaling factor. Note that, similar to the joint state APIs, we also need to initialize a forces tensor that covers the full dimensions for all joints in the robot. However, the RL policy only provides a single action for the cart joint, which we assign to our forces buffer accordingly. Finally, we apply the forces to the robot using the set_joint_efforts API from ArticulationView.
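The extraction of reset indices from the reset buffer can be verified in isolation with plain PyTorch:

import torch

reset_buf = torch.tensor([0, 1, 0, 1, 1])                     # envs 1, 3 and 4 are marked for reset
reset_env_ids = reset_buf.nonzero(as_tuple=False).squeeze(-1)  # indices of the non-zero entries
print(reset_env_ids)                                            # tensor([1, 3, 4])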
Next, we will implement our observations method, get_observations. This method extracts states from simulation and populates the observation buffer obs_buf with joint positions and velocities for the cart and pole joints. This information is sufficient for the RL policy to learn to perform the task. At the end of the method, we construct a dictionary containing the observation buffer and return it, following the API definition of the get_observations method in the Isaac Sim framework's BaseTask class.
def get_observations(self) -> dict:
    # retrieve joint positions and velocities
    dof_pos = self._cartpoles.get_joint_positions(clone=False)
    dof_vel = self._cartpoles.get_joint_velocities(clone=False)

    # extract joint states for the cart and pole joints
    cart_pos = dof_pos[:, self._cart_dof_idx]
    cart_vel = dof_vel[:, self._cart_dof_idx]
    pole_pos = dof_pos[:, self._pole_dof_idx]
    pole_vel = dof_vel[:, self._pole_dof_idx]

    # populate the observations buffer
    self.obs_buf[:, 0] = cart_pos
    self.obs_buf[:, 1] = cart_vel
    self.obs_buf[:, 2] = pole_pos
    self.obs_buf[:, 3] = pole_vel

    # construct the observations dictionary and return
    observations = {self._cartpoles.name: {"obs_buf": self.obs_buf}}
    return observations
From the observations, we can then compute the reward for the task. The reward function is defined in the calculate_metrics API, also following the interface from BaseTask. In this method, we calculate the reward based on the angle of the pole: the closer it is to 0, meaning the pole is upright, the higher the reward. In addition, we add small penalty terms, computed from the absolute values of the cart and pole joint velocities, to discourage the cart and pole from moving too fast. These terms help ensure that actions from the RL policy produce smooth movements for the robots. We also apply a larger penalty of -2 when the robot reaches a bad state: when the cart moves beyond the reset distance limit or the pole swings beyond 90 degrees in either direction. Finally, we assign the computed reward to the reward buffer, rew_buf, which will then be passed to the RL framework.
def calculate_metrics(self) -> None:
    # use states from the observation buffer to compute reward
    cart_pos = self.obs_buf[:, 0]
    cart_vel = self.obs_buf[:, 1]
    pole_angle = self.obs_buf[:, 2]
    pole_vel = self.obs_buf[:, 3]

    # define the reward function based on pole angle and robot velocities
    reward = 1.0 - pole_angle * pole_angle - 0.01 * torch.abs(cart_vel) - 0.5 * torch.abs(pole_vel)
    # penalize the policy if the cart moves too far on the rail
    reward = torch.where(torch.abs(cart_pos) > self._reset_dist, torch.ones_like(reward) * -2.0, reward)
    # penalize the policy if the pole moves beyond 90 degrees
    reward = torch.where(torch.abs(pole_angle) > np.pi / 2, torch.ones_like(reward) * -2.0, reward)

    # assign rewards to the reward buffer
    self.rew_buf[:] = reward
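A quick numeric check of the reward terms above, using illustrative values:

import torch

pole_angle = torch.tensor([0.0, 0.3])    # radians
cart_vel = torch.tensor([0.1, 1.0])
pole_vel = torch.tensor([0.0, 0.5])

reward = 1.0 - pole_angle * pole_angle - 0.01 * torch.abs(cart_vel) - 0.5 * torch.abs(pole_vel)
# upright and nearly still  -> 1.0 - 0.0  - 0.001 - 0.0  = 0.999
# tilted and moving quickly -> 1.0 - 0.09 - 0.01  - 0.25 = 0.65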
Lastly, we have a method for determining which environments need to be reset, implemented in is_done. There are multiple conditions under which an environment is due for a reset. The first is when the cart moves beyond the preset reset distance limit, meaning it has travelled too far along the rail and reached the end; in this case, we reset the robot to somewhere closer to the center of the rail. The second is when the pole has moved too far from the upright position. Lastly, if an environment has not been reset for the pre-defined max_episode_length number of steps, we force a reset so that it can continue to learn from new initial states. This helps the policy explore beyond local minima. Finally, we assign the resets to the reset buffer, reset_buf.
def is_done(self) -> None:
    cart_pos = self.obs_buf[:, 0]
    pole_pos = self.obs_buf[:, 2]

    # check for which conditions are met and mark the environments that satisfy the conditions
    resets = torch.where(torch.abs(cart_pos) > self._reset_dist, 1, 0)
    resets = torch.where(torch.abs(pole_pos) > math.pi / 2, 1, resets)
    resets = torch.where(self.progress_buf >= self._max_episode_length, 1, resets)

    # assign the resets to the reset buffer
    self.reset_buf[:] = resets
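The chained torch.where calls accumulate the reset conditions. A standalone example with illustrative values:

import math
import torch

cart_pos = torch.tensor([0.5, 3.5, 0.2])
pole_pos = torch.tensor([0.1, 0.0, 2.0])
progress_buf = torch.tensor([10, 20, 499])
reset_dist, max_episode_length = 3.0, 500

resets = torch.where(torch.abs(cart_pos) > reset_dist, 1, 0)          # env 1 exceeds the rail limit
resets = torch.where(torch.abs(pole_pos) > math.pi / 2, 1, resets)    # env 2 exceeds the angle limit
resets = torch.where(progress_buf >= max_episode_length, 1, resets)   # no env has timed out yet
print(resets)   # tensor([0, 1, 1])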
9.2.5. Register the New Task
All tasks in OIGE are registered through the task_util.py script. In this script, we will find a list of import statements for each available task and a dictionary mapping the name of each task to its corresponding class. When we add a new task to the OIGE framework, we need to make sure we import the task along with the existing tasks by adding
from omniisaacgymenvs.tasks.cartpole import CartpoleTask
as well as adding an entry to the task_map dictionary. The mapping is in the form of “TaskName”: ClassName.
"Cartpole": CartpoleTask,
9.2.6. Understanding Config Files
All tasks in OIGE share a common main config file, specifying parameters related to simulation and RL devices, training/inferencing modes, and task selection. Parameters in this config file can be specified through command line arguments. More detailed explanations of each parameter can be found here.
In addition, each task has two individual config files named after the task. The first config file specifies information related to the task and simulation, and must be named TaskName.yaml. For the Cartpole task, we use the Cartpole.yaml file.
The config file is as follows, with comments added to explain the parameters.
# name of the task - this should match the name used in the task mapping dictionary in task_util.py
name: Cartpole

# physics engine - only physx is currently supported. This value does not need to be modified.
physics_engine: ${..physics_engine}

# task-related parameters
env:
  # number of environments to create
  numEnvs: ${resolve_default:512,${...num_envs}}
  # spacing between each environment (in meters)
  envSpacing: 4.0

  # Cartpole reset distance limit
  resetDist: 3.0
  # Cartpole effort scaling
  maxEffort: 400.0

  # clip values in observation buffer to be within this range (-5.0 to +5.0)
  clipObservations: 5.0
  # clip values in actions to be within this range (-1.0 to +1.0)
  clipActions: 1.0
  # perform 2 simulation steps for every action (applies actions every 2 simulation steps)
  controlFrequencyInv: 2 # 60 Hz

# simulation related parameters
sim:
  # simulation dt (dt between each simulation step)
  dt: 0.0083 # 1/120 s
  # whether to use the GPU pipeline - data returned from Isaac Sim APIs will be on the GPU if set to True
  use_gpu_pipeline: ${eq:${...pipeline},"gpu"}
  # gravity vector for the simulation scene
  gravity: [0.0, 0.0, -9.81]
  # whether to add a ground plane to the world
  add_ground_plane: True
  # whether to add lighting to the world
  add_distant_light: False

  # enable flatcache - this is required for rendering
  use_flatcache: True
  # disable scene query - this will disable interaction with the scene to improve performance
  # this must be set to True for ray casting
  enable_scene_query_support: False
  # disable additional contact processing to improve performance. This should be set to True when using RigidContactView
  disable_contact_processing: False
  # set to True if you use camera sensors in the environment
  enable_cameras: False

  # default parameters if no additional physics materials are specified
  default_physics_material:
    static_friction: 1.0
    dynamic_friction: 1.0
    restitution: 0.0

  # PhysX related parameters
  # Additional USD physics schema documentation can be found here: https://docs.omniverse.nvidia.com/kit/docs/omni_usd_schema_physics/104.2/class_physx_schema_physx_scene_a_p_i.html
  physx:
    worker_thread_count: ${....num_threads}
    solver_type: ${....solver_type}
    use_gpu: ${eq:${....sim_device},"gpu"} # set to False to run on CPU
    solver_position_iteration_count: 4
    solver_velocity_iteration_count: 0
    contact_offset: 0.02
    rest_offset: 0.001
    bounce_threshold_velocity: 0.2
    friction_offset_threshold: 0.04
    friction_correlation_distance: 0.025
    enable_sleeping: True
    enable_stabilization: True
    max_depenetration_velocity: 100.0

    # GPU buffers
    gpu_max_rigid_contact_count: 524288
    gpu_max_rigid_patch_count: 81920
    gpu_found_lost_pairs_capacity: 1024
    gpu_found_lost_aggregate_pairs_capacity: 262144
    gpu_total_aggregate_pairs_capacity: 1024
    gpu_max_soft_body_contacts: 1048576
    gpu_max_particle_contacts: 1048576
    gpu_heap_capacity: 67108864
    gpu_temp_buffer_capacity: 16777216
    gpu_max_num_partitions: 8

  # each asset in the task can override physics parameters defined in the scene
  # the name of the asset must match the name of the ArticulationView for the asset in the task
  # additional Articulation and rigid body documentation can be found at https://docs.omniverse.nvidia.com/kit/docs/omni_usd_schema_physics/104.2/class_physx_schema_physx_articulation_a_p_i.html and https://docs.omniverse.nvidia.com/kit/docs/omni_usd_schema_physics/104.2/class_physx_schema_physx_rigid_body_a_p_i.html
  Cartpole:
    # a value of -1 means to use the same values defined in the physics scene
    override_usd_defaults: False
    enable_self_collisions: False
    enable_gyroscopic_forces: True
    # also in stage params
    # per-actor
    solver_position_iteration_count: 4
    solver_velocity_iteration_count: 0
    sleep_threshold: 0.005
    stabilization_threshold: 0.001
    # per-body
    density: -1
    max_depenetration_velocity: 100.0
    # per-shape
    contact_offset: 0.02
    rest_offset: 0.001
In addition to the task config file, we also use a config file for defining training parameters. This file will be parsed by the RLGames framework. For Cartpole, we use the PPO algorithm for training, so we name the config file CartpolePPO.yaml. Training parameters such as the network architecture, learning_rate, horizon_length, and gamma can be tuned in this file. An example of the Cartpole training config file is as follows.
params:
  seed: ${...seed}

  algo:
    name: a2c_continuous

  model:
    name: continuous_a2c_logstd

  network:
    name: actor_critic
    separate: False
    space:
      continuous:
        mu_activation: None
        sigma_activation: None
        mu_init:
          name: default
        sigma_init:
          name: const_initializer
          val: 0
        fixed_sigma: True
    mlp:
      units: [32, 32]
      activation: elu
      initializer:
        name: default
      regularizer:
        name: None

  load_checkpoint: ${if:${...checkpoint},True,False} # flag which sets whether to load the checkpoint
  load_path: ${...checkpoint} # path to the checkpoint to load

  config:
    name: ${resolve_default:Cartpole,${....experiment}}
    full_experiment_name: ${.name}
    device: ${....rl_device}
    device_name: ${....rl_device}
    env_name: rlgpu
    multi_gpu: ${....multi_gpu}
    ppo: True
    mixed_precision: False
    normalize_input: True
    normalize_value: True
    num_actors: ${....task.env.numEnvs}
    reward_shaper:
      scale_value: 0.1
    normalize_advantage: True
    gamma: 0.99
    tau: 0.95
    learning_rate: 3e-4
    lr_schedule: adaptive
    kl_threshold: 0.008
    score_to_win: 20000
    max_epochs: ${resolve_default:100,${....max_iterations}}
    save_best_after: 50
    save_frequency: 25
    grad_norm: 1.0
    entropy_coef: 0.0
    truncate_grads: True
    e_clip: 0.2
    horizon_length: 16
    minibatch_size: 8192
    mini_epochs: 8
    critic_coef: 4
    clip_value: True
    seq_len: 4
    bounds_loss_coef: 0.0001
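One relationship worth keeping in mind when tuning these values: rl_games collects horizon_length steps from each of the num_actors environments per PPO update, and minibatch_size is generally expected to divide that batch evenly. A quick sanity check for the values above:

horizon_length = 16
num_actors = 512                          # numEnvs from Cartpole.yaml
batch_size = horizon_length * num_actors  # 8192 transitions gathered per update
minibatch_size = 8192

assert batch_size % minibatch_size == 0   # 8192 / 8192 = 1 minibatch per mini-epoch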
9.2.7. Running the Example
To train the task using the viewer, run the following command, where the task argument holds the name of the task that was registered in task_util.py:
PYTHON_PATH scripts/rlgames_train.py task=Cartpole
We will see an Isaac Sim window pop up. Once Isaac Sim initialization completes (which may take a few minutes if launching for the first time), the Cartpole scene will be constructed and simulation will start running automatically. The process will terminate once training finishes.
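If you do not need to watch the simulation, training can also be run without the viewer by adding the headless=True option (one of the standard OIGE command-line flags), which typically trains faster:

PYTHON_PATH scripts/rlgames_train.py task=Cartpole headless=True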
Once the policy has been trained, we can load the trained checkpoint and perform inference (no training) by adding test=True as an argument, along with the checkpoint name.
PYTHON_PATH scripts/rlgames_train.py task=Cartpole test=True checkpoint=runs/Cartpole/nn/Cartpole.pth
9.2.8. Summary
This tutorial covered the following topics:
Introduction to the tasking framework
Implementation of the Cartpole task
Registering the task to the OIGE framework
Understanding config files for the task
Steps for running the Cartpole task
9.2.8.1. Next Steps
Continue on to the next tutorial in our Reinforcement Learning Tutorials series, Extension Workflow for RL, to learn about the extension workflow for reinforcement learning.
If you are interested in setting up a new reinforcement learning task outside of OmniIsaacGymEnvs, please see the tutorial, Custom RL Example using Stable Baselines, to learn about using the stable baselines library with Isaac Sim.