9.2. Creating a new RL Example in OmniIsaacGymEnvs
A collection of reinforcement learning environments is provided at OmniIsaacGymEnvs, which can be used as examples and baselines to build new RL environments with Isaac Sim. In this tutorial, we will dive into the details of the tasking framework for running an RL environment.
9.2.1. Learning Objectives
We will walk through a simple example in OmniIsaacGymEnvs (OIGE) and learn how to set up a new example using the OIGE framework. We will
Introduce the tasking framework
Implement a Cartpole task
Register the task to the OIGE framework
Understand config files for the task
Run the Cartpole task
20-30 Minute Tutorial
9.2.2. Getting Started
Refer to Default Python Environment to learn about Isaac Sim’s python environment and locate the python executable in Isaac Sim.
Refer to Workstation Installation for setting up the Nucleus server for access to pre-trained checkpoints.
Refer to Overview & Getting Started for setting up OmniIsaacGymEnvs.
9.2.3. Introduction to the Tasking Framework
The main entry script for OmniIsaacGymEnvs is rlgames_train.py, which can be found here. This script parses input parameters using Hydra, a framework for handling configuration parameters. All tasks in OIGE define their config files as .yaml files, which are then parsed by Hydra.
The script is also responsible for initializing our environment object. In Isaac Sim, we define a VecEnvBase class in the omni.isaac.gym extension as an interface for connecting RL libraries with the simulation in Isaac Sim and our task. In OIGE, we extend the VecEnvBase class for the rlgames RL library, a vectorized GPU-based RL library we use for training in OIGE. This new environment object, VecEnvRLGames, is responsible for passing actions from the RL library to our task, stepping simulation, and retrieving the observation, reward, and reset buffers for the RL library.
When launching a task in OIGE, we use the following command:
PYTHON_PATH scripts/rlgames_train.py task=Cartpole
The rlgames_train.py script parses an important parameter from the command, task=Cartpole. This in turn initializes an instance of the Cartpole task using a utility defined in task_util.py. The task implementation is explained in the next section.
In addition to the task implementation, each task in OIGE is also paired with two config files: one for the task parameters and another for the training parameters. We will walk through the config files in detail in the later sections.
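Because these configs are handled by Hydra, parameters from them can also be overridden directly on the command line. For example, the following launch command uses the num_envs option (one of the command-line options listed in the OIGE documentation) to override the default number of environments:

PYTHON_PATH scripts/rlgames_train.py task=Cartpole num_envs=2048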
9.2.4. Task Implementation Walk-through - Cartpole
In this section, we will look at one of the simple examples in OmniIsaacGymEnvs, the Cartpole environment. We will look at each section of the Cartpole task and understand the various APIs used for defining the task. Code for the Cartpole environment is available here. The goal of this task is to balance the pole of the robot to stay in an upright position by gently sliding the cart along the rail.
The first piece of creating a new RL environment is to define a new class for the task. This task class will be responsible for performing resets, applying actions, collecting observations and computing rewards. In this example, we will define a class called CartpoleTask to implement our Cartpole environment.
The class takes in 3 required arguments and an optional offset argument, as shown in the __init__ method below.
class CartpoleTask(RLTask):
    def __init__(self, name, sim_config, env, offset=None) -> None:
The first argument is the name of the task, which will be parsed from the task config file. The second argument is a SimConfig object, which contains task and physics parameters parsed from the task config file. The SimConfig object will generally contain task information, such as the number of environments used for the task, and physics parameters, including simulation dt, GPU buffer dimensions, and physics properties for individual assets used in the task. The third argument is the environment object, which in our case will be a VecEnvRLGames object defined by the rlgames_train.py script. All of these parameters will be parsed and processed automatically by the task_util.py script, which is called when a training run is launched from the command line.
In our constructor method, we initialize a few key variables. self._num_observations and self._num_actions are required variables that must be set by the task implementation. For our Cartpole task, we use 4 observations to represent the joint positions and velocities of the cart and pole joints, and 1 action to apply as the force to the cart.
class CartpoleTask(RLTask):
    def __init__(self, name, sim_config, env, offset=None) -> None:
        self.update_config(sim_config)

        self._max_episode_length = 500

        # these must be defined in the task class
        self._num_observations = 4
        self._num_actions = 1

        # call the parent class constructor to initialize key RL variables
        RLTask.__init__(self, name, env)
In addition, we call the base RLTask class constructor to initialize key variables that are common to all task classes. This includes setting up domain randomization, defining default action and observation spaces if they are not specified by the task class, initializing the Cloner for environment setup, and allocating the reward, observation, and reset buffers for our task.
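For reference, these buffers are sized by the number of environments and the observation dimension defined above. The standalone sketch below is illustrative only (not the actual RLTask code) and shows buffers of the kind RLTask maintains for a vectorized task:

import torch

num_envs, num_observations = 512, 4
device = "cpu"  # OIGE typically places these buffers on a CUDA device, e.g. "cuda:0"

# per-environment buffers analogous to those RLTask allocates
obs_buf = torch.zeros((num_envs, num_observations), device=device)     # observations
rew_buf = torch.zeros(num_envs, device=device)                         # rewards
reset_buf = torch.ones(num_envs, device=device, dtype=torch.long)      # 1 = env needs reset
progress_buf = torch.zeros(num_envs, device=device, dtype=torch.long)  # steps since last reset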
Notice that in the first line of the constructor we call self.update_config(sim_config), a utility method for parsing the SimConfig object, as defined below.
def update_config(self, sim_config):
    # extract task config from main config dictionary
    self._sim_config = sim_config
    self._cfg = sim_config.config
    self._task_cfg = sim_config.task_config

    # parse task config parameters
    self._num_envs = self._task_cfg["env"]["numEnvs"]
    self._env_spacing = self._task_cfg["env"]["envSpacing"]
    self._cartpole_positions = torch.tensor([0.0, 0.0, 2.0])

    # reset and actions related variables
    self._reset_dist = self._task_cfg["env"]["resetDist"]
    self._max_push_effort = self._task_cfg["env"]["maxEffort"]
In this method, we extract the task config dictionary from the main config dictionary, allowing easier access to task-related parameters such as the number of environments, the spacing between environments, the distance that should trigger a reset, and the effort scaling applied to our actions.
Next, we will define our methods for setting up our simulation world. This is where we will define our robot asset, apply physics parameters to the robot, and clone it multiple times to build our vectorized environment.
def set_up_scene(self, scene) -> None:
    # first create a single environment
    self.get_cartpole()

    # call the parent class to clone the single environment
    super().set_up_scene(scene)

    # construct an ArticulationView object to hold our collection of environments
    self._cartpoles = ArticulationView(
        prim_paths_expr="/World/envs/.*/Cartpole", name="cartpole_view", reset_xform_properties=False
    )

    # register the ArticulationView object to the world, so that it can be initialized
    scene.add(self._cartpoles)

def get_cartpole(self):
    # add a single robot to the stage
    cartpole = Cartpole(
        prim_path=self.default_zero_env_path + "/Cartpole", name="Cartpole", translation=self._cartpole_positions
    )
    # applies articulation settings from the task configuration yaml file
    self._sim_config.apply_articulation_settings(
        "Cartpole", get_prim_at_path(cartpole.prim_path), self._sim_config.parse_actor_config("Cartpole")
    )
Note that each task class should define a set_up_scene method, as this will be triggered automatically by the Isaac Sim framework when initializing the world. In this method, we first construct a single environment (a single Cartpole robot) by calling get_cartpole(). This method creates an instance of the Cartpole robot class, which simply adds a Cartpole asset into the scene. Then, we use the SimConfig utilities to apply physics settings to the robot.
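For context, a robot class such as Cartpole is typically a thin wrapper that references a USD asset onto the stage. The sketch below is only an approximation of such a class; the exact asset path and internals of the OIGE Cartpole class may differ:

from omni.isaac.core.robots.robot import Robot
from omni.isaac.core.utils.nucleus import get_assets_root_path
from omni.isaac.core.utils.stage import add_reference_to_stage


class Cartpole(Robot):
    def __init__(self, prim_path, name="Cartpole", translation=None, orientation=None):
        # resolve the asset from the Nucleus server (the path shown here is an assumption)
        usd_path = get_assets_root_path() + "/Isaac/Robots/Cartpole/cartpole.usd"
        # reference the USD asset onto the stage at the requested prim path
        add_reference_to_stage(usd_path, prim_path)
        super().__init__(prim_path=prim_path, name=name, translation=translation, orientation=orientation)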
Once the single environment has been constructed, we call the RLTask super().set_up_scene(scene) method to further construct our stage. RLTask implements common utilities that are used by all tasks defined in OIGE. In set_up_scene, RLTask applies the Cloner to duplicate the single environment based on the number of environments defined in the task. It also processes settings from the config file to add a ground plane and additional light prims to the stage. In addition, collision filtering is applied to all clones of the environments to avoid interference across environments.
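As a concrete illustration of the cloning pattern (the paths shown follow the default OIGE layout, where the source environment lives under /World/envs/env_0):

# illustrative only: after cloning, each environment gets its own numbered prim path
num_envs = 4
paths = [f"/World/envs/env_{i}/Cartpole" for i in range(num_envs)]
# ['/World/envs/env_0/Cartpole', ..., '/World/envs/env_3/Cartpole']
# which is why the ArticulationView above matches them with "/World/envs/.*/Cartpole"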
After the environment setup is complete, we can then define our post_reset method, which will also be automatically triggered by the Isaac Sim framework at the very beginning of simulation.
def post_reset(self):
    # retrieve cart and pole joint indices
    self._cart_dof_idx = self._cartpoles.get_dof_index("cartJoint")
    self._pole_dof_idx = self._cartpoles.get_dof_index("poleJoint")

    # randomize all envs
    indices = torch.arange(self._cartpoles.count, dtype=torch.int64, device=self._device)
    self.reset_idx(indices)
In this method, we perform initializations with APIs that require simulation to be running, such as the vectorized tensor APIs available in ArticulationView. Here, we retrieve the joint indices of the cart and pole joints to be used when computing the observations and rewards. Then, we perform an initial reset for all of the environments to initialize them into a starting state for a new episode of RL training.
Below, we show the reset method for the Cartpole task. This method takes in one argument, env_ids, a tensor holding the indices of the environments that should be reset. In this task, we reset the joint positions and velocities of the Cartpole robot to a randomized state to ensure that the RL policy can learn to perform the task from arbitrary starting states.
def reset_idx(self, env_ids):
    num_resets = len(env_ids)

    # randomize DOF positions
    dof_pos = torch.zeros((num_resets, self._cartpoles.num_dof), device=self._device)
    dof_pos[:, self._cart_dof_idx] = 1.0 * (1.0 - 2.0 * torch.rand(num_resets, device=self._device))
    dof_pos[:, self._pole_dof_idx] = 0.125 * math.pi * (1.0 - 2.0 * torch.rand(num_resets, device=self._device))

    # randomize DOF velocities
    dof_vel = torch.zeros((num_resets, self._cartpoles.num_dof), device=self._device)
    dof_vel[:, self._cart_dof_idx] = 0.5 * (1.0 - 2.0 * torch.rand(num_resets, device=self._device))
    dof_vel[:, self._pole_dof_idx] = 0.25 * math.pi * (1.0 - 2.0 * torch.rand(num_resets, device=self._device))

    # apply randomized joint positions and velocities to environments
    indices = env_ids.to(dtype=torch.int32)
    self._cartpoles.set_joint_positions(dof_pos, indices=indices)
    self._cartpoles.set_joint_velocities(dof_vel, indices=indices)

    # reset the reset buffer and progress buffer after applying reset
    self.reset_buf[env_ids] = 0
    self.progress_buf[env_ids] = 0
Note that both the dof_pos and dof_vel tensors are initialized with dimensions covering all joints in the robot. This is required by the ArticulationView APIs set_joint_positions and set_joint_velocities, which are used to apply the reset states for the robots. However, we can provide an additional indices argument to the APIs to specify the environment indices that need to be reset. Values for indices that do not need to be reset can be left as zeros. Upon resetting the states of the environments, we also reset the reset buffer and progress buffer to zeros for the corresponding environments. The reset buffer reset_buf is a boolean tensor used to track which environments should be reset at each step, and the progress buffer progress_buf is an integer tensor that tracks the number of steps each environment has simulated through since it was last reset.
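The indexing pattern can be illustrated with a small standalone snippet (plain PyTorch, independent of Isaac Sim):

import torch

num_dof = 2                          # cart joint + pole joint
env_ids = torch.tensor([2, 5, 7])    # environments selected for reset
num_resets = len(env_ids)

# one row per environment being reset, one column per joint in the robot
dof_pos = torch.zeros((num_resets, num_dof))
dof_pos[:, 0] = 1.0 - 2.0 * torch.rand(num_resets)   # randomize only the cart column

# this full-width tensor would then be passed to set_joint_positions together with
# indices=env_ids.to(torch.int32), so only the selected environments are written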
At each step, before stepping the simulator, we should apply the actions from the RL policy to the robots so that they take effect during the next physics step. Similarly, we should also reset any environments that are marked for reset by the reset_buf, so that these environments will be in a fresh state after we step simulation. This logic is implemented in the pre_physics_step call, which takes in the actions from the RL policy as a parameter and is called prior to stepping simulation each step.
def pre_physics_step(self, actions) -> None:
    # make sure simulation has not been stopped from the UI
    if not self._env._world.is_playing():
        return

    # extract environment indices that need reset and reset them
    reset_env_ids = self.reset_buf.nonzero(as_tuple=False).squeeze(-1)
    if len(reset_env_ids) > 0:
        self.reset_idx(reset_env_ids)

    # make sure actions buffer is on the same device as the simulation
    actions = actions.to(self._device)

    # compute forces from the actions
    forces = torch.zeros((self._cartpoles.count, self._cartpoles.num_dof), dtype=torch.float32, device=self._device)
    forces[:, self._cart_dof_idx] = self._max_push_effort * actions[:, 0]

    # apply actions to all of the environments
    indices = torch.arange(self._cartpoles.count, dtype=torch.int32, device=self._device)
    self._cartpoles.set_joint_efforts(forces, indices=indices)
In pre_physics_step, we first check whether any environments are marked for reset. If so, we reset these environments by calling the reset_idx method explained above. Then, we process the actions received from the RL policy and translate the buffer into forces by applying a scaling factor. Note that, similar to the joint state APIs, we also need to initialize a forces tensor that covers the full dimensions for all joints in the robot. However, the RL policy only provides a single action for the cart joint, which we assign to our forces buffer accordingly. Finally, we apply the forces to the robot using the set_joint_efforts API from ArticulationView.
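The extraction of reset indices from the reset buffer can be verified in isolation with plain PyTorch:

import torch

reset_buf = torch.tensor([0, 1, 0, 1, 1])                     # envs 1, 3 and 4 are marked for reset
reset_env_ids = reset_buf.nonzero(as_tuple=False).squeeze(-1)  # indices of the non-zero entries
print(reset_env_ids)                                            # tensor([1, 3, 4])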
Next, we will implement our observations method, get_observations. This method extracts states from simulation and populates the observation buffer obs_buf with joint positions and velocities for the cart and pole joints. This information is sufficient for the RL policy to learn to perform the task. At the end of the method, we construct a dictionary containing the observation buffer and return it, following the API definition of the get_observations method in the Isaac Sim framework's BaseTask class.
def get_observations(self) -> dict:
    # retrieve joint positions and velocities
    dof_pos = self._cartpoles.get_joint_positions(clone=False)
    dof_vel = self._cartpoles.get_joint_velocities(clone=False)

    # extract joint states for the cart and pole joints
    cart_pos = dof_pos[:, self._cart_dof_idx]
    cart_vel = dof_vel[:, self._cart_dof_idx]
    pole_pos = dof_pos[:, self._pole_dof_idx]
    pole_vel = dof_vel[:, self._pole_dof_idx]

    # populate the observations buffer
    self.obs_buf[:, 0] = cart_pos
    self.obs_buf[:, 1] = cart_vel
    self.obs_buf[:, 2] = pole_pos
    self.obs_buf[:, 3] = pole_vel

    # construct the observations dictionary and return
    observations = {self._cartpoles.name: {"obs_buf": self.obs_buf}}
    return observations
From the observations, we can then compute the reward for the task. The reward function is defined in the calculate_metrics API, also following the interface from BaseTask. In this method, we calculate the reward based on the angle of the pole: the closer it is to 0, meaning the pole is upright, the higher the reward. In addition, we add small penalty terms, computed from the absolute values of the cart and pole joint velocities, to discourage the cart and pole from moving too fast. These terms help ensure that actions from the RL policy produce smooth movements for the robots. We also apply a larger penalty of -2 when the robot reaches a bad state: when the cart moves beyond the reset distance limit or the pole swings beyond 90 degrees in either direction. Finally, we assign the computed reward to the reward buffer, rew_buf, which will then be passed to the RL framework.
def calculate_metrics(self) -> None:
    # use states from the observation buffer to compute reward
    cart_pos = self.obs_buf[:, 0]
    cart_vel = self.obs_buf[:, 1]
    pole_angle = self.obs_buf[:, 2]
    pole_vel = self.obs_buf[:, 3]

    # define the reward function based on pole angle and robot velocities
    reward = 1.0 - pole_angle * pole_angle - 0.01 * torch.abs(cart_vel) - 0.5 * torch.abs(pole_vel)
    # penalize the policy if the cart moves too far on the rail
    reward = torch.where(torch.abs(cart_pos) > self._reset_dist, torch.ones_like(reward) * -2.0, reward)
    # penalize the policy if the pole moves beyond 90 degrees
    reward = torch.where(torch.abs(pole_angle) > np.pi / 2, torch.ones_like(reward) * -2.0, reward)

    # assign rewards to the reward buffer
    self.rew_buf[:] = reward
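A quick numeric check of the reward terms above, using illustrative values:

import torch

pole_angle = torch.tensor([0.0, 0.3])    # radians
cart_vel = torch.tensor([0.1, 1.0])
pole_vel = torch.tensor([0.0, 0.5])

reward = 1.0 - pole_angle * pole_angle - 0.01 * torch.abs(cart_vel) - 0.5 * torch.abs(pole_vel)
# upright and nearly still  -> 1.0 - 0.0  - 0.001 - 0.0  = 0.999
# tilted and moving quickly -> 1.0 - 0.09 - 0.01  - 0.25 = 0.65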
Lastly, we have a method for determining which environments need to be reset, implemented in is_done. There are multiple conditions under which an environment is due for a reset. The first is when the cart moves beyond the preset reset distance limit, meaning it has travelled too far along the rail and reached the end; in this case, we reset the robot to somewhere closer to the center of the rail. The second is when the pole has moved too far from the upright position. Lastly, if an environment has not been reset for the pre-defined max_episode_length number of steps, we force a reset so that it can continue to learn from new initial states. This helps the policy explore beyond local minima. Finally, we assign the resets to the reset buffer, reset_buf.
def is_done(self) -> None:
    cart_pos = self.obs_buf[:, 0]
    pole_pos = self.obs_buf[:, 2]

    # check for which conditions are met and mark the environments that satisfy the conditions
    resets = torch.where(torch.abs(cart_pos) > self._reset_dist, 1, 0)
    resets = torch.where(torch.abs(pole_pos) > math.pi / 2, 1, resets)
    resets = torch.where(self.progress_buf >= self._max_episode_length, 1, resets)

    # assign the resets to the reset buffer
    self.reset_buf[:] = resets
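The chained torch.where calls accumulate the reset conditions. A standalone example with illustrative values:

import math
import torch

cart_pos = torch.tensor([0.5, 3.5, 0.2])
pole_pos = torch.tensor([0.1, 0.0, 2.0])
progress_buf = torch.tensor([10, 20, 499])
reset_dist, max_episode_length = 3.0, 500

resets = torch.where(torch.abs(cart_pos) > reset_dist, 1, 0)          # env 1 exceeds the rail limit
resets = torch.where(torch.abs(pole_pos) > math.pi / 2, 1, resets)    # env 2 exceeds the angle limit
resets = torch.where(progress_buf >= max_episode_length, 1, resets)   # no env has timed out yet
print(resets)   # tensor([0, 1, 1])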
9.2.5. Register the New Task
All tasks in OIGE are registered through the task_util.py script. In this script, we will find a list of import statements for each available task and a dictionary mapping the name of each task to its corresponding class. When we add a new task to the OIGE framework, we need to make sure we import the task along with the existing tasks by adding
from omniisaacgymenvs.tasks.cartpole import CartpoleTask
as well as adding an entry to the task_map dictionary. The mapping is in the form of “TaskName”: ClassName.
"Cartpole": CartpoleTask,
9.2.6. Understanding Config Files
All tasks in OIGE share a common main config file, specifying parameters related to simulation and RL devices, training/inferencing modes, and task selection. Parameters in this config file can be specified through command line arguments. More detailed explanations of each parameter can be found here.
In addition, each task has two individual config files named after the task. The first config file specifies information related to the task and simulation, and must be named TaskName.yaml. For the Cartpole task, we use the Cartpole.yaml file.
The config file is as follows, with comments added to explain the parameters.
# name of the task - this should match the name used in the task mapping dictionary in task_util.py
name: Cartpole

# physics engine - only physx is currently supported. This value does not need to be modified.
physics_engine: ${..physics_engine}

# task-related parameters
env:
  # number of environments to create
  numEnvs: ${resolve_default:512,${...num_envs}}
  # spacing between each environment (in meters)
  envSpacing: 4.0

  # Cartpole reset distance limit
  resetDist: 3.0
  # Cartpole effort scaling
  maxEffort: 400.0

  # clip values in observation buffer to be within this range (-5.0 to +5.0)
  clipObservations: 5.0
  # clip values in actions to be within this range (-1.0 to +1.0)
  clipActions: 1.0
  # perform 2 simulation steps for every action (applies actions every 2 simulation steps)
  controlFrequencyInv: 2 # 60 Hz

# simulation related parameters
sim:
  # simulation dt (dt between each simulation step)
  dt: 0.0083 # 1/120 s
  # whether to use the GPU pipeline - data returned from Isaac Sim APIs will be on the GPU if set to True
  use_gpu_pipeline: ${eq:${...pipeline},"gpu"}
  # gravity vector for the simulation scene
  gravity: [0.0, 0.0, -9.81]
  # whether to add a ground plane to the world
  add_ground_plane: True
  # whether to add lighting to the world
  add_distant_light: False

  # enable flatcache - this is required for rendering
  use_flatcache: True
  # disable scene query - this will disable interaction with the scene to improve performance
  # this must be set to True for ray casting
  enable_scene_query_support: False
  # disable additional contact processing to improve performance. This should be set to True when using RigidContactView
  disable_contact_processing: False
  # set to True if you use camera sensors in the environment
  enable_cameras: False

  # default parameters if no additional physics materials are specified
  default_physics_material:
    static_friction: 1.0
    dynamic_friction: 1.0
    restitution: 0.0

  # PhysX related parameters
  # Additional USD physics schema documentation can be found here: https://docs.omniverse.nvidia.com/kit/docs/omni_usd_schema_physics/104.2/class_physx_schema_physx_scene_a_p_i.html
  physx:
    worker_thread_count: ${....num_threads}
    solver_type: ${....solver_type}
    use_gpu: ${eq:${....sim_device},"gpu"} # set to False to run on CPU
    solver_position_iteration_count: 4
    solver_velocity_iteration_count: 0
    contact_offset: 0.02
    rest_offset: 0.001
    bounce_threshold_velocity: 0.2
    friction_offset_threshold: 0.04
    friction_correlation_distance: 0.025
    enable_sleeping: True
    enable_stabilization: True
    max_depenetration_velocity: 100.0

    # GPU buffers
    gpu_max_rigid_contact_count: 524288
    gpu_max_rigid_patch_count: 81920
    gpu_found_lost_pairs_capacity: 1024
    gpu_found_lost_aggregate_pairs_capacity: 262144
    gpu_total_aggregate_pairs_capacity: 1024
    gpu_max_soft_body_contacts: 1048576
    gpu_max_particle_contacts: 1048576
    gpu_heap_capacity: 67108864
    gpu_temp_buffer_capacity: 16777216
    gpu_max_num_partitions: 8

  # each asset in the task can override physics parameters defined in the scene
  # the name of the asset must match the name of the ArticulationView for the asset in the task
  # additional Articulation and rigid body documentation can be found at https://docs.omniverse.nvidia.com/kit/docs/omni_usd_schema_physics/104.2/class_physx_schema_physx_articulation_a_p_i.html and https://docs.omniverse.nvidia.com/kit/docs/omni_usd_schema_physics/104.2/class_physx_schema_physx_rigid_body_a_p_i.html
  Cartpole:
    # a value of -1 means to use the same values defined in the physics scene
    override_usd_defaults: False
    enable_self_collisions: False
    enable_gyroscopic_forces: True
    # also in stage params
    # per-actor
    solver_position_iteration_count: 4
    solver_velocity_iteration_count: 0
    sleep_threshold: 0.005
    stabilization_threshold: 0.001
    # per-body
    density: -1
    max_depenetration_velocity: 100.0
    # per-shape
    contact_offset: 0.02
    rest_offset: 0.001
In addition to the task config file, we also use a config file for defining training parameters. This file will be parsed by the RLGames framework. For Cartpole, we use the PPO algorithm for training, so we name the config file CartpolePPO.yaml. Training parameters such as the network architecture, learning_rate, horizon_length, and gamma can be tuned in this file. An example of the Cartpole training config file is as follows.
params:
  seed: ${...seed}

  algo:
    name: a2c_continuous

  model:
    name: continuous_a2c_logstd

  network:
    name: actor_critic
    separate: False
    space:
      continuous:
        mu_activation: None
        sigma_activation: None
        mu_init:
          name: default
        sigma_init:
          name: const_initializer
          val: 0
        fixed_sigma: True
    mlp:
      units: [32, 32]
      activation: elu
      initializer:
        name: default
      regularizer:
        name: None

  load_checkpoint: ${if:${...checkpoint},True,False} # flag which sets whether to load the checkpoint
  load_path: ${...checkpoint} # path to the checkpoint to load

  config:
    name: ${resolve_default:Cartpole,${....experiment}}
    full_experiment_name: ${.name}
    device: ${....rl_device}
    device_name: ${....rl_device}
    env_name: rlgpu
    multi_gpu: ${....multi_gpu}
    ppo: True
    mixed_precision: False
    normalize_input: True
    normalize_value: True
    num_actors: ${....task.env.numEnvs}
    reward_shaper:
      scale_value: 0.1
    normalize_advantage: True
    gamma: 0.99
    tau: 0.95
    learning_rate: 3e-4
    lr_schedule: adaptive
    kl_threshold: 0.008
    score_to_win: 20000
    max_epochs: ${resolve_default:100,${....max_iterations}}
    save_best_after: 50
    save_frequency: 25
    grad_norm: 1.0
    entropy_coef: 0.0
    truncate_grads: True
    e_clip: 0.2
    horizon_length: 16
    minibatch_size: 8192
    mini_epochs: 8
    critic_coef: 4
    clip_value: True
    seq_len: 4
    bounds_loss_coef: 0.0001
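One relationship worth keeping in mind when tuning these values: rl_games collects horizon_length steps from each of the num_actors environments per PPO update, and minibatch_size is generally expected to divide that batch evenly. A quick sanity check for the values above:

horizon_length = 16
num_actors = 512                          # numEnvs from Cartpole.yaml
batch_size = horizon_length * num_actors  # 8192 transitions gathered per update
minibatch_size = 8192

assert batch_size % minibatch_size == 0   # 8192 / 8192 = 1 minibatch per mini-epoch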
9.2.7. Running the Example
To train the task using the viewer, run the following command, where the task argument holds the name of the task that was registered in task_util.py:
PYTHON_PATH scripts/rlgames_train.py task=Cartpole
We will see an Isaac Sim window pop up. Once Isaac Sim initialization completes (which may take a few minutes if launching for the first time), the Cartpole scene will be constructed and simulation will start running automatically. The process will terminate once training finishes.
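If you do not need to watch the simulation, training can also be run without the viewer by adding the headless=True option (one of the standard OIGE command-line flags), which typically trains faster:

PYTHON_PATH scripts/rlgames_train.py task=Cartpole headless=True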
Once the policy has been trained, we can load the trained checkpoint and perform inference (no training) by adding test=True as an argument, along with the checkpoint name.
PYTHON_PATH scripts/rlgames_train.py task=Cartpole test=True checkpoint=runs/Cartpole/nn/Cartpole.pth
9.2.8. Summary
This tutorial covered the following topics:
Introduction to the tasking framework
Implementation of the Cartpole task
Registering the task to the OIGE framework
Understanding config files for the task
Steps for running the Cartpole task
9.2.8.1. Next Steps
Continue on to the next tutorial in our Reinforcement Learning Tutorials series, Extension Workflow for RL, to learn about the extension workflow for reinforcement learning.
If you are interested in setting up a new reinforcement learning task outside of OmniIsaacGymEnvs, please see the tutorial, Custom RL Example using Stable Baselines, to learn about using the stable baselines library with Isaac Sim.