10.7. Training Pose Estimation Model with Synthetic Data
10.7.1. Learning Objectives
50-60 min tutorial
Prerequisites
This tutorial requires a working knowledge of the Offline Pose Estimation tutorial.
This tutorial also requires a basic understanding of how to submit jobs on NGC using Base Command. Documentation for how to do so can be found here.
10.7.2. Generating Data on NGC
Generating data on NGC using the OVX clusters allows you to drastically increase the amount of data you can generate compared to your local machine.
We use the OVX clusters for data generation since they are optimized for rendering jobs.
For training, we will use the DGX clusters, which are optimized for machine learning.
Because we will be using two different clusters for generation and training, we will automatically save our generated data to an s3 bucket, which we will then use to load data during training.
Building Your Own Container for Data Generation
To build a container to run on NGC, we can use a Dockerfile. To do so, copy the contents below into a file called Dockerfile and place it in standalone_examples/replicator/offline_pose_generation.
Dockerfile to create container for NGC
# See https://catalog.ngc.nvidia.com/orgs/nvidia/containers/isaac-sim
# for instructions on how to run this container
FROM nvcr.io/nvidia/isaac-sim:2023.1.1

RUN apt-get update && export DEBIAN_FRONTEND=noninteractive && apt-get install s3cmd -y

# Copies over latest changes to pose generation code when building the container
COPY ./ standalone_examples/replicator/offline_pose_generation
Any updates you have made locally to offline_pose_generation.py and other files in the offline_pose_generation/ folder will be copied over to the container when you build.
This enables workflows where you need to modify the existing files inside offline_pose_generation/ (e.g. to generate data for a custom object by modifying config/).
To build the container, run:
cd standalone_examples/replicator/offline_pose_generation
docker build -t NAME_OF_YOUR_CONTAINER:TAG .
Pushing Docker Container to NGC
To use this new container on NGC, we first have to push it. Pushing a container to NGC requires authentication, which you can set up by following this NGC guide.
When pushing a container to NGC, there is a specific naming format that must be followed.
The name of the container must be nvcr.io/<ORGANIZATION_NAME>/<TEAM_NAME>/<CONTAINER_NAME>:<TAG>
For more details on pushing containers to NGC, see this guide.
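As a rough sketch of the naming format above, the helper below assembles the full image name from its four parts and prints the tag and push commands instead of running them. The function name ngc_image_name and the values my-org, my-team, and pose-generation are placeholders for illustration only; substitute your own organization, team, image name, and tag.

```shell
#!/usr/bin/env bash
# Compose nvcr.io/<ORGANIZATION_NAME>/<TEAM_NAME>/<CONTAINER_NAME>:<TAG>
# from its four parts. ngc_image_name is a hypothetical helper name.
ngc_image_name() {
    local org="$1" team="$2" name="$3" tag="$4"
    printf 'nvcr.io/%s/%s/%s:%s\n' "$org" "$team" "$name" "$tag"
}

# Placeholder values; replace them with your own.
IMAGE="$(ngc_image_name my-org my-team pose-generation 1.0)"

# Dry run: print the commands rather than executing them.
echo "docker tag NAME_OF_YOUR_CONTAINER:TAG $IMAGE"
echo "docker push $IMAGE"
```

Printing the commands first makes it easy to check the final name before anything is pushed to the registry.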
docker push NAME_OF_YOUR_CONTAINER:TAG
Adding S3 Credentials to NGC Jobs
If you plan to read data from an s3 bucket or write your results to an s3 bucket, you need to add your credentials as part of the job definition. NGC currently has no good way to manage secrets, so this has to be done manually.
You can add your credentials by placing the commands below at the beginning of the Run Command in your job definition.
Make sure to fill in your credentials in the places marked.
# Credentials for boto3
mkdir -p ~/.aws
echo "[default]" >> ~/.aws/config
echo "aws_access_key_id = <YOUR_USER_NAME>" >> ~/.aws/config
echo "aws_secret_access_key = <YOUR_SECRET_KEY>" >> ~/.aws/config
# Credentials for s3cmd
echo "[default]" >> ~/.s3cfg
echo "use_https = True" >> ~/.s3cfg
echo "access_key = <YOUR_USER_NAME>" >> ~/.s3cfg
echo "secret_key = <YOUR_SECRET_KEY>" >> ~/.s3cfg
echo "bucket_location = us-east-1" >> ~/.s3cfg
echo "host_base = <YOUR_ENDPOINT>" >> ~/.s3cfg
echo "host_bucket = bucket-name" >> ~/.s3cfg
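The same two files can also be written with here-documents, which avoids the repeated echo lines and makes it straightforward to restrict file permissions. The following is only a sketch: the function name write_s3_credentials and the environment variables ACCESS_KEY, SECRET_KEY, and S3_ENDPOINT are assumed names, not part of NGC or s3cmd, and the demonstration writes to a temporary directory with placeholder values instead of your home directory.

```shell
#!/usr/bin/env bash
# Write boto3 and s3cmd credential files under a target directory
# (defaults to $HOME). ACCESS_KEY, SECRET_KEY, and S3_ENDPOINT are
# assumed environment variables for this sketch.
write_s3_credentials() {
    local home_dir="${1:-$HOME}"
    mkdir -p "$home_dir/.aws"
    # Credentials for boto3
    cat > "$home_dir/.aws/config" <<EOF
[default]
aws_access_key_id = ${ACCESS_KEY}
aws_secret_access_key = ${SECRET_KEY}
EOF
    # Credentials for s3cmd
    cat > "$home_dir/.s3cfg" <<EOF
[default]
use_https = True
access_key = ${ACCESS_KEY}
secret_key = ${SECRET_KEY}
bucket_location = us-east-1
host_base = ${S3_ENDPOINT}
host_bucket = bucket-name
EOF
    # Make both files readable by the current user only.
    chmod 600 "$home_dir/.aws/config" "$home_dir/.s3cfg"
}

# Demonstration with placeholder values in a throwaway directory.
export ACCESS_KEY="EXAMPLE_KEY" SECRET_KEY="EXAMPLE_SECRET" S3_ENDPOINT="s3.example.com"
DEMO_HOME="$(mktemp -d)"
write_s3_credentials "$DEMO_HOME"
echo "Wrote credential files under $DEMO_HOME"
```

Calling write_s3_credentials with no argument would write to the real home directory, which is what a job's Run Command would need.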
After pushing the container to NGC, we select this container when creating a job. You can use the following run command:
# (see "Adding S3 Credentials to NGC Jobs" section above for more details)
# Run Pose Generation
./python.sh standalone_examples/replicator/offline_pose_generation/offline_pose_generation.py \
--use_s3 --endpoint https://YOUR_ENDPOINT --bucket OUTPUT_BUCKET --num_dome 1000 --num_mesh 1000 --writer DOPE \
A flag is passed in when running the script to run Isaac Sim in headless mode. This overrides any other settings that determine whether the app runs in headless mode. Without this flag, we could get an error if the config file we pass in has "headless": false, since it is not possible to launch an Isaac Sim window when running in a Docker container.
Things to Note
To submit a job on the OVX clusters, it must be made preemptible. To do this, select the corresponding option under Preemption Options when creating the job.
10.7.3. Train, Inference, and Evaluate
Running Locally
To run the training, inference, and evaluation scripts locally, clone the Dope Training Repo and follow the instructions in the README.md file within the repo.
Running on NGC
NGC offers users the ability to scale their training jobs. Since DOPE needs to be trained separately for each class of object, NGC is extremely helpful in enabling multiple models to be trained at once. Furthermore, it reduces the time needed to train models by providing the option to run multi-GPU jobs.
When creating a job, copy the command below and use it as your job's Run Command on NGC.
Be sure to change the parameters to suit your needs.
If you would like to run the entire training, inference, and evaluation pipeline in one go, you can refer to the Running Entire Pipeline in One Command section below.
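The run command in this section passes $((epochs / num_gpus)) to --epochs. Bash arithmetic is integer-only, so the quotient is truncated when the epoch count is not a multiple of the GPU count. The sketch below isolates that calculation so you can check the resulting value before submitting a job; the function name epochs_per_run is hypothetical and not part of the training scripts.

```shell
#!/usr/bin/env bash
# Compute the value handed to --epochs: total epochs divided by GPU count.
# Bash arithmetic is integer-only, so any remainder is discarded.
# epochs_per_run is a hypothetical helper name.
epochs_per_run() {
    local epochs="$1" num_gpus="$2"
    if [ "$num_gpus" -lt 1 ]; then
        echo "num_gpus must be at least 1" >&2
        return 1
    fi
    echo $((epochs / num_gpus))
}

epochs_per_run 60 1   # prints 60
epochs_per_run 60 4   # prints 15
epochs_per_run 60 8   # prints 7 (60 / 8 = 7.5, truncated)
```

Choosing an epoch count that is a multiple of num_gpus avoids the silent truncation.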
# (see "Adding S3 Credentials to NGC Jobs" section for more details)
# Change values below:
export endpoint="https://YOUR_ENDPOINT"
export num_gpus=1
export train_buckets="BUCKET_1 BUCKET_2"
export batchsize=32
export epochs=60
export object="CLASS_OF_OBJECT"
export output_bucket="OUTPUT_BUCKET"
export inference_data="PATH_TO_INFERENCE_DATA"
# Run Training
python -m torch.distributed.launch --nproc_per_node=$num_gpus \
train.py --use_s3 \
--train_buckets $train_buckets \
--endpoint $endpoint \
--object $object \
--batchsize $batchsize \
--epochs $((epochs / num_gpus))
# Copy Inference Data Locally
mkdir -p sample_data/inference_data
s3cmd sync s3://$inference_data sample_data/inference_data
# Run Inference
cd inference/
python inference.py \
--weights ../output/weights \
--data ../sample_data/inference_data \
--object $object
# Run Evaluation
cd ../evaluate
python evaluate.py \
--data_prediction ../inference/output \
--data ../sample_data/inference_data \
--outf ../output/ \
# Store Training and Evaluation Results
cd ../
s3cmd mb s3://$output_bucket
s3cmd sync output/ s3://$output_bucket
Running Entire Pipeline in One Command
To make running the entire pipeline easier on NGC, there is also a script run_pipeline_on_ngc.py
that can run the entire pipeline with one command. Below is an example of an NGC run command that
uses the script to run the entire pipeline:
# (see "Adding S3 Credentials to NGC Jobs" section for more details)
python run_pipeline_on_ngc.py \
--num_gpus 1 \
--endpoint https://ENDPOINT \
--object YOUR_OBJECT \
--train_buckets YOUR_BUCKET \
--inference_bucket YOUR_INFERENCE_BUCKET \
--output_bucket YOUR_OUTPUT_BUCKET
Building Your Own Training Container with Dockerfile
The easiest way to run this pipeline is with the existing container on NGC that is linked above.
Alternatively, there is a Dockerfile in the Dope Training Repo.
You can use this to build your own Docker image.
Note that this Dockerfile uses the PyTorch Container from NGC as the base image.
Then, assuming that you are in the directory where the Dockerfile is, you can run the command below.
For additional information on building Docker images, refer to the official Docker guide.
cd docker
docker build -t nvcr.io/nvidian/onboarding/sample-image-dope-training:1.0 .
Here, nvcr.io/nvidian/onboarding/sample-image-dope-training is the name of the image we want to build and 1.0 is the tag. We use this naming convention in order to upload our image as a container to NGC.
We need to run get_nvidia_libs.sh because the visii module used in evaluate.py requires drivers that are not in the default PyTorch container we build on. Thus, we need to manually copy the files over.
Then, to push this container to NGC, we can follow the same steps listed in the Pushing Docker Container to NGC section above.
10.7.4. Summary
This tutorial covered the following topics:
How to generate synthetic data with Isaac Sim on the OVX clusters on NGC. Using these clusters enables you to scale up your synthetic data generation.
How to train and evaluate a DOPE model on NGC using data that has been uploaded to an s3 bucket. This enables you to scale up your model training by training multiple models at once on clusters with multiple GPUs.