Automating labeling through 3D

February 7, 2022

Modern computer vision methods are mostly based on machine learning. These methods require huge datasets, often hundreds of thousands or millions of examples, to learn the parameters. Acquiring these datasets today often means outsourcing to a labeling service and spending tens or hundreds of thousands of dollars on the labor required to create the labels.

One problem we often see with companies outsourcing data annotation, especially in industrial applications, is that you still have to check every annotation to make sure it is correct. The reason is that the annotators, typically working through worker pools, lack the context of the task and simply rely on their intuition and the instructions. Annotators will often jump to conclusions, because they get paid for getting something done rather than for getting it right.

Compared to regular 2D computer vision tasks, where the objective might be to recognize a cat or a human from a single image, robotics and augmented reality give us the benefit of video data, which is temporally correlated. On top of that, we usually have multiple sensor readings that we can fuse together.

If we want to solve a 2D computer vision task, such as 2D bounding box detection, recovering the 3D information allows us to greatly speed up the labeling process by propagating labels from one frame to the next.

For 3D tasks, such as 3D bounding box detection or 3D keypoint detection, we have to recover and label the actual 3D representation, to be able to tell how far away objects are and which way they are oriented.
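To make "how far away" and "which way" concrete, here is a minimal sketch of a 3D bounding box label. The `Box3D` class, its fields, and the choice of a single yaw angle about a vertical z axis are our assumptions for illustration, not the format used by any particular tool.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Box3D:
    """Hypothetical 3D box label: center and size in meters, yaw in radians."""
    center: Vec3
    size: Vec3
    yaw: float  # rotation about the vertical (z) axis; an assumed convention


def distance_to(box: Box3D, camera_position: Vec3) -> float:
    """Euclidean distance from the camera to the box center."""
    return math.dist(box.center, camera_position)


def heading(box: Box3D) -> Tuple[float, float]:
    """Unit vector in the ground plane showing which way the box faces."""
    return (math.cos(box.yaw), math.sin(box.yaw))


hydrant = Box3D(center=(3.0, 4.0, 0.0), size=(0.4, 0.4, 0.9), yaw=0.0)
print(distance_to(hydrant, (0.0, 0.0, 0.0)))  # 5.0
print(heading(hydrant))                       # (1.0, 0.0)
```

A 2D box cannot carry either of these quantities, which is why the 3D tasks need the 3D representation to be labeled directly.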


Take the case of this fire hydrant as an example. Say we wanted our software to detect it. We would do this by creating a dataset of fire hydrants in many different circumstances. Then we would annotate them with 2D bounding box labels and use something like YOLO to infer the bounding boxes on future examples. In this case, we will annotate entire scans in one go instead of labeling individual images.

Here, we are using both camera images and depth measurements.

First we run the scan through our pipeline, which fuses the camera and depth measurements into a 3D representation. Then we annotate it directly in 3D. Here is what the annotated scene looks like:
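The core idea behind fusing depth and camera data can be sketched in a few lines. This is not our actual pipeline, only an illustration under simple assumptions: a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a metric depth value per pixel, and a known camera-to-world pose for each frame. All function names here are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point:
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p: Point, rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> Point:
    """Apply a camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def depth_to_points(depth_image: Sequence[Sequence[float]],
                    fx: float, fy: float, cx: float, cy: float,
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> List[Point]:
    """Turn one depth frame into world-space points, skipping invalid depth."""
    points = []
    for v, row in enumerate(depth_image):
        for u, d in enumerate(row):
            if d <= 0.0:  # zero or negative depth is treated as "no reading"
                continue
            p_cam = backproject(u, v, d, fx, fy, cx, cy)
            points.append(camera_to_world(p_cam, rotation, translation))
    return points
```

Running `depth_to_points` over every frame of a scan and merging the results gives a rough point cloud of the scene. A real pipeline also has to estimate the poses and filter noise, which this sketch leaves out.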

Fire hydrant scene 3D labeling in Stray Studio.
The fire hydrant scene opened in Stray Studio.

Once we have the 3D scene annotated, we can use it to generate labels for the task we are interested in solving, for example 2D or 3D bounding box detection.
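As a rough illustration of how one 3D annotation can become a 2D label in every frame, the sketch below projects the eight corners of a 3D box into an image and takes their pixel extent. It assumes an axis-aligned box in world coordinates, a pinhole camera, and a known world-to-camera pose per frame. The function names and parameters are hypothetical and not the Stray Studio API.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Box2D = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max in pixels


def box_corners(center: Vec3, size: Vec3) -> List[Vec3]:
    """The eight corners of an axis-aligned 3D box."""
    half = [s / 2.0 for s in size]
    return [
        (center[0] + sx * half[0], center[1] + sy * half[1], center[2] + sz * half[2])
        for sx, sy, sz in product((-1.0, 1.0), repeat=3)
    ]


def project_box(center: Vec3, size: Vec3,
                rotation: Sequence[Sequence[float]], translation: Sequence[float],
                fx: float, fy: float, cx: float, cy: float,
                width: int, height: int) -> Optional[Box2D]:
    """Project a 3D box into one frame and return its 2D bounding box."""
    us, vs = [], []
    for corner in box_corners(center, size):
        # World -> camera: p_cam = R @ p_world + t
        x, y, z = (
            sum(rotation[i][j] * corner[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        if z <= 0.0:
            return None  # part of the box is behind the camera; skip this frame
        us.append(fx * x / z + cx)
        vs.append(fy * y / z + cy)
    # Clip the projected extent to the image bounds.
    x_min, x_max = max(min(us), 0.0), min(max(us), float(width))
    y_min, y_max = max(min(vs), 0.0), min(max(vs), float(height))
    if x_min >= x_max or y_min >= y_max:
        return None  # box falls entirely outside the image
    return (x_min, y_min, x_max, y_max)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
box = ((0.0, 0.0, 2.5), (1.0, 1.0, 1.0))

# Same box, two camera poses: the 2D label follows the camera motion.
print(project_box(*box, identity, [0.0, 0.0, 0.0], 500.0, 500.0, 320.0, 240.0, 640, 480))
# (195.0, 115.0, 445.0, 365.0)
print(project_box(*box, identity, [-0.5, 0.0, 0.0], 500.0, 500.0, 320.0, 240.0, 640, 480))
# (70.0, 115.0, 320.0, 365.0)
```

Calling `project_box` once per frame with that frame's pose is what lets a single 3D annotation cover every image in a scan. Handling occlusion is left out of this sketch.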

That's it. This scene contained 1100 images, which we were able to annotate in less than a minute using the 3D labeling tool. If we wanted to create a dataset with, for example, a hundred thousand examples, we could do that very quickly by automatically processing and annotating more scans of our target.
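As a back-of-the-envelope check, here is how many scans that would take, assuming every scan yields about as many images as this one. That is an assumption, since real scans vary in length. `scans_needed` is a hypothetical helper.

```python
import math


def scans_needed(target_examples: int, images_per_scan: int) -> int:
    """Number of scans required to reach the target, rounding up."""
    return math.ceil(target_examples / images_per_scan)


print(scans_needed(100_000, 1100))  # 91
```

At under a minute of annotation per scan, that comes to roughly an hour and a half of labeling for the whole dataset, not counting the time spent capturing the scans.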

Let us know what you think.
