3D Labeling Without A Depth Camera
We recently received a request from a customer who wanted to use the Stray 3D labeling pipeline with only color image frames. The request made perfect sense, as depth sensors aren't available in most applications.
In this post, we will show you how to harness the power of 3D labeling using only plain old video as input.
The benefits of 3D labeling
In a previous post, we showed the benefits of labeling in 3D. Compared to labeling images frame by frame in 2D, labeling in 3D can save you hours of manual work when creating datasets for semantic segmentation or object bounding box detection.
Previously, our toolkit only worked with color images (RGB) that had a depth map to go along with each frame. A depth map tells us, for each pixel in the image, how far away the corresponding point in the scene is. This meant that you had to collect the data using a depth camera, such as one of the cameras from the Intel RealSense product line. In practice this is a huge restriction, as most of the video shot today is recorded on regular cameras.
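To make the role of a depth map concrete, here is a minimal sketch of how a per-pixel depth value, together with camera intrinsics, turns each pixel into a 3D point. This is a generic pinhole-camera back-projection, not the toolkit's actual implementation; the intrinsics (`fx`, `fy`, `cx`, `cy`) and the constant 2-meter depth map are made-up values for illustration.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Convert an HxW depth map (meters) into an HxWx3 array of 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Hypothetical 640x480 depth map where every pixel is 2 m away.
depth = np.full((480, 640), 2.0)
points = backproject(depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
print(points.shape)  # (480, 640, 3)
```

With RGB-only input, these per-pixel distances are not measured directly; they have to be estimated from the camera motion across frames, which is what the new pipeline does.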
We are happy to announce that in addition to the RGB+depth pipeline, the Stray toolkit now includes a pipeline that works with color images only. This gives you all the benefits of 3D labeling even when working with plain image data.
A practical example
To showcase the new pipeline in action, we collected a few 1-minute long video clips of shoes, which we wanted to segment throughout the video clip. An example of such a clip is shown below.
After shooting the video, we upload it to a computer with the Stray command line interface (CLI) installed. With a few simple commands, we can process the video into a 3D mesh representation and estimate the camera poses.
After we have processed the video, we move on to labeling our objects of interest in the scene using the Stray Studio graphical user interface. With the Studio tool, you can add bounding boxes and keypoints for all the instances that you want to annotate.
Once we are happy with the bounding boxes, we can project the 3D labels onto each 2D image. The segmentation masks are calculated from the mesh of the scene and the annotated 3D bounding boxes, so there is no need to "paint" the mask in either 3D or 2D, which saves a lot of time. The 2D bounding boxes are created from the segmentation mask or the 3D bounding box, whichever suits the particular case better. For our shoe dataset, we went with the mask-based approach.
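The projection step can be sketched as follows: given the camera intrinsics and the pose estimated for a frame, the corners of a 3D box are projected into pixel coordinates, and a tight 2D bounding box is taken around them. This is a generic pinhole-projection sketch under assumed camera parameters (`K`, `R`, `t`), not the toolkit's internal code; the unit cube placed 2 m in front of an identity-pose camera is an illustrative example.

```python
import numpy as np

def project_points(points, K, R, t):
    """Project Nx3 world points into pixel coordinates with a pinhole camera."""
    cam = points @ R.T + t           # world frame -> camera frame
    uv = cam @ K.T                   # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]    # perspective divide

def box_to_bbox2d(corners, K, R, t):
    """Tight 2D bounding box around the projected corners of a 3D box."""
    uv = project_points(corners, K, R, t)
    return uv.min(axis=0), uv.max(axis=0)

# Illustrative scene: a unit cube centered 2 m in front of the camera.
corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                              for y in (-0.5, 0.5)
                              for z in (1.5, 2.5)])
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
top_left, bottom_right = box_to_bbox2d(corners, K, np.eye(3), np.zeros(3))
print(top_left, bottom_right)
```

Because the camera pose is known for every frame, the same annotated box projects consistently into all frames of the clip, which is why a single 3D annotation yields labels for the whole video.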
Below you can see the original video clip overlaid with the segmentation mask and a bounding box. The projected labels are accurate both up close and when observing the object from a distance.
We would like to highlight that labeling this single scene took less than a minute.
Have a use case in mind?
The new pipeline is currently being tested with a handful of beta customers. If you'd like to be one of them and have a use case in mind, don't hesitate to reach out; we'd love to hear from you! You can email us at firstname.lastname@example.org.