Launching 3D Studio

July 1, 2021

Our latest tool, Label Studio, allows teams working with computer vision to rapidly annotate large amounts of image data.

In spatial AI applications such as robotics and augmented reality, cameras move through a space, and information about the state of the scene needs to be inferred. This can include the position of objects, semantic segmentation (which pixel corresponds to which object), plane detection, depth completion or the detection of specific points in the scene.

Most algorithms today solve these problems through machine learning. A large dataset is created with input and output examples, and a model is fitted to predict the outputs from the inputs.

Doing this accurately requires large datasets. Typically, these datasets are built by taking image frames and outsourcing them to a large number of workers, who annotate the images one by one.

In spatial AI applications, image frames are highly correlated and have a specific relation to each other: they view the same scene, but from different viewpoints. Annotating each frame in isolation ignores this relation and repeats the same work many times over.

Annotate large amounts of data with a few clicks

Label Studio takes image frames from your robot or app and stitches them together to build a global 3D reconstruction of the scene. In the process, we build up a graph of camera poses, which allows us to infer the pose of the camera as it moves through the scene. This lets us annotate large numbers of image frames very quickly: we label the scene once in a 3D graphical user interface, then project the annotated 3D labels onto all the images in the dataset.
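To make the projection step concrete, here is a minimal sketch, assuming a simple pinhole camera model and a known world-to-camera pose for each frame. The function names, the pose convention and the intrinsic parameters fx, fy, cx and cy are illustrative assumptions, not Label Studio's actual implementation.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project_point(point_world: Vec3,
                  rotation: Sequence[Sequence[float]],
                  translation: Vec3,
                  fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a 3D world point into pixel coordinates with a pinhole model.

    Assumption: rotation (3x3) and translation map world coordinates into
    the camera frame, i.e. p_cam = R * p_world + t.
    """
    # Transform the point from the world frame into the camera frame.
    x, y, z = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division, then scale and offset by the intrinsics.
    return fx * x / z + cx, fy * y / z + cy


def bounding_box_2d(corners_world: Sequence[Vec3],
                    rotation: Sequence[Sequence[float]],
                    translation: Vec3,
                    fx: float, fy: float, cx: float, cy: float
                    ) -> Tuple[float, float, float, float]:
    """Axis-aligned 2D box (x_min, y_min, x_max, y_max) around projected corners."""
    pixels = [project_point(c, rotation, translation, fx, fy, cx, cy)
              for c in corners_world]
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    return min(xs), min(ys), max(xs), max(ys)
```

Calling bounding_box_2d once per frame, with that frame's pose and the eight corners of a single 3D box, would yield one 2D box per image from a single 3D annotation. A real pipeline would also need to clip boxes to the image bounds and handle occlusion, which this sketch leaves out.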

If one scan of your scene contains 3600 images (60 seconds of video at 60 frames per second), you can label all 3600 images by adding your annotations with just a few clicks in Label Studio, and generating labeled image frames containing 2D input and output examples. You can repeat this for each of the scenes you care about, and very quickly get to the hundreds of thousands of examples required to train modern computer vision algorithms.

What follows is a demonstration of what Label Studio can do.

Data Collection

Label Studio takes RGB-D image frames as input. These are image frames with a corresponding depth map, as captured by depth cameras such as the Intel RealSense, LiDAR-enabled iOS devices or the Azure Kinect.

For the purposes of this demo, we used an iPhone 12 Pro and our Stray Scanner app.

On the left is the depth output from the iPhone 12 Pro; on the right are the color images.

We save the images from a scene in a folder structured as follows:

  • A color directory containing numbered jpg color images
    • 00000.jpg
    • 00001.jpg
    • ...
  • A depth directory containing png encoded depth frames
    • 00000.png
    • 00001.png
    • ...
  • A camera_intrinsics.json file, containing the intrinsic parameters of our camera.
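As a rough illustration, a scene folder laid out this way could be read with a few lines of Python. The load_scene helper is hypothetical and not part of the stray tool; since the fields of camera_intrinsics.json are not described here, the sketch loads the file as plain JSON without interpreting it.

```python
import json
from pathlib import Path
from typing import Any, List, Tuple


def load_scene(scene_dir: str) -> Tuple[List[Tuple[Path, Path]], Any]:
    """Pair up color and depth frames by number and load the intrinsics file.

    Hypothetical helper: it only relies on the folder layout described above.
    """
    scene = Path(scene_dir)
    # Index frames by their numeric file stem, e.g. "00000".
    color = {p.stem: p for p in (scene / "color").glob("*.jpg")}
    depth = {p.stem: p for p in (scene / "depth").glob("*.png")}

    # Every color frame should have a depth frame with the same number.
    unmatched = sorted(set(color) ^ set(depth))
    if unmatched:
        raise ValueError(f"frames without a color/depth partner: {unmatched}")

    with open(scene / "camera_intrinsics.json") as f:
        intrinsics = json.load(f)  # contents are left uninterpreted here

    frames = [(color[key], depth[key]) for key in sorted(color)]
    return frames, intrinsics
```

Checking that color and depth frames pair up before running a reconstruction is a cheap way to catch an incomplete copy of a scan.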

We run a reconstruction of the scene using our stray command line tool. The command stray reconstruct reads the camera parameters and the color and depth images, builds up the pose graph and reconstructs the scene.


After reconstructing the scene, we annotate it with Stray Label Studio. This is done with the stray studio <scene> command, which opens up our reconstructed scene in Label Studio. This is what it looks like:

Our scene in Label Studio.

In this case, we want to add a 3D bounding box that encompasses the bottle in our scene.


Once we have labeled the scene, we can check what our annotations look like when superimposed on our image dataset. This is done by running the command stray preview <scene>, which plays through the images, showing the annotations.

Generating a Labeled Dataset

Once we are happy with the result, we can generate a dataset for learning using the command stray generate --yolo-bbox, which will create a dataset in the YOLO Darknet format. The bounding box annotations are saved in an annotations subdirectory as numbered text files such as 00000.txt. Each file contains one line for each instance in the scene:

0 <x> <y> <width> <height>

0 refers to the instance id.

<x> and <y> are float values between 0 and 1 denoting the position of the center of the bounding box, relative to the size of the image.

<width> and <height> correspond to the width and height of the bounding box, relative to the image size.
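As an illustration of the format, the sketch below converts one annotation line back into pixel coordinates. The helper name yolo_line_to_pixels and the example image size are assumptions made for this example.

```python
from typing import Tuple


def yolo_line_to_pixels(line: str, image_width: int, image_height: int
                        ) -> Tuple[int, float, float, float, float]:
    """Convert one '<id> <x> <y> <width> <height>' line to pixel corners.

    Returns (id, x_min, y_min, x_max, y_max), with the corners in pixels.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    label = int(fields[0])
    x, y, w, h = (float(v) for v in fields[1:])

    # Scale the relative center and size up to pixel units.
    center_x, center_y = x * image_width, y * image_height
    box_w, box_h = w * image_width, h * image_height

    # Move from center/size form to corner form.
    return (label,
            center_x - box_w / 2, center_y - box_h / 2,
            center_x + box_w / 2, center_y + box_h / 2)
```

For example, yolo_line_to_pixels("0 0.5 0.5 0.25 0.5", 640, 480) describes a box centered in a 640 by 480 image that spans a quarter of its width and half of its height.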

This dataset is now good to go, and can be used to train a YOLO object detector to detect the objects in new, unseen images.


In this post, we showed how Label Studio can be used to quickly create datasets for spatial AI applications.

Currently, Label Studio supports keypoint, 3D bounding box and 2D bounding box annotation types. Going forward, we will be adding other annotation types, including semantic segmentation, depth completion, 6D object poses and optical flow, among others.

If you would like to try out Label Studio for your application, reach out to us here.
