Object detection in an hour
Building object detection models can be very painful. If you are lucky, a pretrained YOLOv3 or Detectron2 model will work just fine, but often you will need a class that does not exist in any public dataset, or you will need to further tune a model for your specific operating conditions and sensors. While setting up a model and training pipeline requires some computer vision and machine learning knowledge, collecting and annotating the training data is still by far the most work-intensive part of the process.
To train a model, you will need thousands of examples, preferably hundreds of thousands, from your specific domain. Annotating these manually can easily mean weeks of work.
With the right tools, this time can be cut from weeks to hours (or, in our case, a single lunch break).
Here we describe how we used the Stray software stack to create a custom electric scooter detection system in just an hour of active time.
We figured that electric scooters, which have taken over cities worldwide, would serve as a perfect example of an object detection target. The resulting detector could be used in a robot that puts them where they belong. Scooters are also a great example of an object that has no corresponding category in the popular COCO dataset, which is often used to pre-train and evaluate object detection models.
To train a custom object detection model, the amount of data needs to be on the order of thousands, preferably tens of thousands, of images. This amounts to only a few minutes of video at 30-60 fps. It is important to capture the target in many different conditions and from all viewpoints, so it makes sense to record many short clips in different contexts rather than a single long video in one context.
Fortunately, finding electric scooters in different contexts was not an issue for us, since the path to our lunch place was lined with dozens of them. We went ahead and captured a dataset of all the scooters we encountered using the Stray Scanner app, which records RGB and depth frames at 60 fps. In total, we recorded 25 different clips, amounting to roughly 15 000 images.
While we used the Stray Scanner app, any RGB-D sensor that produces both RGB frames and depth maps can be used.
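Given such a depth map and the camera intrinsics, every pixel can be lifted into a 3D point in the camera frame, which is the basic ingredient of the 3D reconstruction step described below. A minimal numpy sketch, assuming a standard 3×3 pinhole intrinsic matrix `K` (the function name is ours, not part of the Stray API):

```python
import numpy as np

def backproject_depth(depth, K):
    """Lift a depth map of shape (H, W), in meters, to an (H*W, 3)
    point cloud in the camera coordinate frame."""
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pixel coordinate grids: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    # Invert the pinhole projection: x = (u - cx) * z / fx, etc.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Transforming these per-frame point clouds with the recovered camera poses places them all in a common world frame, which is what the fused triangle mesh represents.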
From raw video to annotated data
At this point, we would typically load the data into an annotation tool such as MakeSense or SuperAnnotate and begin a tedious process of labeling images one by one, adding bounding boxes or segmentation masks that cover the objects. Alternatively, we could send the images to a manual labeling service such as Segments.ai, but the costs can pile up really quickly, depending on the size of the dataset and annotation type.
To avoid this, we import the data collected with the Stray Scanner app into our data format. We then integrate the frames into a 3D reconstruction and recover the camera poses using the Stray Command Line Tool's simple stray studio integrate command. The output is the trajectory of the camera and a triangle mesh representing the scene in 3D, which we can open in the Stray Studio interface.
In Stray Studio, we can place bounding boxes and keypoints directly in 3D space. Once we are happy with the 3D annotations, we can project the bounding boxes and keypoints back onto each of the 2D images from which the scene was reconstructed. Annotating in 3D additionally allows us to compute 3D labels in each image's coordinate frame, which is useful for solving 3D computer vision tasks.
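To make the projection step concrete, here is a minimal numpy sketch of how a single 3D box annotation turns into a 2D bounding box for one image. It assumes the box is given as its 8 corner points in world coordinates, the camera pose as a 4×4 camera-to-world transform, and a 3×3 pinhole intrinsic matrix; the function name is ours, not part of the Stray API:

```python
import numpy as np

def project_box_to_image(corners_world, T_world_cam, K):
    """Project the 8 corners of a 3D bounding box into an image and
    return the enclosing 2D box (xmin, ymin, xmax, ymax) in pixels."""
    # Invert the camera pose to map world points into the camera frame.
    T_cam_world = np.linalg.inv(T_world_cam)
    corners_h = np.hstack([corners_world, np.ones((8, 1))])
    corners_cam = (T_cam_world @ corners_h.T)[:3]
    # Pinhole projection: multiply by intrinsics, divide by depth.
    uv = (K @ corners_cam)[:2] / corners_cam[2]
    xmin, ymin = uv.min(axis=1)
    xmax, ymax = uv.max(axis=1)
    return xmin, ymin, xmax, ymax
```

Running this for every annotated box against every camera pose in the trajectory yields per-image 2D labels without touching a single image by hand. A production version would also clip the box to the image bounds and discard boxes behind the camera.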
Getting to this point in the process took us about an hour of active work, several orders of magnitude less than it would have taken had we labeled each image one by one.
Data loading and model training
The labels do not need to be produced and exported in advance; instead, we can configure the desired label type (bounding box, segmentation mask) at model training time. For our electric scooter detector, we chose the Detectron2 library. The Stray Command Line Interface provides easy-to-use utilities for model definition and baking (i.e. fine-tuning and training; check out the documentation for further details).
We train our model to detect 2D bounding boxes around the scooters in RGB images. The qualitative results shown above are after training for 25 000 iterations and evaluating on data not included in the training set. Even though the evaluation data is from a completely different context on a sunny day, the model generalizes well to different types of scooters, and is also robust to camera shake and to scooters being partially occluded.
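As a rough sketch of what a Detectron2 fine-tuning setup for this task looks like: the dataset name `scooters_train` and the choice of a COCO-pretrained Faster R-CNN baseline are our assumptions for illustration, not the exact configuration used here (the Stray CLI handles this setup for you).

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

# Start from a COCO-pretrained Faster R-CNN and fine-tune it.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("scooters_train",)  # assumed registered dataset name
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # a single "scooter" class
cfg.SOLVER.MAX_ITER = 25_000  # matches the iteration count mentioned above

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```

Starting from COCO weights is what makes a relatively small custom dataset viable: the backbone already knows generic visual features, and training mainly adapts the detection head to the new class.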
We are currently building the next round of demos, which will include extracting and inferring semantic segmentation masks. Stay tuned for more in the upcoming weeks!
Using the tool yourself
We have made all the tools mentioned in this post available for everyone to use, free of charge for now. Install the tools by following our installation guide. Currently, we support macOS and Linux platforms.
Moving forward, we will be expanding our toolkit to make it even more powerful and versatile.
If you have a use case in mind, improvement ideas, or any other feedback, we would very much like to hear from you!
Meanwhile, you should subscribe to our newsletter to follow along as we develop a simple-to-use toolkit for solving computer vision problems.