TLDR: Yes, you probably should
Let's say you’re starting an exciting new object detection project. You’ve defined the problem, met with stakeholders, collected data, and now you’re ready for the tedious part of machine learning projects: labelling the data. This is often a pretty mindless, standard procedure that doesn’t vary much between projects.
Typically, all objects of interest are labelled, which is easy when images contain only a small number of objects. However, I ran into a problem with this process on a recent project that involved counting sheep as they’re being unloaded from a truck. In addition to the sheep in the foreground of the image (which will be labelled), there is often a large herd of sheep in the background (which won’t be labelled, due to time). The question then arises: will these unlabelled sheep degrade the performance of the detection model?
An example (potentially problematic) labelled image is shown below:
Note that only the two sheep running down the ramp are labelled, even though there are tens of sheep in the background; it would be too labor-intensive to label all the background sheep for the training images, and these aren’t particularly important for the model to detect anyway.
This scenario may cause problems for the model in one or both of the following ways:
- ‘Confuse’ the model, as it is being taught to detect some sheep and not others, with the only difference being that some sheep are on the race while others are in a pen
- Cause the model to generalise poorly. Even if the model converges nicely on this dataset, it may have learned a bias to detect sheep only when they’re running through a race, and so ignore sheep milling about in a group (as in the background of the training images) when applied to other sheep-counting scenarios.
To address this concern, a simple experiment was done.
- Create a masked dataset, in which the unlabelled sheep are covered by a mask of some sort (3 types are experimented with)
- Train an object detection model on each of these datasets until convergence (as measured by a separate validation set)
- Use these models to count sheep on a completely separate testing dataset.
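The convergence criterion in the second step can be sketched as a simple early-stopping loop. This is an illustrative sketch rather than the project's actual training code; `train_one_eval_period` and `validation_mae` are hypothetical stand-ins for whatever the training framework provides:

```python
# Hypothetical early-stopping loop. `train_one_eval_period` runs a chunk of
# training (e.g. a few hundred iterations); `validation_mae` evaluates the
# current model on the validation set.
def train_with_early_stopping(train_one_eval_period, validation_mae, patience=3):
    """Train until validation MAE fails to improve for `patience` evaluations."""
    best_mae = float("inf")
    evals_without_improvement = 0
    while evals_without_improvement < patience:
        train_one_eval_period()
        mae = validation_mae()
        if mae < best_mae:
            best_mae = mae
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
    return best_mae
```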
A COCO-pretrained ResNeXt-101 was used as the detection model, via FAIR’s Detectron2 framework. The validation set used during training was drawn from the same distribution as the training dataset (sheep being offloaded from a truck). The testing dataset used to measure each model’s counting accuracy is from a completely different distribution: the images come from a different farm, with the sheep simply milling about in their herd. They tend to group up, which makes this a good dataset for comparing the herd-counting accuracy of these models, and the separate location also tests the models’ generalisation ability.
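For reference, a Detectron2 setup along these lines might look like the following config fragment. The exact model zoo config used in the project isn't stated, so the `faster_rcnn_X_101_32x8d_FPN_3x` file here is an illustrative ResNeXt-101 choice:

```python
# Sketch of a COCO-pretrained ResNeXt-101 detector in Detectron2. The specific
# model zoo config is an assumption, not confirmed by the post.
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # a single class: sheep
```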
The research question being addressed is:
Do unlabelled objects interfere with training an object detection model and should they therefore be masked out?
Four training datasets were created for this experiment, each using the same random training/validation split:
- Control (top left) - No changes were made to the images
- Black Mask (top right) - Areas that background sheep occurred in were covered by a black mask. This acts as a simple baseline approach.
- Random Color Mask (bottom left) - Background sheep areas were covered by a randomly colored mask. This specific dataset aims to answer the question of whether the mask color makes any noticeable difference (unlikely, but worth investigating).
- Real Image Mask (bottom right) - Background sheep areas were covered by a mask taken from a real image (grass, concrete, etc). Masking out the background with a single color makes unrealistic images, so this dataset aims to investigate whether creating more realistic masked images is a better option.
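The three masking strategies can be sketched with plain NumPy, given a bounding box around each background-sheep region. This is an illustrative implementation, not the exact preprocessing used in the project:

```python
import numpy as np

def mask_region(img, box, mode="black", patch=None, rng=None):
    """Cover a background-sheep region with one of the three mask types.

    img   -- HxWx3 uint8 image array
    box   -- (x0, y0, x1, y1) pixel coordinates of the region to hide
    patch -- for mode="real", a crop from a sheep-free image (grass, concrete)
    """
    x0, y0, x1, y1 = box
    out = img.copy()
    if mode == "black":
        out[y0:y1, x0:x1] = 0
    elif mode == "random":
        rng = rng or np.random.default_rng()
        out[y0:y1, x0:x1] = rng.integers(0, 256, size=3)  # one uniform random color
    elif mode == "real":
        out[y0:y1, x0:x1] = patch[:y1 - y0, :x1 - x0]  # assumes patch is large enough
    return out
```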
Side note: This is a somewhat niche problem, and obviously the best option would be to simply label all objects in the dataset that should be detected. This research is geared towards individuals/small companies that don’t have the resources to scale up their labelling effort to that level.
Examples of these images are shown below:
| Control | Mask - Black | Mask - Random | Mask - Real |
|---|---|---|---|
| 0.936 | | | 0.665 |
Table 1: MAE on the test set of model trained on each dataset. Model chosen by early stopping. Lower is better.
Table 1 shows the resulting mean absolute error (MAE) of the count (how many sheep the model detected in the image vs how many there really were) given by models trained to convergence (i.e. stopped once validation performance stops improving) on each respective dataset.
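Concretely, the count MAE here is just the mean absolute difference between the predicted and true per-image counts:

```python
def count_mae(predicted_counts, true_counts):
    """Mean absolute error between per-image detection counts and ground truth."""
    assert len(predicted_counts) == len(true_counts)
    return sum(abs(p - t) for p, t in zip(predicted_counts, true_counts)) / len(true_counts)
```

So a model that predicts 3, 5, and 2 sheep in images that actually contain 4, 5, and 4 has an MAE of (1 + 0 + 2) / 3 = 1.0.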
As expected, the control performed the worst, with an MAE of 0.936, indicating that the unlabelled background sheep are likely hurting the model’s performance.
The best model was the one trained with randomly colored masks. The gap between it and the black-mask model was noticeable, but smaller than the gap between the control and the random-mask model.
Surprisingly, the model trained with masks cut from real images performed slightly worse than those using black or randomly colored masks, with an MAE of 0.665. The reason for this isn’t entirely clear; perhaps the model was leveraging these substituted backgrounds, despite them containing no sheep, when making predictions, or perhaps removing all background information (via a single color) eliminates distraction and lets the model focus solely on the sheep to be detected.
Using a single model per dataset gives only limited insight into the results. To probe the differences further, I evaluated a set of checkpoints taken at different stages of training on the testing dataset: nine models per dataset, saved every 300 iterations from 4500 to 6900, with their nine MAE values averaged. This range was chosen because models tended to converge around this stage of training. Averaging over a set of checkpoints smooths out the noise inherent in training; model performance fluctuates between iterations rather than improving monotonically, so a single checkpoint can be misleading.
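The checkpoint-averaging procedure amounts to the following; `mae_at_iteration` is a hypothetical stand-in for loading a saved checkpoint and evaluating it on the test set:

```python
def averaged_mae(mae_at_iteration, start=4500, stop=6900, step=300):
    """Average test-set MAE over checkpoints saved every `step` iterations
    from `start` to `stop` inclusive (nine checkpoints with the defaults)."""
    iterations = range(start, stop + 1, step)
    maes = [mae_at_iteration(i) for i in iterations]
    return sum(maes) / len(maes)
```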
| Control | Mask - Black | Mask - Random | Mask - Real |
|---|---|---|---|
Table 2: Average MAE of models obtained from iterations 4500-6900 (average of 9 values)
Table 2 shows the results of this experiment (full table in appendix). Largely the story is the same: the control (no masking) model performed the worst, while masking background sheep with a randomly colored mask provided the most significant boost in performance (a 47% improvement in MAE). As the values being averaged are not completely independent, performing a paired t-test between them would be somewhat disingenuous, which makes it difficult to make strongly founded claims about the results. Even so, the fact that the averaged performance trends match those in Table 1 strengthens the claim that unlabelled background objects negatively affect model performance, and so should be masked out. The MAEs for the black and randomly colored masks are similar; however, in this case the randomly colored mask performed better, so it will be used for the final training dataset in the project.
Overall, the outcome of the research was positive. The hypothesis, that unlabelled objects can interfere with detection models and should therefore be masked out, was supported. Labelling all objects of interest should be done if the capacity is there; in lieu of that, simply hiding the unlabelled objects is the next best option. Masking background objects can be done incredibly quickly, as most images need only a single polygon drawn.
These experiments were only done on a single dataset with one model architecture, so the results may not hold in all cases and shouldn't be taken as gospel. Despite this, masking has been shown to improve the model’s performance and so probably should be done for cases similar to the one shown.