Improving models by fixing object detection labels#

Label quality is one of the major factors that affect your model’s performance. 3LC provides valuable insights into your label quality and singles out problematic labels. By fixing the label errors and retraining on the corrected dataset, you have a good chance of improving your model.

In this tutorial, we will cover the following topics:

  • Installing 3LC-integrated YOLOv8

  • Running YOLOv8 training and collecting metrics

  • Opening a Project/Run/Table in the 3LC Dashboard

  • Finding and fixing label errors in the 3LC Dashboard

  • Comparing original and retrained models’ performance

Installing 3LC-integrated YOLOv8#

The 3LC integration with YOLOv8 is distributed separately from 3LC on GitHub. To install it, run the following commands (the activation line shown is for Windows; on Linux or macOS, run source .venv/bin/activate instead).

git clone https://github.com/3lc-ai/ultralytics.git
cd ultralytics
python -m venv .venv
.venv/Scripts/activate
pip install -e .
pip install 3lc

For more details about 3LC-integrated YOLOv8, please visit the GitHub repository.
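If you want to sanity-check the installation before training, a quick check like the following should work. This is just an illustrative convenience snippet, not part of the integration itself.

from importlib.metadata import version

# The 3LC-enabled YOLO class lives in the forked ultralytics package.
from ultralytics.utils.tlc.detect.model import TLCYOLO  # noqa: F401

# Confirm the 3lc package resolves from the active virtual environment.
print("3lc version:", version("3lc"))
print("TLCYOLO import OK")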

Running YOLOv8 training and collecting metrics#

To run training with the integration, instantiate a model via TLCYOLO instead of YOLO and call .train() as usual. Here is a simple example showing how to specify 3LC settings.

from ultralytics.utils.tlc.detect.model import TLCYOLO
from ultralytics.utils.tlc.detect.settings import Settings

# Set 3LC-specific settings
settings = Settings(
    project_name="hardhat-project",
    run_name="base-run",
    run_description="base run for BBs editing",
    image_embeddings_dim=2,  # collect 2D image embeddings for the Dashboard
    collection_epoch_start=4,  # first epoch at which metrics are collected
    collection_epoch_interval=5,  # collect metrics every 5 epochs after that
    conf_thres=0.4,  # confidence threshold for collected predictions
)

# Initialize and run training
model = TLCYOLO("yolov8n.pt")
model.train(data="hardhat.yaml", epochs=20, settings=settings)

The Runs and Tables will be stored in the project folder “hardhat-project”. For your first run, 3LC creates the tables for the training and validation sets provided through the data argument. For later runs, it will use the existing tables. Once you create new data revisions (we will cover this in the next section), the code will automatically pick up the latest revisions for training and metrics collection.
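For example, a later run can reuse exactly the same training call; only the run name needs to change (the run name and settings below are illustrative), and the data argument will resolve to the latest revision once one exists.

from ultralytics.utils.tlc.detect.model import TLCYOLO
from ultralytics.utils.tlc.detect.settings import Settings

# Same project and data YAML as before; 3LC resolves "hardhat.yaml" to the
# latest revision of the training and validation tables.
settings = Settings(
    project_name="hardhat-project",
    run_name="second-run",  # illustrative name for a follow-up run
    image_embeddings_dim=2,
    collection_epoch_start=4,
    collection_epoch_interval=5,
    conf_thres=0.4,
)

model = TLCYOLO("yolov8n.pt")
model.train(data="hardhat.yaml", epochs=20, settings=settings)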

Opening a Project/Run/Table in the Dashboard#

In this section, we will familiarize you with the 3LC Dashboard. The project “hardhat-demo” from the distributed public examples will be used throughout this tutorial.

To access the 3LC Dashboard, first start the 3LC Service in a terminal.

3lc service

Then, you can open the Dashboard in a browser at dashboard.3lc.ai. The Dashboard consists of three panels: Filters (left), Charts (upper right), and Rows (lower right). These panels appear across various pages, such as the Runs and Tables pages.

On the Dashboard homepage (the Projects page), double-click the project “hardhat-demo” in the project list. You will see a list of the three Runs in the “hardhat-demo” project.

Double-click any one of them to open it and view the collected metrics.

Finding and fixing label errors in the Dashboard#

We will use the Run “base-run”, which was trained on the original training set, to demonstrate how to use the model’s metrics to find potential label errors, including missing labels and inaccurate bounding boxes (BBs).

Double-click the Run “base-run” to open it. Each displayed row represents the metrics collected for a single sample, and each column has a filter widget in the Filters panel.

Let’s first create a chart showing an image overlaid with both its ground truth and predicted BBs. To make the chart, select the Image column, Ctrl + LeftClick the BBS and BBS_predicted columns, and then press 2. In the chart, solid boxes are ground truth labels and dashed boxes are the model’s predictions. We will use this chart to visualize the filtered-in BBs after applying the filters.

For this demo, we will focus only on the training set. Therefore, filter on the “hardhat-train” table in the Foreign Table filter to follow along.

Find missing labels#

Adjust the IOU filter to the range [0.0, 0.2]. With this filter applied, only predicted BBs that have very little or no overlap with ground truth BBs remain visible. These filtered-in predictions could potentially correspond to missing labels. Under the label_predicted filter, the first number next to each class name is the number of filtered-in BBs, while the second number is the total in the unfiltered dataset. That is, 717 helmet and 538 head BBs (1255 in total) satisfy the applied filters, and they appear in 868 samples (out of 7035 in total), as indicated at the top of the METRICS panel.
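As a reminder of what this filter measures, IOU is the overlap area between a predicted box and a ground truth box divided by the area of their union. Below is a minimal sketch of that computation; it is an illustrative helper, not part of the 3LC API, and assumes boxes in (x1, y1, x2, y2) pixel coordinates.

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction whose IOU with every ground truth box is below 0.2 barely
# overlaps any existing label, so it may correspond to a missing label.
print(iou((0, 0, 10, 10), (8, 8, 20, 20)))  # ~0.017, i.e., almost no overlap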

We can then combine the confidence filter with the IOU filter. In general, highly confident false positive (FP) predictions are more likely to be missing labels than real FPs, while low-confidence FP predictions may be a mixed bag of missing labels and real FPs. With some manual checking, this dataset conforms to that trend, as shown in the table below.

Confidence    % of filtered-in BBs    Missing labels / filtered-in BBs
>0.6          40%                     80%
<0.6          60%                     60%

Therefore, we can add the high-confidence, low-IOU predictions to the ground truth labels. Filter confidence to be >0.6, right-click BBS_PREDICTED inside the chart, and click “Add 473 predictions (369 rows)…”. This batch assignment also adds the roughly 20% of real FPs into the labels, so we may want to quickly scan through the just-added labels and remove the unwanted ones (i.e., real FPs).

Next, we go through the BBs with confidence <0.6 manually, since a sizeable portion of them are real FPs that we don’t want to add to the labels. To do this, filter confidence to be <0.6, flick through each sample, and add missing labels while leaving real FPs, at your discretion. To add an individual prediction, right-click the predicted BB and then click “Add prediction”.

Find inaccurate BBs#

Inaccurate BBs are ground truth boxes whose size or location differs significantly from their predicted counterparts (i.e., lower IOU), assuming that the predictions are more accurate. These inaccurate BBs can not only undermine the model’s performance, but also skew TP/FP/FN counts when their IOUs are close to the IOU threshold used for TP/FP/FN calculations. To find the inaccurate BBs, first clear all existing filters (icon at the top of the Filters panel) to start fresh, then filter on “hardhat-train” in Foreign Table (again) and set IOU to the range [0.4, 0.8]. A total of 8100 BBs in 3672 images are filtered in.

For these inaccurate BBs, we want to replace the existing labels with the model’s predictions. To do that, first set “Max IOU” under BBS in the chart from its default of 1 to 0.4 (the low end of the previous IOU filter range), as shown in the figure below. This parameter sets the IOU threshold above which an existing BB is replaced: any existing BB with IOU > 0.4 against a predicted BB will be replaced by that prediction, which is added as the new label.

Then, we can do the batch assignment by clicking “Add 8100 predictions (3672 row…” under BBS_PREDICTED and then clicking OK in the popup dialog box. You will notice that the new labels (identical to the predictions) have replaced the old ones in the displayed sample.

Finally, we want to save all the edits as a new revision. Click the pen icon in the upper right and click “Commit” to create a new revision, which can be used directly for retraining later.

It is worth noting that the label-editing and retraining workflow may need to be iterated a few times to reach optimal results. In fact, we went through several revisions when preparing this “hardhat-demo” project. Some intermediate Runs and data revisions are not included in the project for the sake of simplicity.

Comparing original and retrained models’ performance#

After a few iterations of label editing and retraining, we end up with a solid revision. Now we can run training on the newest revision (“retrained-run”) and on the original one (“original-run”) to see whether the data editing improved the model. From the metrics charts below, you can see that “retrained-run” is better than “original-run” across the board. Note that the metrics for both Runs are collected against the same revision of the val set rather than the original val set.

Here are the F1 scores for the two Runs.

Run              F1
Original-run     0.941
Retrained-run    0.959
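For reference, F1 is the harmonic mean of precision and recall over the detections. A minimal sketch of the formula follows; the precision and recall values in it are made up purely for illustration, not taken from these Runs.

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values only.
print(round(f1_score(0.95, 0.93), 3))  # -> 0.94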

Finally, let’s look at some samples to see if and how the model improves with the revised training data. In the images below, the left image is from “original-run” and the right is from “retrained-run”. It is evident that the retrained model is indeed much better than the original model.

Feel free to further compare the two Runs’ results on your own. Click on “original-run” and Ctrl + double-click on “retrained-run” to open both Runs in the same session. Note that comparing these Runs to “base-run” will result in the metrics being split into two different tabs, which is not ideal for comparison. This is because “base-run” is based on a different model and uses different metrics-collection settings.

On the Runs page (figure below), “2 Runs” is indicated in the upper left and in the RUN filter. To get the RUN column in the metrics table, toggle RUN in the dropdown menu opened by clicking the wrench icon on the toolbar of the METRICS panel. This RUN column makes it easy to compare metrics for the same sample between Runs. You can also sort by the Example_ID column so that rows for the same sample appear next to each other, which makes it convenient to compare them across the Runs. Filter on the “hardhat-val” table to compare the val set only. The confidence-IOU scatter chart in the figure below shows that “retrained-run” has more predictions clustered at high confidence and high IOU, which aligns with the other observations. To get this plot, select the Confidence, IOU, and Run columns in sequence and press 2.

Summary#

In this tutorial, we covered installing 3LC-integrated YOLOv8, training YOLOv8 models and collecting metrics, exploring the 3LC Dashboard, finding and fixing label errors, and analyzing Runs to see the model’s improvements. We demonstrated that models can be improved by fixing label errors.