NVIDIA TAO Toolkit: How to Build a Data-Centric Pipeline to Improve Model Performance — Part 3 of 3

February 29, 2024
8 min read

During this series, we will use 91ƵAPP to build a Data-Centric pipeline to debug and fix a model trained with the NVIDIA TAO Toolkit.

Part 1. We demystify the NVIDIA ecosystem and define a Data-Centric pipeline based on a model trained with the NVIDIA TAO framework.

Part 2. Using the 91ƵAPP API, we show you how to upload a dataset and a model to the 91ƵAPP platform.

Part 3 (current). We identify failures in our model due to data issues, fix these failures and improve our model’s performance with the fixed dataset.

💡Don’t miss our previous series on the NVIDIA TAO Toolkit.

Table of Contents

  1. Recap from Part 2
  2. Finding model failures
  3. Improving performance by fixing a dataset
  4. Conclusion

1. Recap from Part 2

In the second post of this series we focused on two main things:

  1. We set up a data storage provider.
  2. We established a workflow to ingest our data into the 91ƵAPP platform using the 91ƵAPP API.

Before continuing, make sure that you have the following:

  • A dataset and a model on your 91ƵAPP account (see Figure 1)
Figure 1. A 91ƵAPP account with a dataset and a model after being ingested

2. Finding model failures

Figure 2. 91ƵAPP Data-Centric pipeline roadmap

2.1 mAP Analysis

In the first stage of our Data-Centric pipeline we ingested our dataset (see Figure 2). Now, let’s focus on the next stage: mAP analysis.

💡For a refresher on mean average precision (mAP), see our earlier post, where we carefully explain how this metric is computed and clear up a few common misunderstandings around it.

Figure 3. Mean average precision (mAP) of the crosswalks class

On the 91ƵAPP dashboard, we select Model Comparison in the left menu. From there we quickly obtain a picture of the per-class performance, as shown in Figure 3.

The very first insight we obtain from this view is the poor performance of the “crosswalks” class:

  • The “crosswalks” class has the lowest performance: ~0.26 mAP.
  • The “crosswalks” class has the lowest recall: ~0.40 mAR.
  • Against 154 annotations for the “crosswalks” class, the model predicts 170 objects in total.
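
The Model Comparison view gives us these numbers directly, but per-class mAP and mAR can also be reproduced outside the dashboard with standard tooling. Below is a minimal sketch using pycocotools on COCO-format files; the file names (annotations.json, predictions.json) are placeholders for illustration.

```python
# Minimal sketch: per-class mAP / mAR with pycocotools on COCO-format files.
# File names and the class name below are placeholders for illustration.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations.json")             # ground-truth annotations
coco_dt = coco_gt.loadRes("predictions.json")  # model predictions (COCO results format)

# Restrict the evaluation to a single class, e.g. "crosswalks"
cat_ids = coco_gt.getCatIds(catNms=["crosswalks"])

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.params.catIds = cat_ids
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# stats[0] is mAP@[0.50:0.95]; stats[8] is mAR with up to 100 detections per image
print(f"crosswalks mAP: {evaluator.stats[0]:.2f}, mAR: {evaluator.stats[8]:.2f}")
```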

Based on these findings, we can start to make some hypotheses 👩‍🔬:

  • The model has both a precision and a recall problem: it erroneously predicts a high number of “crosswalks” objects, and it fails to detect many ground truth “crosswalks” objects.
  • This model behaviour could be attributed to some of the following issues: label quality, edge cases, or mispredictions, among others.

2.2 Data imbalance

Now, to evaluate data imbalance issues, let’s first verify the distribution of our annotations and our predictions.

Figure 4. Frequency of annotations and predictions

  • As reflected in Figure 4, the dataset is imbalanced overall. For instance, the “vehicles” class contains 10 times more annotations than the “bicycles” class.
  • Despite the imbalance, the class with the lowest number of annotations (i.e., “bicycles” class) performs better than the “crosswalks” class.
  • The “crosswalks” class is on par with the “motorcycles” class in number of annotations, yet the performance of the former is considerably lower: the model gets the majority of the 154 “crosswalks” objects wrong.
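
As a quick sanity check, the same class distribution can be derived directly from a COCO-format annotations file. A minimal sketch, assuming a file named annotations.json with the standard “annotations” and “categories” fields:

```python
# Minimal sketch: count annotations per class from a COCO-format file.
# "annotations.json" is a placeholder path for illustration.
import json
from collections import Counter

with open("annotations.json") as f:
    coco = json.load(f)

id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])

for name, count in counts.most_common():
    print(f"{name:12s} {count}")
```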

🤔 Can we simply add more samples of the “crosswalks” class to balance our data? The answer is NO. We need to add the right kind of “crosswalks” examples (or we risk obtaining the same performance). Before identifying the kind of samples we need, we must find out why our model fails to predict and detect this class.

💡Hint: We often see an imbalance issue and believe that the first thing we need to do is apply data augmentation techniques. But, contrary to popular belief, this is often the wrong approach. We show one example of this scenario in an earlier post.

2.3 Class selection

Once we have run the previous two stages of our Data-Centric pipeline, we can see with more clarity where and how to start:

  • We will fix the “crosswalks” class because it has the widest room for improvement in performance (i.e., the 80/20 rule: a small, targeted fix can yield the largest reward).
  • We can analyze the False Negative and the False Positive examples to form hypotheses about what is wrong with our “crosswalks” class.
  • We will search for label quality issues reflected in errors such as mispredictions or mislabelled examples.

Now, imagine we have built our own mAP and mAR tooling, or let’s say we are using W&B to plot precision-recall curves. We might have arrived at similar conclusions as in Sections 2.1 and 2.2, but now, how can we find the root causes of our data issues?

2.4 Failure inspection

Figure 5. Multi-Class confusion matrix for object detection

The 91ƵAPP multi-class confusion matrix provides a summary of errors such as false negatives, false positives, and mispredictions.

In Figure 5, we observe the following:

  • 84 “crosswalks” objects are undetected (i.e., false negatives).
  • 107 “crosswalks” objects are incorrectly predicted (i.e., false positives).
  • 6 “crosswalks” objects are mispredicted.
  • 63 “crosswalks” objects are true positives.
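
These buckets follow the usual matching logic for object detection: a prediction that overlaps a same-class ground-truth box above an IoU threshold is a true positive, one that overlaps a different-class box is a misprediction, an unmatched prediction is a false positive, and an unmatched ground-truth box is a false negative. Below is an illustrative sketch of that logic (not 91ƵAPP’s internal implementation; the 0.5 IoU threshold is an assumption):

```python
# Illustrative sketch of the matching logic behind the error buckets above.
# Boxes are (x1, y1, x2, y2); the 0.5 IoU threshold is an assumption.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def bucket_errors(predictions, ground_truths, iou_thr=0.5):
    """predictions / ground_truths: lists of dicts with 'box' and 'label' keys."""
    matched_gt = set()
    buckets = {"true_positive": 0, "misprediction": 0, "false_positive": 0}
    for pred in predictions:
        # Find the best-overlapping, still-unmatched ground-truth box
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(ground_truths):
            if i in matched_gt:
                continue
            overlap = iou(pred["box"], gt["box"])
            if overlap > best_iou:
                best_iou, best_idx = overlap, i
        if best_idx is not None and best_iou >= iou_thr:
            matched_gt.add(best_idx)
            if pred["label"] == ground_truths[best_idx]["label"]:
                buckets["true_positive"] += 1
            else:
                buckets["misprediction"] += 1
        else:
            buckets["false_positive"] += 1
    # Ground-truth boxes that no prediction matched are undetected objects
    buckets["false_negative"] = len(ground_truths) - len(matched_gt)
    return buckets
```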

Let’s analyze the false negative examples using the Data Explorer:

Figure 6. Analyzing undetected objects in our dataset

Figure 6 provides key insights into the types of undetected objects the model is failing on:

  • Samples with fading crosswalks.
  • Crosswalks that are in a diagonal position.
  • Crosswalks that are behind other objects (i.e., occlusion).

2.5 Using multi-modal search to find similar errors

Next, let’s double down on the fading crosswalks as one of the potential data issues!

Figure 7. Using text search to retrieve “fading crosswalks”

In Figure 7, we use a combination of error filters and text search to retrieve similar examples. These retrieved examples are grouped in a Data Slice (see Section 2.6 for more details).

We can observe that many of the retrieved examples are in fact undetected objects.
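
Under the hood, this kind of text search typically relies on a joint text-image embedding model such as CLIP: the images (or object crops) and the text query are embedded into the same space and ranked by cosine similarity. Here is a minimal sketch using the openai/clip-vit-base-patch32 checkpoint from Hugging Face Transformers; the image paths and the query are placeholders, and this is an illustration rather than 91ƵAPP’s actual implementation.

```python
# Minimal sketch of text-based image retrieval with CLIP (Hugging Face Transformers).
# Image paths and the query are placeholders; not 91ƵAPP's actual implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["crop_001.jpg", "crop_002.jpg", "crop_003.jpg"]  # e.g. crosswalk crops
images = [Image.open(p) for p in image_paths]

inputs = processor(
    text=["a faded crosswalk on a road"],
    images=images,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine similarity between the text query and every image, highest first
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(-1)

for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```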

Our model is failing to detect fading crosswalks, but the interesting question now is: how much of our model’s poor performance can we attribute to fading crosswalks?

Let’s use 91ƵAPP Data Slices to answer the above question!

2.6 91ƵAPP Data Slices: evaluate performance on a subset of our data

After forming one potential h