[Paper Read] Learning From Noisy Large-Scale Datasets With Minimal Supervision

tags: course AMMAI CVPR_17

Paper
Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, Serge Belongie

Problem Definition

To leverage a small set of clean labels in the presence of a massive dataset with noisy labels.

  • Problem: large-scale training data with noisy annotations
  • Dataset: Open Images (~9 million images), multi-labeled images with over 6,000 unique classes
  • Application: object classification
  • Assumption (limitation): data with a large number of classes and a wide range of annotation noise

Contribution

  • Introduces a semi-supervised learning framework that produces both a cleaned version of the dataset and a robust multi-label image classifier, by combining a small set of clean annotations with a large amount of noisy data.

  • Outperforms direct fine-tuning approaches across all major categories in the Open Images dataset.

  • Improves performance across the full range of label noise levels (even when rated data are limited), and is most effective for classes with 20% to 80% false-positive annotations.

Method

[Figure: high-level overview of the method]

[Figure: network architecture]


Use the small clean dataset to learn a mapping between noisy and clean annotations.

  • There are two supervised networks in this model.

  • The first is the “label cleaning network”, which takes as input the set of noisy labels and image features extracted by a CNN (Inception V3), and outputs a cleaned label set that supervises the second network.

  • The label cleaning network is a residual model that learns the difference between clean and noisy labels. It uses an identity skip-connection structure (inspired by ResNet V1).

  • The second is the “multi-label classifier”, which takes the labels predicted by the first network as ground truth when an image does not have a clean label; a minimal sketch of this two-network setup follows below.
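
To make this concrete, here is a minimal PyTorch sketch of the two networks and their supervision. This is my own reconstruction, not the authors' code: the shared Inception V3 feature extractor is omitted, and the layer sizes, the clamp, and the unweighted loss sum are illustrative assumptions.

```python
# Minimal sketch of the two-network setup (assumed PyTorch implementation).
# Layer sizes, the clamp, and the loss weighting are illustrative
# assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelCleaningNetwork(nn.Module):
    """Residual model: cleaned labels = noisy labels + learned correction,
    conditioned on image features (identity skip connection)."""
    def __init__(self, num_classes, feature_dim, hidden_dim=512):
        super().__init__()
        self.correction = nn.Sequential(
            nn.Linear(num_classes + feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, noisy_labels, image_features):
        # Thanks to the skip connection, the network only has to learn
        # the *difference* between noisy and clean labels.
        delta = self.correction(torch.cat([noisy_labels, image_features], dim=1))
        return torch.clamp(noisy_labels + delta, 0.0, 1.0)

class MultiLabelClassifier(nn.Module):
    """Predicts the label set directly from image features."""
    def __init__(self, num_classes, feature_dim):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, image_features):
        return self.head(image_features)  # logits

def joint_loss(cleaner, classifier, noisy_labels, features, verified_labels=None):
    """Loss for one batch. `verified_labels` is None for images that
    only have noisy annotations."""
    cleaned = cleaner(noisy_labels, features)
    logits = classifier(features)
    if verified_labels is not None:
        # Clean subset: supervise the cleaner with the human-verified
        # labels (the paper uses an absolute-difference cleaning loss)
        # and use them as the classifier target as well.
        clean_loss = (cleaned - verified_labels).abs().mean()
        target = verified_labels
    else:
        # Noisy-only images: the cleaned labels act as ground truth
        # for the classifier; no cleaning loss is available.
        clean_loss = 0.0
        target = cleaned.detach()
    cls_loss = F.binary_cross_entropy_with_logits(logits, target)
    return cls_loss + clean_loss
```

Because of the identity skip connection, the cleaning network defaults to passing the noisy labels through unchanged, so it only needs to learn corrections where the verified subset shows the annotations are wrong.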

Result