Visual Variation Learning for Object Recognition
- We propose visual variation learning to improve object recognition with convolutional neural networks (CNNs). While a typical CNN regards visual variations as nuisances and marginalizes them out of the data, we hypothesize that some variations are informative.
- We study the impact of visual variation learning as an auxiliary task, used during training only, on classification and similarity embedding problems.
- Our key contribution is showing that, at the cost of visual variation annotations during training only, a CNN enhanced with visual variation learning learns better object representations (a minimal sketch of the training objective follows this list).
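Concretely, the auxiliary-task setup amounts to attaching a variation (e.g., pose) prediction head next to the object classifier, optimizing a weighted sum of the two losses during training, and discarding the extra head at test time. The snippet below is a minimal sketch of that idea; the ResNet-18 backbone, head names, and the weight lambda_var are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the auxiliary-task objective: an object classifier plus a
# variation (pose) head trained jointly; the variation head is discarded at test
# time. Backbone choice, head names, and lambda_var are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights=None)
feat_dim = backbone.fc.in_features
backbone.fc = nn.Identity()                # reuse the trunk as a feature extractor

object_head = nn.Linear(feat_dim, 15)      # 15 vehicle categories
variation_head = nn.Linear(feat_dim, 30)   # e.g., 30 poses; used during training only

ce = nn.CrossEntropyLoss()
lambda_var = 0.5                           # assumed weight on the auxiliary term

def training_loss(images, class_labels, pose_labels):
    feats = backbone(images)
    return ce(object_head(feats), class_labels) + \
           lambda_var * ce(variation_head(feats), pose_labels)

def predict(images):
    # Inference uses only the object head; the variation head adds no test-time cost.
    return object_head(backbone(images))
```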

Gradient-weighted class activation mapping (Grad-CAM) on iLab-2M test data. Both object instances are misclassified as car by ResNet without variation learning and correctly classified as van and monster truck by ResNet with variation learning (ResNet+Pose). The heatmaps in the second row highlight regions of the image that activate the incorrect class on ResNet, and those in the third row highlight regions that activate the correct class on ResNet+Pose. ResNet+Pose is more attuned to the shape of an object across poses and focuses on distinctive features of each category (the flat front of the van and the oversized wheels of the monster truck).
[DOI: 10.1016/j.imavis.2020.103912][Preprint]
Data
iLab-20M
The iLab-20M dataset is a large-scale controlled, parametric dataset of toy vehicle objects under variations of viewpoint, lighting, and background. The dataset is produced by placing a physical object on a turntable and using multiple cameras located on a semicircular arc over the table.
- 15 categories: boat, bus, car, equipment, f1car, helicopter, military, monster truck, pickup truck, plane, semi truck, tank, train, UFO, and van
- 718 object instances
- 88 different viewpoints (11 elevations x 8 azimuths)
- 5 lighting conditions
- 3 camera focus settings
- 14–40 background images
- 22 million images total

The iLab-20M dataset provides toy vehicle images from 15 categories under various viewpoints, lighting conditions, and backgrounds.
[DOI: 10.1109/CVPR.2016.244] [Open Access]
iLab-80M
The iLab-80M is an augmented version of iLab-20M obtained by adding random crops and scales. The augmentation also balances the number of images per category, resulting in 5.5 million images per category and a total of 82.7 million images for the whole set.
In addition, the original 960x720 images are cropped around each object and rescaled to 256x256.
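For readers who want to mirror this preprocessing on raw frames, the snippet below is a rough sketch of the crop-and-rescale step; the bounding-box source and the padding margin are assumptions, and the released iLab-80M images already come cropped and rescaled.

```python
# Rough sketch of cropping a 960x720 frame around an object and resizing to
# 256x256 (Pillow). The bounding box and margin values are illustrative.
from PIL import Image

def crop_and_rescale(path, bbox, margin=16, size=(256, 256)):
    """bbox = (left, top, right, bottom) around the object, in pixels."""
    img = Image.open(path)
    left, top, right, bottom = bbox
    box = (max(left - margin, 0), max(top - margin, 0),
           min(right + margin, img.width), min(bottom + margin, img.height))
    return img.crop(box).resize(size)
```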
iLab-2M
The iLab-2M is a subset of iLab-80M sampled for the experiments conducted in this work. In iLab-2M, images vary only in pose, while other visual variations are held constant.
- 30 poses (5 elevations x 6 azimuths)
- 1.2M training images, 270K validation images, 270K test images

iLab-2M-Light
The iLab-2M-Light is an extension of the iLab-2M that includes lighting conditions as an additional visual variation.
- 30 poses (same as iLab-2M)
- 5 lighting conditions
- 1.36M training images, 316K validation images, 316K test images
Methods
1. Variation Classification

A multi-layer variation-injected CNN built on top of AlexNet. AlexNet is enclosed inside the left box. All additional components on the right are part of the variation classification module. Dotted lines represent injection connections that connect intermediate conv/pool layers to transformation (fc) units. Outputs of transformation units are summed into variation scores.
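The sketch below is a hedged PyTorch rendition of this injection idea using torchvision's AlexNet: selected conv/pool outputs are globally pooled, passed through small transformation (fc) units, and the unit outputs are summed into variation scores, while the original classifier continues to produce object-class scores. The tapped layer indices, unit sizes, and class/variation counts are assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a variation-injected AlexNet: intermediate conv/pool
# activations are routed through transformation (fc) units whose outputs are
# summed into variation (pose) scores. Tapped layers and sizes are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class VariationInjectedAlexNet(nn.Module):
    def __init__(self, num_classes=15, num_variations=30, tap_indices=(5, 12)):
        super().__init__()
        alexnet = models.alexnet(weights=None)
        self.features = alexnet.features                     # conv/pool trunk
        self.avgpool = alexnet.avgpool
        self.classifier = alexnet.classifier
        self.classifier[-1] = nn.Linear(4096, num_classes)   # object-class head
        self.tap_indices = list(tap_indices)                  # tapped conv/pool layers (assumed)
        # Probe the trunk once to find the channel count at each tapped layer.
        channels = {}
        with torch.no_grad():
            x = torch.zeros(1, 3, 224, 224)
            for i, layer in enumerate(self.features):
                x = layer(x)
                if i in self.tap_indices:
                    channels[i] = x.shape[1]
        self.gap = nn.AdaptiveAvgPool2d(1)
        # One transformation (fc) unit per injection connection.
        self.transform_units = nn.ModuleDict(
            {str(i): nn.Linear(channels[i], num_variations) for i in self.tap_indices}
        )

    def forward(self, x):
        variation_scores = 0
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.tap_indices:
                pooled = self.gap(x).flatten(1)               # pool the tapped feature map
                variation_scores = variation_scores + self.transform_units[str(i)](pooled)
        class_scores = self.classifier(torch.flatten(self.avgpool(x), 1))
        return class_scores, variation_scores                 # variation scores: training only
```

During training, a cross-entropy loss over the summed variation scores would be added to the object-classification loss, as in the objective sketch above; at test time only the class scores are used, matching the training-only role of variation learning.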
2. Variation Embedding Learning

A conceptual diagram of an embedding network. The main AlexNet network takes an anchor input and produces class probability scores. In the Siamese setting, the second network in the middle takes a second input of the same object instance with either a similar or a dissimilar pose. In the triplet setting, the second and third networks take one similar-pose input and one dissimilar-pose input. Each auxiliary network outputs an embedding vector representing a point in the pose embedding space.
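A minimal PyTorch sketch of the triplet variant is given below: the anchor branch produces object-class scores, while a pose-embedding head maps the anchor, similar-pose, and dissimilar-pose inputs into the embedding space and is trained with a triplet loss. Weight sharing across branches, the 128-dimensional embedding, the margin, and the loss weight are assumptions; the Siamese variant would instead use a contrastive loss on pairs.

```python
# Hedged sketch of the triplet setting: object classification on the anchor plus
# a pose-embedding triplet loss over (anchor, similar pose, dissimilar pose).
# Shared weights, embedding size, margin, and lambda_pose are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.alexnet(weights=None)
backbone.classifier[-1] = nn.Linear(4096, 15)        # object-class scores (15 categories)
embed_head = nn.Linear(256 * 6 * 6, 128)             # pose-embedding head (assumed 128-d)

def features(x):
    """Shared AlexNet trunk: conv features, pooled and flattened."""
    return torch.flatten(backbone.avgpool(backbone.features(x)), 1)

ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=1.0)           # assumed margin
lambda_pose = 0.5                                    # assumed auxiliary-loss weight

def training_loss(anchor, similar, dissimilar, class_labels):
    f_anchor = features(anchor)
    class_scores = backbone.classifier(f_anchor)     # main branch: object classification
    # Auxiliary branches (training only): the same trunk and head applied to the
    # similar-pose and dissimilar-pose views of the same object instance.
    e_anchor = embed_head(f_anchor)
    e_similar = embed_head(features(similar))
    e_dissimilar = embed_head(features(dissimilar))
    return ce(class_scores, class_labels) + \
           lambda_pose * triplet(e_anchor, e_similar, e_dissimilar)
```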