

Oral presentation in Workshop: Shift happens: Crowdsourcing metrics and test datasets beyond ImageNet

Growing ObjectNet: Adding speech, VQA, occlusion, and measuring dataset difficulty

David Mayo · David Lu · Chris Zhang · Jesse Cummings · Xinyu Lin · Boris Katz · James Glass · Andrei Barbu


Abstract:

Building more difficult datasets is largely an ad-hoc enterprise, generally relying on scale from the web or focusing on particular domains thought to be challenging. ObjectNet is an attempt to create a more difficult dataset in a systematic way, one that eliminates biases that may artificially inflate machine performance. ObjectNet images are meant to decorrelate objects from their backgrounds and to have randomized object orientations and viewpoints. ObjectNet appears to be much more difficult for machines. Spoken ObjectNet is a retrieval benchmark constructed from spoken descriptions of ObjectNet images. These descriptions are being used to create captioning and VQA benchmarks. In each case, large performance drops were observed. The next variant of ObjectNet will focus on real-world occlusions, since models are suspected to be brittle when shown partially occluded objects. Using large-scale psychophysics on ObjectNet, we have constructed a new objective difficulty benchmark applicable to any dataset: the minimum presentation time for an image before the object contained within it can be reliably recognized by humans. This difficulty metric is well predicted by quantities computable from the activations of models, although not necessarily by their ultimate performance. We hope that this suite of benchmarks will enable more robust models, provide better images for neuroscientific and behavioral experiments, and contribute to a systematic understanding of dataset difficulty and progress in computer vision.
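The abstract defines per-image difficulty as the minimum presentation time at which humans reliably recognize the object. As a minimal sketch of how such a threshold could be estimated (not the authors' released code), one could fit a logistic psychometric curve to per-trial recognition outcomes collected at varying presentation times and read off the time at which fitted accuracy crosses a reliability criterion. The function names, the 75% criterion, and the example trial data below are illustrative assumptions.

    # Illustrative sketch: estimate the minimum presentation time for one image
    # by fitting a logistic psychometric curve over log presentation time.
    import numpy as np
    from scipy.optimize import curve_fit

    def psychometric(t, t50, slope):
        # Logistic recognition-accuracy curve over log presentation time (ms).
        return 1.0 / (1.0 + np.exp(-slope * (np.log(t) - np.log(t50))))

    def minimum_presentation_time(times_ms, correct, threshold=0.75):
        # times_ms: presentation time of each trial (e.g. 17, 33, 67, ... ms)
        # correct:  1 if the subject named the object correctly, else 0
        (t50, slope), _ = curve_fit(
            psychometric,
            np.asarray(times_ms, float),
            np.asarray(correct, float),
            p0=[100.0, 2.0],
            maxfev=10000,
        )
        # Invert the logistic to find the time at which accuracy = threshold.
        return t50 * np.exp(np.log(threshold / (1.0 - threshold)) / slope)

    # Example: trials pooled over subjects for a single image (hypothetical data).
    times = [17, 17, 33, 33, 67, 67, 133, 133, 267, 267]
    hits  = [0,  0,  0,  1,  1,  0,  1,   1,   1,   1]
    print(minimum_presentation_time(times, hits))

A maximum-likelihood fit of the psychometric function would be more principled for binary outcomes; the least-squares fit above is kept only to keep the sketch short.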
