Reasoning about Fine-grained Attribute Phrases using Reference Games




We present a framework for learning to describe fine-grained visual differences between instances using attribute phrases. Attribute phrases capture distinguishing aspects of an object (e.g., "propeller on the nose" or "door near the wing" for airplanes) in a compositional manner. Instances within a category can be described by a set of these phrases and collectively they span the space of semantic attributes for a category. We collect a large dataset of such phrases by asking annotators to describe several visual differences between a pair of instances within a category. We then learn to describe and ground these phrases to images in the context of a reference game between a speaker and a listener. The goal of a speaker is to describe attributes of an image that allows the listener to correctly identify it within a pair. Data collected in a pairwise manner improves the ability of the speaker to generate, and the ability of the listener to interpret visual descriptions. Moreover, due to the compositionality of attribute phrases, the trained listeners can interpret descriptions not seen during training for image retrieval, and the speakers can generate attribute-based explanations for differences between previously unseen categories. We also show that embedding an image into the semantic space of attribute phrases derived from listeners offers 20% improvement in accuracy over existing attribute-based representations on the FGVC-aircraft dataset.


Reasoning about Fine-grained Attribute Phrases using Reference Games,
Jong-Chyi Su*, Chenyun Wu*, Huaizu Jiang, and Subhransu Maji
International Conference on Computer Vision (ICCV), 2017
pdf, arXiv, supplementary material (4.1MB), bibtex, poster (16MB), slides (12MB)

Dataset and Source Code

github page:

More Figures of Our Results in High Resolution

More examples from our dataset
More results of generated attribute phrases
Set-wise attribute phrases (11MB)
t-SNE embedding of attribute phrases with simple listener model
t-SNE embedding of attribute phrases with discerning listener model
t-SNE embedding of images with simple listener model

Image Retrieval Demo

Click here for image retrieval demo with attribute phrases using the listener model


This research was supported in part by the NSF grants 1617917 and 1661259, and a faculty gift from Facebook. The experiments were performed using equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Tech Collaborative and GPUs donated by NVIDIA.