We consider the problem of segmenting image regions given a natural language phrase. We introduce a larger-scale dataset that covers more concepts and supports more flexible target regions than existing referring-expression datasets. Our modular method treats things and stuff uniformly and improves performance on rare categories.
We use attribute phrases to describe fine-grained visual differences between instances. We collect a large dataset, design listener and speaker models that play the reference game, and apply the trained models to fine-grained classification and to describing the differences between categories.